Data mart outsourcing – DBMS 2 : DataBase Management System Services

Notes on the transition to the cloud

Curt Monash — Tue, 04 Oct 2016 02:22:21 +0000

1. The cloud is super-hot. Duh. And so, like any hot buzzword, “cloud” means different things to different marketers. Four of the biggest things that have been called “cloud” are:

The Amazon cloud, Microsoft Azure, and their competitors, aka public cloud.
Software as a service, aka SaaS.
Co-location in off-premises data centers, aka colo.
On-premises clusters (truly on-prem or colo as the case may be) designed to run a broad variety of applications, aka private cloud.

Further, there’s always the idea of hybrid cloud, in which a vendor peddles private cloud systems (usually appliances) running similar technology stacks to what they run in their proprietary public clouds. A number of vendors have backed away from such stories, but a few are still pushing it, including Oracle and Microsoft.

This is a good example of Monash’s Laws of Commercial Semantics.

2. Due to economies of scale, only a few companies should operate their own data centers, aka true on-prem(ises). The rest should use some combination of colo, SaaS, and public cloud.

This fact now seems to be widely understood.

3. The public cloud is a natural fit for those use cases in which elasticity truly matters. Many websites and other consumer internet backends have that characteristic. Such systems are often also a good fit for cloud technologies in general.

This is frequently a good reason for new — i.e. “greenfield” — apps to run in the cloud.

4. Security and privacy can be concerns in moving to the cloud. But I’m hearing that more and more industries are overcoming those concerns.

In connection to that point, it might be interesting to note:

In the 1960s and 1970s, one of the biggest industries for remote computing services — i.e. SaaS — was commercial banking.
Other big users were hospitals and stockbrokers.
The US intelligence agencies are building out their own shared, dedicated cloud.

5. Obviously, Amazon is the gorilla in the cloud business. Microsoft Azure gets favorable mentions as well. I don’t hear much about other public cloud providers, however, except that there are a lot of plans to support Google’s cloud just in case.

In particular, I hear less than I expected to about public clouds run by national-champion telecom companies around the world.

6. It’s inconvenient for an application vendor to offer both traditional and SaaS versions of a product. Release cycles and platform support are different in the two cases. But there’s no reason a large traditional application vendor couldn’t pull it off, and the largest are already more or less claiming to. Soon, this will feel like a market necessity across the board.

7. The converse is less universally true. However, some SaaS vendors do lose out from their lack of on-premises options. Key considerations include:

Does your application need to run close to your customers’ largest databases?
Do your customers still avoid the public cloud?

If both those things are true, and you don’t have an on-premises option, certain enterprises are excluded from your addressable market.

8. Line-of-business departments are commonly more cloud-friendly than central IT is. Reasons include:

Departments don’t necessarily see central IT as any “closer” to them than the cloud is.
Departments don’t necessarily care about issues that give central IT pause.
Departments sometimes buy things that only are available via remote delivery, e.g. narrowly focused SaaS applications or market data.

I discussed some of this in my recent post on vendor lock-in.

9. When the public cloud was younger, it had various technological limitations. You couldn’t easily get fast storage like flash. You couldn’t control data movement well enough for good MPP (Massively Parallel Processing) in use cases like analytic SQL.

Those concerns seem to have been largely alleviated.

10. It takes a long time for legacy platforms to be decommissioned. At some enterprises, however, that work has indeed been going on for a long time, via virtualization.

11. If you think about system requirements:

There is a lot of computing power in devices that may be regarded as IoT nodes — phones, TV boxes, thermostats, cars, industrial equipment, sensors, etc. Client-side computing is getting ever more diverse.
Server-side computing, however, is more homogeneous. Enterprises can, should and likely will meet the vast majority of their server requirements on a relatively small number of clusters each.

I argued the latter point in my 2013 post on appliances, clusters, and clouds, using terminology and reasoning that are now only slightly obsolete.

So what will those clusters be? Some will be determined by app choices. Most obviously, if you use SaaS, the SaaS vendor decides which cloud(s) your data is in. And if you’re re-hosting legacy systems via virtualization, that’s another cluster.

Otherwise, clusters will probably be organized by database, in the most expansive sense of the term. For example, there could be separate clusters for:

Operational data managed by your general-purpose RDBMS (Oracle, SQL Server, DB2, whatever).
Relational data warehousing, whether in an analytic RDBMS or otherwise.
Log files, perhaps managed in Hadoop or Splunk.
Your website and other internet back-ends, perhaps running over NoSQL data stores.
Text documents managed by some kind of search engine.
Media block or object storage, if the organization’s audio/video/whatever would overwhelm a text search engine. (Text search or document management systems can often also handle low volumes of non-text media.)

Indeed, since computing is rarely as consolidated as CIOs dream of it being, a large enterprise might have several clusters for any of those categories — each running different software for data and storage management — with different deployment choices among colo, true on-prem, and true cloud.

Differentiation in business intelligence

Curt Monash — Mon, 26 Oct 2015 19:34:09 +0000

Parts of the business intelligence differentiation story resemble the one I just posted for data management. After all:

Both kinds of products query and aggregate data.
Both are offered by big “enterprise standard” behemoth companies and also by younger, nimbler specialists.
You really, really, really don’t want your customer data to leak via a security breach in either kind of product.

That said, insofar as BI’s competitive issues resemble those of DBMS, they are those of DBMS-lite. For example:

BI is less mission-critical than some other database uses.
BI has done a lot less than DBMS to deal with multi-structured data.
Scalability demands on BI are less than those on DBMS — indeed, they’re the ones that are left over after the DBMS has done its data crunching first.

And full-stack analytic systems — perhaps delivered via SaaS (Software as a Service) — can moot the BI/data management distinction anyway.

Of course, there are major differences between how DBMS and BI are differentiated. The biggest are in user experience. I’d say:

For many people, BI is the user experience over the underlying data store(s).
Two crucial aspects of user experience are navigational power and speed of response.
- At one extreme, people hated the old green paper reports.
- At the other, BI in the QlikView/Tableau era is one of the few kinds of enterprise software that competes on the basis of being
- This is also somewhat true with respect to snazzy BI demos, such as interactive maps or way-before-their-day touch screens.*
Features like collaboration and mobile UIs also matter.
Since BI is commonly adopted via quick departmental projects — at least as the hoped-for first-step of a “land-and-expand” campaign — administrative usability is at a premium as well.

* Computer Pictures and thus Cullinet used a touch screen over 30 years ago. Great demo, but not so useful as an actual product, due to the limitations on data structure.

Where things get tricky is in my category of accuracy. In the early 2000s, I pitched and wrote a white paper arguing that BI helps bring “integrity” to an enterprise in various ways. But I don’t think BI vendors have done a good job of living up to that promise.

They’ve moved slowly in accuracy-intensive areas such as alerting or predictive modeling.
“Single source of truth” and similar protestations turned out to be much oversold.

Indeed, it’s tempting to say that business intelligence has been much too stupid. I really like some attempts to make BI sharper, e.g. at Rocana or ClearStory, but it remains to be seen whether many customers care about their business intelligence actually being smart.

So how does all this fit into my differentiation taxonomy/framework? Referring liberally to what has already been written above, we get:

Scope:
- For traditional tabular analysis, BI products compete on a bunch of UI features.
- Non-tabular analysis is much more primitive. Event series interfaces may be the closest thing to an exception.
- Collaboration is in the mix as well.
Accuracy: I discussed this one above.
Other trustworthiness:
- Security is a big deal.
- Mission-critical robustness is usually, in truth, just a nice-to-have. But some (self-)important executives may disagree.
Speed:
- For some functionality — e.g. cross-database joins — BI tools almost have to rely on their own DBMS-like engines for performance.
- For others it’s more optional. You can run a single-RDBMS query straight against the underlying system, or you can pre-position some of the data in memory.
- Please also see the adoption and administration section below.
User experience: I discussed this one above.
Adoption and administration:
- When BI is “owned” by a department, especially one that also doesn’t manage the underlying data, set-up and administration need to be super-easy.
- Sometimes, departmental BI is used as an excuse to pressure central IT into making data available.
- Much like analytic DBMS, BI adoption can sometimes be tied to huge first-time-data-warehouse building projects.
- Administration of big enterprise-standard BI is, to re-use a term, much like DBMS-lite.
Cost: The true cost of BI usage is commonly governed more by the underlying data management (and data acquisition) than by the BI software (and supporting servers) itself. That said:
- BI “hard” costs — licenses, servers, cloud fees, whatever — commonly have to fit into departmental budgets.
- So do BI people costs.
- BI people requirements also often have to fit into departmental skillsets.

Snowflake Computing

Curt Monash — Wed, 22 Oct 2014 08:45:50 +0000

I talked with the Snowflake Computing guys Friday. For starters:

Snowflake is offering an analytic DBMS on a SaaS (Software as a Service) basis.
The Snowflake DBMS is built from scratch (as opposed to, for example, being based on PostgreSQL or Hadoop).
The Snowflake DBMS is columnar and append-only, as has become common for analytic RDBMS.
Snowflake claims excellent SQL coverage for a 1.0 product.
Snowflake, the company, has:
- 50 people.
- A similar number of current or past users.
- 5 referenceable customers.
- 2 techie founders out of Oracle, plus Marcin Zukowski.
- Bob Muglia as CEO.

Much of the Snowflake story can be summarized as cloud/elastic/simple/cheap.*

*Excuse me — inexpensive. Companies rarely like their products to be labeled as “cheap”.

In addition to its purely relational functionality, Snowflake accepts poly-structured data. Notes on that start:

Ingest formats are JSON, XML or AVRO for now.
I gather that the system automagically decides which fields/attributes are sufficiently repeated to be broken out as separate columns; also, there’s a column for the documents themselves.

I don’t know enough details to judge whether I’d call that an example of schema-on-need.

A key element of Snowflake’s poly-structured data story seems to be lateral views. I’m not too clear on that concept, but I gather:

A lateral view is something like a join on a table function, inner or outer join as the case may be.
“Lateral view” is an Oracle term, while “Cross apply” is the term for the same thing in Microsoft SQL Server.
Lateral views are one of the ways of making SQL handle hierarchical data structures (others evidently are WITH and CONNECT BY).

Lateral views seem central to how Snowflake handles nested data structures. I presume Snowflake also uses or plans to use them in more traditional ways (subqueries, table functions, and/or complex FROM clauses).
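
To make the "join on a table function" description above concrete, here is a minimal Python sketch of the semantics only. The table, the field names, and the `flatten` helper are all invented for illustration; none of this is Snowflake's syntax or implementation, and real SQL engines differ in the details.

```python
# Hypothetical rows: one relational column plus a nested, JSON-like document.
orders = [
    {"order_id": 1, "doc": {"items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]}},
    {"order_id": 2, "doc": {"items": []}},
]


def flatten(row):
    """Table function: emit one output row per nested item of the input row."""
    for item in row["doc"]["items"]:
        yield {"sku": item["sku"], "qty": item["qty"]}


def lateral_join(rows, table_fn, null_row=None):
    """Join each row with the rows the table function produces for that row.

    With null_row=None this behaves like an inner join: parents with no
    children disappear. Passing a null_row gives the outer variant, which
    keeps such parents and pads the child columns.
    """
    out = []
    for row in rows:
        produced = list(table_fn(row))
        if not produced and null_row is not None:
            produced = [null_row]
        for child in produced:
            out.append({"order_id": row["order_id"], **child})
    return out


inner = lateral_join(orders, flatten)
outer = lateral_join(orders, flatten, null_row={"sku": None, "qty": None})
```

In this sketch, `inner` holds two rows (both from order 1), while `outer` also keeps order 2 with empty child columns. That is the sense in which a lateral view turns a hierarchical document into something ordinary relational operators can work on.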

If anybody has a good link explaining lateral views, please be so kind as to share! Elementary googling isn’t turning much up, and the Snowflake folks didn’t send over anything clearer than this and this.

Highlights of Snowflake’s cloud/elastic/simple/inexpensive story include:

Snowflake’s product is SaaS-only for the foreseeable future.
Data is stored in compressed 16 megabyte files on Amazon S3, and pulled into Amazon EC2 servers for query execution on an as-needed basis. Allegedly …
… this makes data storage significantly cheaper than it would be in, for example, an Amazon version of HDFS (Hadoop Distributed File System).
When you fire up Snowflake, you get a “virtual data warehouse” across one or more nodes. You can have multiple “virtual data warehouses” accessing identical or overlapping sets of data. Each of these “virtual data warehouses” has a physical copy of the data; i.e., this is not related to the Oliver Ratzesberger concept of a virtual data mart defined by workload management.
Snowflake has no indexes. It does have zone maps, aka data skipping. (Speaking of simple/inexpensive — both those aspects remind me of Netezza.)
Snowflake doesn’t distribute data on any kind of key. I.e. it’s round-robin. (I think that’s accurate; they didn’t have time to get back to me and confirm.)
This is not an in-memory story. Data pulled onto Snowflake’s EC2 nodes will commonly wind up in their local storage.
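
Zone maps (data skipping) are simple enough to sketch. The idea is to keep the minimum and maximum of a column for each file, and to skip any file whose range cannot satisfy the predicate. The file names and values below are made up, and this is a generic illustration rather than a description of Snowflake's actual file format.

```python
# Hypothetical immutable files, each holding some values of one column.
files = {
    "f1": [1, 5, 9],
    "f2": [10, 14, 19],
    "f3": [20, 22, 30],
}


def build_zone_map(files):
    """Record (min, max) per file; this is all the 'index' there is."""
    return {name: (min(values), max(values)) for name, values in files.items()}


def scan_between(files, zone_map, lo, hi):
    """Return (files actually read, matching values) for lo <= value <= hi."""
    scanned, hits = [], []
    for name, (zmin, zmax) in zone_map.items():
        if zmax < lo or zmin > hi:
            continue  # The whole file is out of range, so it is never fetched.
        scanned.append(name)
        hits.extend(v for v in files[name] if lo <= v <= hi)
    return scanned, hits


zone_map = build_zone_map(files)
scanned, hits = scan_between(files, zone_map, 12, 21)
```

Here the query for values between 12 and 21 never touches `f1`. In an architecture where files sit in remote object storage, skipping a file plausibly means skipping a network fetch as well, which is why the technique pairs naturally with the storage design described above.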

Snowflake pricing is based on the sum of:

Per EC2 server-hour, for a couple classes of node.
Per S3 terabyte-month of compressed storage.

Right now the cheaper class of EC2 node uses spinning disk, while the more expensive uses flash; soon they’ll both use flash.
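
The pricing structure above is just a sum of compute and storage terms. A small sketch follows; every rate and usage figure in it is a placeholder chosen for easy arithmetic, not an actual Snowflake or Amazon price, and the node class names are invented.

```python
def monthly_bill(node_hours, hourly_rate, compressed_tb, storage_rate_per_tb_month):
    """Sum of per-server-hour charges (by node class) and per-TB-month storage."""
    compute = sum(hours * hourly_rate[node_class]
                  for node_class, hours in node_hours.items())
    storage = compressed_tb * storage_rate_per_tb_month
    return compute + storage


# Placeholder inputs (assumptions, for illustration only).
bill = monthly_bill(
    node_hours={"standard": 100, "flash": 50},   # server-hours used this month
    hourly_rate={"standard": 1.0, "flash": 2.0}, # made-up price per server-hour
    compressed_tb=10,                            # compressed data held in S3
    storage_rate_per_tb_month=30.0,              # made-up price per TB-month
)
```

Note that storage is billed on compressed size, so the compression ratio feeds directly into the second term.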

DBMS 1.0 versions are notoriously immature, but Snowflake seems — or at least seems to think it is — further ahead than is typical.

Snowflake’s optimizer is fully cost-based.
Snowflake thinks it has strong SQL coverage, including a large fraction of SQL 2003 Analytics. Apparently Snowflake has run every TPC-H and TPC-DS query in-house, except that one TPC-DS query relied on a funky rewrite or something like that.
Snowflake bravely thinks that it’s licked concurrency from Day 1; you just fire up multiple identical virtual DWs if needed to handle the query load. (Note: The set of Version 1 DBMS without concurrent-usage bottlenecks has cardinality very close to 0.)
Similarly, Snowflake encourages you to fire up a separate load-only DW instance, and load mainly through trickle feeds.
Snowflake’s SaaS-only deployment obviates — or at least obscures — a variety of management, administration, etc. features that often are lacking in early DBMS releases.

Other DBMS technology notes include:

Compression is columnar (various algorithms, including file-at-a-time dictionary/token).
Joins and other database operations are performed on compressed data. (Yay!)
Those 16-megabyte files are column-organized and immutable. This strongly suggests which kinds of writes can or can’t be done efficiently. Note that adding a column — perhaps of derived data — is one of the things that could go well.
There’s some kind of conflict resolution if multiple virtual DWs try to write the same records — but as per the previous point, the kinds of writes for which that’s an issue should be rare anyway.
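
As a rough illustration of dictionary/token compression and of operating on compressed data, here is a minimal Python sketch. It is a generic textbook version under invented data, not Snowflake's algorithm: each distinct value gets a small integer code, and an equality predicate is evaluated against the codes without decoding the column.

```python
def dictionary_encode(column):
    """Replace each value with an integer code; return (dictionary, codes)."""
    dictionary = {}
    codes = []
    for value in column:
        # First occurrence of a value gets the next unused code.
        code = dictionary.setdefault(value, len(dictionary))
        codes.append(code)
    return dictionary, codes


def count_equal(dictionary, codes, literal):
    """Count rows equal to `literal` by comparing codes, never decoding."""
    code = dictionary.get(literal)
    if code is None:
        return 0  # The literal does not occur in this file at all.
    return sum(1 for c in codes if c == code)


# Hypothetical column values for one file.
state = ["CA", "NY", "CA", "TX", "CA"]
dictionary, codes = dictionary_encode(state)
ca_rows = count_equal(dictionary, codes, "CA")
wa_rows = count_equal(dictionary, codes, "WA")
```

Because the file is immutable, its dictionary never has to be updated in place, which fits the append-only design noted earlier.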

In the end, a lot boils down to how attractive Snowflake’s prices wind up being. What I can say now is:

I don’t actually know Snowflake’s pricing …
… nor the amount of work it can do per node.
It’s hard to imagine that passing queries from EC2 to S3 is going to give great performance. So Snowflake is more likely to do well when whatever parts of the database wind up being “cached” in the flash of the EC2 servers suffice to answer most queries.
In theory, Snowflake could offer aggressive loss-leader pricing for a while. But nobody should make a major strategic bet on Snowflake’s offerings unless it shows it has a sustainable business model.

Data as an asset

Curt Monash — Mon, 22 Sep 2014 03:49:00 +0000

We all tend to assume that data is a great and glorious asset. How solid is this assumption?

Yes, data is one of the most proprietary assets an enterprise can have. Any of the Goldman Sachs big three* — people, capital, and reputation — are easier to lose or imitate than data.
In many cases, however, data’s value diminishes quickly.
Determining the value derived from owning, analyzing and using data is often tricky — but not always. Examples where data’s value is pretty clear start with:
- Industries which long have had large data-gathering research budgets, in areas such as clinical trials or seismology.
- Industries that can calculate the return on mass marketing programs, such as internet advertising or its snail-mail predecessors.

*“Our assets are our people, capital and reputation. If any of these is ever diminished, the last is the most difficult to restore.” I love that motto, even if Goldman Sachs itself eventually stopped living up to it. If nothing else, my own business depends primarily on my reputation and information.

This all raises the idea — if you think data is so valuable, maybe you should get more of it. Areas in which enterprises have made significant and/or successful investments in data acquisition include:

Actual scientific, clinical, seismic, or engineering research.
Actual selling of (usually proprietary) data, with the straightforward economic proposition of “Get once, sell to multiple customers more cheaply than they could get it themselves.” Examples start:
- This is the essence of the stock quote business. And Michael Bloomberg started building his vast fortune by adding additional data to what the then-incumbents could offer, for example by getting fixed-income prices from Cantor Fitzgerald.*
- Multiple marketing-data businesses operate on this model.
- Back when there was a small but healthy independent paper newsletter and directory business, its essence was data.
- And now there are many online data selling efforts, in niches large and small.
Internet ad-targeting businesses. Making money from your great ad-targeting technology usually involves access to lots of user-impression and de-anonymization data as well.
Aggressive testing by internet businesses, of substantive offers and marketing-display choices alike. At the largest, such as eBay, you’ll rarely see a page that doesn’t have at least one experiment on it. Paper-based direct marketers take a similar approach. Call centers perhaps should follow suit more than they do.
Surveys, focus groups, etc. These are commonly expensive and unreliable (and the cheap internet ones commonly irritate people who do business with you). But sometimes they are, or seem to be, the only kind of information available.
Free-text data. On the whole I’ve been disappointed by the progress in text analytics. Still — and this overlaps with some previous points — there’s a lot of information in text or narrative form out there for the taking.
- Internally you might have customer emails, call center notes, warranty reports and a lot more.
- Externally there’s a lot of social media to mine.

*Sadly, Cantor Fitzgerald later became famous for being hit especially hard on 9/11/2001.

And then there’s my favorite example of all. Several decades ago, especially in the 1990s, supermarkets and mass merchants implemented point-of-sale (POS) systems to track every item sold, and then added loyalty cards through which they bribed their customers to associate their names with their purchases. Casinos followed suit. Airlines of course had loyalty/frequent-flyer programs too, which were heavily related to their marketing, although in that case I think loyalty/rewards were truly the core element, with targeted marketing just being an important secondary benefit. Overall, that’s an awesome example of aggressive data gathering. But here’s the thing, and it’s an example of why I’m confused about the value of data — I wouldn’t exactly say that grocers, mass merchants or airlines have been bastions of economic success. Good data will rarely save a bad business.

Related links

I first wrote up this point in a 2005 Computerworld column, and added a text-analytics nuance a year later, but since then I seem to have talked about it much more than I’ve written it down.
Please always keep in mind the risks to privacy in whatever you do.

Thoughts on SaaS

Curt Monash — Mon, 25 Nov 2013 01:16:05 +0000

Generalizing about SaaS (Software as a Service) is hard. To prune some of the confusion, let’s start by noting:

SaaS has been around for over half a century, and at times has been the dominant mode of application delivery.
The term multi-tenancy is being used in several different ways.
Multi-tenancy, in the purest sense, is inessential to SaaS. It’s simply an implementation choice that has certain benefits for the SaaS provider. And by the way, …
… salesforce.com, the chief proponent of the theory that true multi-tenancy is the hallmark of true SaaS, abandoned that position this week.
Internet-based services are commonly, if you squint a little, SaaS. Examples include but are hardly limited to Google, Twitter, Dropbox, Intuit, Amazon Web Services, and the company that hosts this blog (KnownHost).
Some of the core arguments for SaaS’ rise, namely the various efficiencies of data center outsourcing and scale, apply equally to the public cloud, to SaaS, and to AEaaS (Anything Else as a Service).
These benefits are particularly strong for inherently networked use cases. For example, you really don’t want to be hosting your website yourself. And salesforce.com got its start supporting salespeople who worked out of remote offices.
In theory and occasionally in practice, certain SaaS benefits, namely the outsourcing of software maintenance and updates, could be enjoyed on-premises as well. Whether I think that could be a bigger deal going forward will be explored in future posts.

For smaller enterprises, the core outsourcing argument is compelling. How small? Well:

What’s the minimum level of IT operations headcount needed for mission-critical systems? Let’s just say “several”.
What does that cost? Fully burdened, somewhere in the six figures.
What fraction of the IT budget should such headcount be? As low a double digit percentage as possible.
What fraction of revenues should be spent on IT? Some single-digit percentage.

So except for special cases, an enterprise with less than $100 million or so in revenue may have trouble affording on-site data processing, at least at a mission-critical level of robustness. It may well be better to use NetSuite or something like that, assuming needed features are available in SaaS form.*

*Truth be told, I’m not up to speed on mid-range SaaS application suite alternatives.
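
The back-of-envelope reasoning behind the $100 million figure can be written out as arithmetic. The specific inputs below are assumptions picked from within the ranges stated above ("six figures", "low double digit percentage", "single-digit percentage"); different picks move the threshold considerably.

```python
def minimum_revenue(ops_cost, ops_pct_of_it_budget, it_pct_of_revenue):
    """Work backwards from IT operations cost to the revenue that supports it."""
    it_budget = ops_cost * 100 / ops_pct_of_it_budget
    revenue = it_budget * 100 / it_pct_of_revenue
    return it_budget, revenue


# Illustrative assumptions only:
#   several operations staff costing $600,000 in total, fully burdened
#   operations staff held to 15% of the IT budget
#   IT spending at 4% of revenue
it_budget, revenue = minimum_revenue(600_000, 15, 4)
```

With these inputs the implied IT budget is $4 million and the implied revenue is $100 million. Halving the operations cost or doubling the IT share of revenue would each cut the threshold in half, so the figure is best read as an order-of-magnitude estimate.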

Continuing that thought: if you’re a mid-range application software provider, you have to develop a SaaS version of your product line. That’s a very different business model than the apps + OEMed platform you’re probably providing now, but it’s the best way to serve your customers going forward. And by the way, while mid-range application software is commonly sold on a regional basis, SaaS can be sold more globally; after all, the need for onsite service is eliminated, and price points should in most cases fit with telephone sales. Yes, national language and regional data privacy rules are both concerns, but they still leave the available markets looking much bigger than regional resellers have traditionally enjoyed. So expect shake-outs in a whole lot of vertical markets, as vendors horn in on each other’s territories, and a few elephantine winners perhaps emerge.

The argument above assumes that extreme reliability is needed. So there’s nothing necessarily wrong with a small team of business analysts sticking an RDBMS appliance* in a corner and managing it themselves. If it sputters from time to time, who cares; using it still may be easier than getting that data in and out of the cloud. But eventually, if all the data is remote anyway — SaaS, website, etc. — then it may make sense to do analytics remotely as well.

*Previously, that appliance might have been from Netezza; now, my first thought is the cheaper — albeit more limited — Infobright.

The arguments that direct smaller companies toward SaaS apply to large enterprises too, but they aren’t as dispositive. Larger enterprises can actually afford to do their own IT operations if they want to. What’s more, moving away from in-house operations is harder for big firms, due to the larger and more customized portfolio of legacy systems they’re likely to have. That said:

Almost all enterprises should have their internet-facing systems offsite, even if just via co-location. The core reasons are that ingesting high-volume inbound network traffic is inherently difficult, and security issues make it much tougher yet. In addressing these challenges, specialists enjoy significant economies of scale.
Most enterprises will have plenty of SaaS silos. If nothing else:
- Complex machinery will increasingly “phone home” for help staying in good working order. That’s a form of SaaS.
- Information providers and aggregators tend to deliver via SaaS.
- Various kinds of collaboration and communication apps, from Google Mail to Dropbox, live in the cloud. Personal productivity applications, from word processing to Photoshop, may be following.
- “Rodney Dangerfield” departments — i.e., ones unhappy with the respect and attention they get from central IT — often turn to SaaS or similar outsourcing. Human resources is an obvious example, from Automatic Data Processing to Employease to, these days, Workday.

That leaves us with the questions as to when and how large enterprises should or will move their core applications to SaaS and/or the cloud. Given the length of this post, I won’t try to answer them now. But for starters:

Enterprises don’t like to rip and replace their apps, except in consolidation projects, as long as they can avoid doing so.
Cloud/remote computing economies are less convincing if you already have your computer rooms staffed and set up.
A key benefit of SaaS is that vendors control and drive the upgrade cycles. One cost of that is restrictions on customization, although you can also build apps and app extensions on PaaS/DBaaS/WaaS (Platform/DataBase/Whatever as a Service) offerings such as force.com.
Lock-in is a serious concern, for application and platform offerings alike. Not only are you betting on one vendor’s software black box, you’re also betting on its remote computing operation. If you grow dissatisfied with either, or with their pricing, you may not have much opportunity to escape.

Amazon Redshift and its implications

Curt Monash — Sun, 09 Dec 2012 16:59:10 +0000

Merv Adrian and Doug Henschen both reported more details about Amazon Redshift than I intend to; see also the comments on Doug’s article. I did talk with Rick Glick of ParAccel a bit about the project, and he noted:

Amazon Redshift is missing parts of ParAccel, notably the extensibility framework.
ParAccel did some engineering to make its DBMS run better in the cloud.
Amazon did some engineering in the areas it knows better than ParAccel — cloud provisioning, cloud billing, and so on.

“We didn’t want to do the deal on those terms” comments from other companies suggest ParAccel’s main financial take from the deal is an already-reported venture investment.

The cloud-related engineering was mainly around communications, e.g. strengthening error detection/correction to make up for the lack of dedicated switches. In general, Rick seemed more positive on running in the (Amazon) cloud than analytic RDBMS vendors have been in the past.

So who should and will use Amazon Redshift? For starters, I’d say:

If data isn’t already in the Amazon cloud, getting it there remains a pain. Locating your analytic RDBMS on the same premises where the data is created makes life simpler.
Over 3 years ago, $20,000/terabyte was a great list price for purchasing a data warehouse appliance that required little administration. Imagine negotiated discounts and further declines from there. Even so, Amazon’s <$1K/terabyte/year is a low figure.
Amazon’s marketing suggests companies should put their whole data warehousing on Redshift. But in fact, that almost never happens even with ParAccel.

Also — if Amazon Redshift is your analytic RDBMS, what’s the rest of your analytic environment? I can think of three possibilities that could work pretty straightforwardly:

Business intelligence and just BI.
Statistics and just statistics.
Hadoop (i.e. Elastic MapReduce) plus a lot of hand-coding.

Anything else would seem hard to stitch together at this time.

Putting that together, I see three kinds of users for whom Amazon Redshift might make sense:

Web startups, whose data is all in the Amazon cloud anyway, and who need better analytic SQL performance than they can get from Hadoop.
Data mart outsourcers/data sellers, again probably startups, whose whole business is in the cloud.
Individual analysts with small budgets, or very small analytic groups within enterprises or other organizations.

All three of those are “traditional” markets for new-generation analytic DBMS and data warehouse appliances, except that those DBMS are rarely put into production in the cloud. But for the most part, vendors have moved upscale — enterprise users, analytic platform features, etc. So the biggest threat from Amazon Redshift is to markets that other vendors have somewhat left behind.

So how should and will the analytic RDBMS industry respond? My thoughts on that begin:

Doing nothing would be a poor choice.
They’re already open to having cheap or free low-end offerings — Vertica Community Edition, open-source Infobright, and so on.
Tweaking their systems to work well in the cloud becomes easier all the time, as cloud platforms mature.
A natural solution would be something like a Starter/Standard/Enterprise Edition split, with at least the Starter and Standard Editions being cloud-friendly.

Introduction to Metamarkets and Druid

Curt Monash — Sat, 16 Jun 2012 21:51:14 +0000

I previously dropped a few hints about my clients at Metamarkets, mentioning that they:

Have built vertical-market analytic platform technology.
Use a lot of Hadoop.
Throw good parties. (That’s where the background photo on my Twitter page comes from.)

But while they’re a joy to talk with, writing about Metamarkets has been frustrating, with many hours and pages of wasted effort. Even so, I’m trying again, in a three-post series:

Introduction to Metamarkets and Druid (this post)
Druid overview
Metamarkets’ back-end technology

Much like Workday, Inc., Metamarkets is a SaaS (Software as a Service) company, with numerous tiers of servers and an affinity for doing things in RAM. That’s where most of the similarities end, however, as Metamarkets is a much smaller company than Workday, doing very different things.

Metamarkets’ business is SaaS (Software as a Service) business intelligence, on large data sets, with low latency in both senses (fresh data can be queried on, and the queries happen at RAM speed). As you might imagine, Metamarkets is used by digital marketers and other kinds of internet companies, whose data typically wants to be in the cloud anyway. Approximate metrics for Metamarkets (and it may well have exceeded these by now) include 10 customers, 100,000 queries/day, 80 billion 100-byte events/month (before summarization), 20 employees, 1 popular CEO, and a metric ton of venture capital.

To understand how Metamarkets’ technology works, it probably helps to start by realizing:

Metamarkets has one technology stack for receiving and managing data when it is ingested in batch mode.
Metamarkets has a different, overlapping technology stack for receiving and managing data when it is ingested in streaming mode.
Metamarkets is open-sourcing part of the two stacks, called Druid.
In the Venn diagram for these three things, the intersection of no two of them is strictly contained in the third.

and further:

Metamarkets doesn’t surface all the raw data for analysis or viewing. Rather, there’s some early aggregation, with the raw data preserved off to the side in case you want to create more aggregates later on.
Metamarkets’ application is a dashboard, supporting drilldown but not, at this time, other forms of analytics. A lot of what is measured are time series and/or top lists.
Druid is in essence an analytic DBMS; indeed, it’s so strictly analytic that it isn’t suited to manage its own metadata. MySQL is used for that.
Apache ZooKeeper is also assumed as part of the environment to manage Druid.
The batch pipeline relies on Hadoop.
The streaming pipeline relies on Kafka (a publish-subscribe project out of LinkedIn).
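The early aggregation mentioned above can be pictured with a small Python sketch. This is not Metamarkets' or Druid's code. The event fields, the hourly granularity, and the function names are all hypothetical, chosen only to show raw events being rolled up into coarser rows while the raw list stays available for building other aggregates later.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw events: (timestamp, publisher, country, impressions, revenue).
RAW_EVENTS = [
    ("2012-06-16T21:05:10", "pub_a", "US", 1, 0.10),
    ("2012-06-16T21:40:02", "pub_a", "US", 1, 0.20),
    ("2012-06-16T21:59:59", "pub_b", "DE", 1, 0.05),
    ("2012-06-16T22:00:00", "pub_a", "US", 1, 0.30),
]


def hour_bucket(timestamp: str) -> str:
    """Truncate an ISO timestamp to the start of its hour."""
    parsed = datetime.fromisoformat(timestamp)
    return parsed.replace(minute=0, second=0, microsecond=0).isoformat()


def roll_up(events):
    """Aggregate raw events by (hour, publisher, country)."""
    totals = defaultdict(lambda: {"events": 0, "impressions": 0, "revenue": 0.0})
    for timestamp, publisher, country, impressions, revenue in events:
        key = (hour_bucket(timestamp), publisher, country)
        totals[key]["events"] += 1
        totals[key]["impressions"] += impressions
        totals[key]["revenue"] += revenue
    return dict(totals)


# RAW_EVENTS is left untouched, so other aggregates can be derived later.
ROLLED_UP = roll_up(RAW_EVENTS)
for key, metrics in sorted(ROLLED_UP.items()):
    print(key, metrics)
```

Here four raw events collapse into three aggregate rows. With real traffic, where many events share the same hour and dimension values, the reduction would be far larger.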

The whole thing is fully multi-tenant, at least by the point that data is being stored and visualized. Metamarkets customers either live in the Amazon cloud (the smaller ones), or else used to live there and don’t mind shipping their data back there for analysis by Metamarkets. Some “not exactly Ted Codd’s tabular DBMS” features are:

Multi-valued fields (just vectors, not unlimited arrays).
A couple of fast approximate algorithms (uniques, top lists).
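I haven't described which approximate algorithms Druid actually uses, so the following Python sketch should be read only as a generic illustration of the idea behind fast approximate uniques. It uses the k-minimum-values technique: hash every value into the interval [0, 1), keep only the k smallest hashes, and estimate the distinct count from how small the k-th smallest hash is. The function names and the choice of k = 256 are assumptions.

```python
import hashlib
import heapq


def _hash_to_unit_interval(value: str) -> float:
    """Map a string to a pseudo-random float in [0, 1)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def approximate_uniques(values, k=256):
    """Estimate the number of distinct values from the k smallest hashes."""
    heap = []     # negated hashes, so -heap[0] is the largest hash kept
    kept = set()  # the hashes currently in the heap, for duplicate checks
    for value in values:
        h = _hash_to_unit_interval(value)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # Replace the largest kept hash with this smaller one.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct hashes: count is exact
    return (k - 1) / -heap[0]


# 50,000 events from 10,000 hypothetical users.
stream = [f"user{i % 10000}" for i in range(50000)]
print(round(approximate_uniques(stream)))
```

Memory use is bounded by k no matter how many events arrive, which is the appeal of this family of algorithms. The price is an estimate that is typically off by a few percent at this value of k.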

One thing Metamarkets does that’s pretty much a best practice these days is roll out new code, mid-day if they like, without ever taking their system down. Why is this possible? Because the data is replicated across nodes, so you can do a rolling deployment of a node at a time without making any data unavailable. Notes on that include:

Performance could be affected, as the read load is generally balanced across all the data replicas.
Data locking is not an issue — Metamarkets doesn’t have any read locks, as Druid is an MVCC (Multi-Version Concurrency Control) system.
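The replication argument can be checked with a toy model in Python. The node names, segment names, and placement below are hypothetical, and this is a sketch of the reasoning, not of Metamarkets' deployment tooling. It asks whether every data segment still has at least one live replica while nodes are restarted in batches.

```python
# Hypothetical cluster layout: each node holds replicas of some data segments.
PLACEMENT = {
    "node1": {"seg_a", "seg_b"},
    "node2": {"seg_b", "seg_c"},
    "node3": {"seg_c", "seg_a"},
}


def available_segments(placement, down_nodes):
    """Return the segments served by at least one node that is still up."""
    live = set()
    for node, segments in placement.items():
        if node not in down_nodes:
            live |= segments
    return live


def rolling_deploy_is_safe(placement, batch_size=1):
    """True if every segment stays queryable while nodes restart in batches."""
    all_segments = set().union(*placement.values())
    nodes = sorted(placement)
    for start in range(0, len(nodes), batch_size):
        down = set(nodes[start:start + batch_size])
        if available_segments(placement, down) != all_segments:
            return False
    return True


print(rolling_deploy_is_safe(PLACEMENT, batch_size=1))  # True
print(rolling_deploy_is_safe(PLACEMENT, batch_size=2))  # False
```

With two replicas of each segment, restarting one node at a time never makes data unavailable, but restarting two at once can. The model says nothing about performance, which can still suffer because the remaining replicas absorb the read load.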

Thinking about market segments

Curt Monash — Tue, 01 May 2012 11:00:08 +0000

It is a reasonable (over)simplification to say that my business boils down to:

Advising vendors what/how to sell.
Advising users what/how to buy.

One complication that commonly creeps in is that different groups of users have different buying practices and technology needs. Usually, I nod to that point in passing, perhaps by listing different application areas for a company or product. But now let’s address it head on. Whether or not you care about the particulars, I hope the sheer length of this post reminds you that there are many different market segments out there.

Last June I wrote:

In almost any IT decision, there are a number of environmental constraints that need to be acknowledged. Organizations may have standard vendors, favored vendors, or simply vendors who give them particularly deep discounts. Legacy systems are in place, application and system alike, and may or may not be open to replacement. Enterprises may have on-premise or off-premise preferences; SaaS (Software as a Service) vendors probably have multitenancy concerns. Your organization can determine which aspects of your system you’d ideally like to see be tightly integrated with each other, and which you’d prefer to keep only loosely coupled. You may have biases for or against open-source software. You may be pro- or anti-appliance. Some applications have a substantial need for elastic scaling. And some kinds of issues cut across multiple areas, such as budget, timeframe, security, or trained personnel.

I’d further say that it matters whether the buyer:

Is a large central IT organization.
Is the well-staffed IT organization of a particular business department.
Is a small, frazzled IT organization.
Has strong engineering or technical skills, but less in the way of IT specialists.
Is trying to skate by without much technical knowledge of any kind.

Now let’s map those considerations (and others) to some specific market segments.

Traditional large enterprises’ central IT organizations commonly:
- Favor large, proven vendors and well-accepted IT methodologies.
- Would like to consolidate their IT vendors as much as possible.
- Have major challenges with legacy systems and data integration …
- … which are often exacerbated by mergers.
- Spend a lot of cycles on bureaucracy and company politics.
- Notwithstanding the foregoing, have resources to invest in some “sizzle” initiatives.
The very largest enterprises are more likely than their slightly smaller counterparts to:
- View IT as a potential area of competitive differentiation.
- Believe much of what they do should be custom, due to their unique needs and resources.
- Experiment with unproven technologies.
Smaller enterprises may:
- Have small, generalist, overwhelmed staffs.
- Hope for turnkey application solutions (SaaS or otherwise).
- Get very committed to/reliant on a small number of vendors.
In particular, IBM or Microsoft loyalists can be:
- Extremely locked into their preferred vendor’s strategies.
- Not very fruitful for rival vendors to attempt to sell to.
Humongous consumer internet companies tend to:
- Have very high opinions of themselves and their technical abilities.
- Be open source zealots, for reasons both of free-like-beer and free-like-speech.
- In particular, not want to buy anybody else’s software.
- Not be big fans of relational database designs.
Other large consumer internet companies tend to:
- Be like the humongous ones they look up to, but maybe not to the same extremes.
- In particular, be more willing to pay for software.
- Be mired in company politics only/mainly to the extent they are both large and old(er).
Smaller consumer internet companies tend to:
- Be like the large ones they look up to, but …
- … be quite short on traditional IT skills, and work around that shortage by reinventing various wheels.
Business-oriented SaaS (Software as a Service) companies commonly:
- Are drawn to the cool open source technologies consumer internet companies use …
- … but may wind up using more traditional kinds of DBMS, for the same reasons those DBMS are used in other business applications.
- Are more primitive in the analytic capabilities they offer their customers than I think they should be (analytics-only vendors sometimes excepted).
- Are refreshingly free of traditional IT politics, because technology is too important to them to mess around with too badly. (Of course, any other kinds of company politics may still come into play.)
Internet operations of traditional enterprises:
- Sometimes are just like stand-alone internet businesses.
- Sometimes are just like — and part of — the rest of the enterprise’s IT operations.
- More commonly are somewhere in between.
Marketing departments of traditional enterprises sometimes:
- Want to do their own data acquisition, management, and/or analysis …
- … without having great IT resources of their own.
- Invest in departmental analytics efforts or even …
- … have line executives who are analytically proficient.
- Make heavy use of SaaS, as an alternative to relying on central IT, or as a natural byproduct of acquiring third-party data.
Large investment firms commonly:
- Have numerous departments, each with its own IT experts.
- Care about sub-millisecond latency …
- … and sub-week time-to-value.
- Experience return-on-investment in a very different way than most businesses do.
Telecom service companies commonly differ from other similarly-sized enterprises in that:
- They are more aggressive about using innovative technology to manage (and analyze) data.
- They somewhat resemble investment firms in having multiple departments that each have broad engineering discretion.
National security customers often:
- Want the best, cutting-edge, sometimes custom technology, yet …
- … make themselves very cumbersome to sell to and support.
- Are not forthcoming about how they use what they buy.

I could keep going for quite a while — but for now I won’t. Vertical markets I’m thus omitting include but are not limited to:

Pharmaceutical researchers
Hospitals
Insurers
Academic scientists

Finally, for yet another omission — in my original outline I contemplated distinguishing among various geographical areas, with my first-pass segmentation being:

North America
Europe
Japan
China
Smaller geographies

Notes on the ClearStory Data launch, including an inaccurate quote from me

Curt Monash — Mon, 26 Mar 2012 09:03:33 +0000

ClearStory Data launched, with nice coverage in the New York Times, Computerworld, and elsewhere. But from my standpoint, there were some serious problems:

(Bad.) I was planning to cover the launch as well, in a split exclusive, but that plan was changed, costing me considerable wasted work.
(Worse.) I wasn’t told of the change as soon as it was known. Indeed, I wasn’t told at all; I was left to infer it from the fact that I was now being asked to talk with other reporters.
(Horrific.) I was quoted in the ClearStory launch press release, but while the sentiments were reasonably in line with my own, the quote was incorrect.*

I’m utterly disgusted with this whole mess, although after talking with her a lot I’m fine with CEO Sharmila Mulligan’s part in it, which is to say with ClearStory’s part in general.

*I avoid the term “platform” as much as possible; indeed, I still don’t really know what the “new platforms” part was supposed to refer to. The Frankenquote wound up with some odd grammar as well.

Actually, in principle I’m a pretty close adviser to ClearStory (for starters, they’re one of my stealth-mode clients). That hasn’t really ramped up yet; in particular, I haven’t had a technical deep dive. So for now I’ll just say:

1. I’m a huge Sharmila fan. I worked with her a lot at Aster, and she was that rarity — a chief marketing officer who excelled at all aspects of marketing, process management perhaps aside. (Aster marketing process management actually worked pretty well; but Steve Wooledge was there even before Sharmila, and Steve’s great at that stuff.)

Sharmila also has been a spot-on adviser to several other start-ups. I generally tell start-ups they’d do well to talk with her, and vice-versa.

2. Of ClearStory’s two techie cofounders, I interacted with John Cieslewicz a bit at Aster, and all my impressions are favorable.

3. My eyes glaze over a bit at the “cool BI UI” part of the story. I’m sure it will be wonderful, and cool business intelligence demos are really important for getting business. Even so, I think user interface is not what will make or break ClearStory. It’s also not the background of ClearStory’s founders.

4. What’s really important technically at ClearStory, I believe, will be the middleware. “Semantic layer” requirements are much more demanding than they used to be, in at least two dimensions:

Semantics as the data arrives.
What you do with the data after you tame it.

Applications of an analytic kind

Curt Monash — Sun, 12 Feb 2012 01:32:17 +0000

The most straightforward approach to the applications business is:

Take general-purpose technology and think through how to apply it to a specific application domain.
Produce packaged application software accordingly.

However, this strategy is not as successful in analytics as in the transactional world, for two main reasons:

Analytic applications of that kind are rarely complete.
Incomplete applications rarely sell well.

I first realized all this about a decade ago, after Henry Morris coined the term analytic applications and business intelligence companies thought it was their future. In particular, when Dave Kellogg ran marketing for Business Objects, he rattled off an argument to the effect that Business Objects had generated more analytic app revenue over the lifetime of the company than Cognos had. I retorted, with only mild hyperbole, that the lifetime numbers he was citing amounted to “a bad week for SAP”. Somewhat hoist by his own petard, Dave quickly conceded that he agreed with my skepticism, and we changed the subject accordingly.

Reasons that analytic applications are commonly less complete than the transactional kind include:

Transactional apps often serve to automate rigid business processes. Analytic technology use is inherently more flexible and varied.
Transactional apps are often used by cheaper/lower-status people. Analytic technology may be used by managers who treasure the right of individualized decision making.

There are indeed scenarios in which incomplete analytic applications can be useful. For example:

If a user has sufficiently simple needs, cookie-cutter analytic apps, perhaps offered on a SaaS (Software as a Service) basis, might suffice.
Small teams of technical workers can kick-start their analytic efforts with pre-built booster kits. Two examples come to mind:
- SAS Institute has done quite well with statistical “applications” that really are just accelerators for custom statistical work of the usual kind.
- Starter-kit data models for data warehousing have some value as well.

But otherwise, I think the best opportunities for application-specific analytic technology aren’t really classical “analytic apps”. Rather, they arise in three sometimes-overlapping areas, adjacent to the analytic application core:

Operational applications enhanced with some analytics so as to improve routine business processes.
Information services enhanced with some analytic technology that retrieves (and perhaps also helps analyze) the information.
Analytic-application-specific “platform” technology.

Operational applications have been enhanced with analytics for as long as we have had reports. Indeed, meeting that reporting need was the core business for Crystal Reports, the only business intelligence company ever to build a large OEM/VAR business (it was eventually merged into Business Objects). Analytic enhancement is also a major direction for application behemoths Oracle and SAP, but I won’t address that aspect in this post.

If you offer a service whose essence is tabular-structured information — e.g. a third-party data source or some stakeholder-facing analytics — then you also need to provide business intelligence capability to the information’s consumers. Too often, however, those BI capabilities are unimpressive, and there’s an “easy” improvement in upgrading them that should happen before more serious analytic-app capabilities are addressed.

What I’m most excited about right now is analytic-application-specific “platform” technology, an area in which I’ve sensed a groundswell of interest over the past 6-12 months. It’s at the heart of a significant fraction of the new startup ideas I’m hearing, and rightly so; on the other hand, it’s also been going on for decades. Here is a grab-bag of examples.

Simulation and optimization have been around since the 1970s, if not before. One cool effort was by River Logic, which developed a visual programming language especially geared to profitability/logistics kinds of simulations in the 1990s. (While still around, the company unfortunately doesn’t seem to have done much for the past 1 – 1 1/2 decades.)
Much more established is SAP’s APO (Advanced Planner and Optimizer), dating back to at least the 1990s. Given the magnitude of the mixed-integer programming problems it tackles, I would conjecture it includes some built-in domain-specific heuristics you might not find in a generic set of mathematical packages.
The financial services industry has long featured domain-specific technology. From the 1800s through the 1970s, this was focused on communications, from stock tickers (one of Thomas Edison’s first important inventions) to networks of stock quote machines. In the 1980s, that expanded to include what we’d recognize even today as real-time business intelligence tools, and then also to complex security-valuation analytics.
What’s more, the whole area of CEP/streaming has traditionally been focused on financial trading, for reasons including low latency, time series orientation, and the opportunity to parameterize queries across a broad set of ticker symbols.
Despite a lot of application potential, general-purpose text analytics technology has floundered. But when text analytics technology is specifically extended for marketing applications, it does better. Indeed, marketing applications don’t use general-purpose text mining technology to its fullest power. But they do add the relatively new analytic techniques of sentiment analysis. They further add capabilities to analyze short, ungrammatical “verbatims”, such as text messages.
My clients at Metamarkets — Mike Driscoll et al. — have built a pretty cool technology stack focused on real-time/in-memory BI, well-suited for digital advertising and similar markets. I question whether it has much applicability outside of that space, however, because every industry that I can think of that needs real-time BI needs something rather different.
WibiData is focused in a similar area, but on actually personalizing things rather than on monitoring personalization’s effects. WibiData believes this requires aggressive use of derived data and the associated schema evolution.
Log analyzer Sumo Logic probably doesn’t rely on an off-the-shelf machine learning engine.
Other apparent examples showed up in the comment thread to my November, 2011 post on agile predictive analytics.

It will be fascinating to see how this all plays out.