In Part 1 of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I’ll cover four more kinds of analytic database — even newer, for the most part, with a use case/product short list match that is even less clear.
Big bit bucket
- Kinds of data likely to be included: Logs, other technical/external
- Likely use styles: Staging/ETL, investigative
- Canonical example: Log files in a Hadoop cluster
- Stresses: TCO, scale-out, transform/big-query performance, ETL functionality
With the explosion of machine-generated data has come the need for a place to put it all, sometimes called the big bit bucket. This is like the investigative data mart use case, but bigger and more poly-structured. In some cases it is focused on data staging and transformation; but it can also be used for analysis in place.
The list of candidate technologies to run your bit bucket starts with Hadoop and Splunk.
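To make the staging/transform role concrete, here is a minimal sketch of the kind of parse-and-aggregate step a bit bucket hosts. The log format and field names are hypothetical; in a real deployment, logic like this would run as a Hadoop MapReduce (or Hive/Pig) job over files in HDFS rather than as a single-process script.

```python
import re
from collections import Counter

# Hypothetical Apache-style access-log lines; a real bit bucket would hold
# terabytes of these and run the same transform as a distributed job.
RAW_LOGS = [
    '10.0.0.1 - - [01/Aug/2011:10:00:00] "GET /index.html HTTP/1.1" 200 1043',
    '10.0.0.2 - - [01/Aug/2011:10:00:01] "GET /missing HTTP/1.1" 404 512',
    '10.0.0.1 - - [01/Aug/2011:10:00:02] "POST /login HTTP/1.1" 200 230',
]

LINE_RE = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\w+) (\S+) [^"]*" (\d{3}) (\d+)$'
)

def parse(line):
    """Transform one raw log line into a structured record (the ETL step)."""
    m = LINE_RE.match(line)
    if not m:
        return None  # malformed lines are routine in machine-generated data
    ip, ts, method, path, status, size = m.groups()
    return {"ip": ip, "ts": ts, "method": method, "path": path,
            "status": int(status), "bytes": int(size)}

records = [r for r in (parse(line) for line in RAW_LOGS) if r]
status_counts = Counter(r["status"] for r in records)  # analysis in place
print(status_counts)  # Counter({200: 2, 404: 1})
```

The same records could either be loaded onward into a relational analytic DBMS (the staging role) or queried where they sit (analysis in place).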
Archival data store
- Kinds of data likely to be included: Operational, CDR (call detail record), security log
- Likely use styles: Archival, reporting (for compliance), possibly also investigative
- Examples: Any long-term detailed historical store
- Stresses: TCO, compression, scale-out, performance (if multi-use)
Analytic DBMS vendors have been insulting each other with the claim “that’s just an archival data store,” dating back at least to the first time Greenplum was deployed on an underpowered Sun Thumper system. Perhaps only Rainstor truly embraces the archival positioning, and I’ve become pretty dubious about their technical claims and their company alike.
Still, there’s a legitimate need for data stores (especially relational analytic DBMSs) that:
- Store data cheaply, with high rates of compression.
- Have decent performance if you do want to query the data.
- May have archiving/compliance-specific features as well.
Along with Rainstor, SAND and SenSage have at least partially targeted that use case. In addition, appliance vendors such as Teradata and Netezza try to keep an archive-oriented version in their product lineups.
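The "store data cheaply" requirement comes down largely to compression ratio. As a rough illustration of why repetitive detail records (CDRs, logs) compress so well, here is a sketch using synthetic records and ordinary zlib as a stand-in for whatever columnar or archival codec a real product applies; the field names and the 5x threshold are illustrative, and specialized vendors claim far higher ratios.

```python
import json
import zlib

# Synthetic call-detail-style records: highly repetitive, like most
# machine-generated archival data, so even generic compression bites hard.
records = [
    {"caller": f"+1555000{i % 50:04d}", "callee": "+15551234567",
     "duration_s": 60 + (i % 30), "cell_id": i % 10}
    for i in range(10_000)
]

raw = "\n".join(json.dumps(r, sort_keys=True) for r in records).encode()
packed = zlib.compress(raw, level=9)

ratio = len(raw) / len(packed)
print(f"{len(raw)} bytes -> {len(packed)} bytes (~{ratio:.0f}x)")
```

Column-oriented storage, dictionary encoding, and delta encoding on sorted data are the usual ways archival-focused products push well past what a generic byte-stream compressor achieves.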
Outsourced data mart
- Kinds of data likely to be included: All
- Likely use styles: Traditional BI, investigative analytics, staging/ETL
- Examples: Advertising tracking, SaaS CRM
- Stresses: Performance, TCO, reliability, concurrency
Much of what happens in analytic database management can also be outsourced. Some applications that run via SaaS (Software as a Service) are analytic. I’ve had three different clients whose main business is picking marketing targets in various vertical segments; others who wanted to add analytics to what were historically OLTP applications; and yet others who just offered online business intelligence. Also, if your fundamental business is gathering data and reselling it to a variety of user organizations, that’s an analytic data management challenge. The possibilities expand from there.
Data outsourcers are in the IT business, and so their IT development is — hopefully! — more serious and less politically encumbered than at many conventional enterprises. Thus, legacy systems and master data management issues are commonly less prevalent, or at least more aggressively disposed of. The same, up to a point, goes for vendor politics.* Multitenancy is commonly an issue, as is running in the cloud.
*Even so, there’s often That Guy who doesn’t want to migrate away from Oracle, no matter what.
Vertica gets the nod in a number of these cases; it’s cloud-friendly, and often the problem is naturally columnar. Other columnar products can be good choices too, with added brownie points for Infobright if the shop is MySQL-oriented anyway. Running Netezza or other appliances makes sense mainly if you’re pretty sure you want to keep operating your own data centers, but some data outsourcers are just fine with that assumption.
Operational analytic(s) server
- Kinds of data likely to be included: Customer-centric, log, financial trade
- Likely use styles: Advanced operational analytics
  - Lower latency: Web or call-center personalization, anti-fraud
  - Higher latency: Customer profiling, Basel 3 risk analysis
- Stresses: Performance, reliability, analytic functionality, perhaps concurrency
Even with eight different choices, I need a “catch-all” category; this is it.
Suppose you want to do reasonably sophisticated analytics, then use the results in operations. This is the classical challenge in integrating short-request and analytic processing. There are multiple ways to tackle it, embodying different trade-offs among cost, convenience, and analytic accuracy. If the platform on which you want to run your investigative analytics also has the reliability and concurrency appropriate for mission-critical operations, you’re set. Otherwise, you may want to pipe derived data into a more “industrial-strength” DBMS, ideally the one that runs your operational apps anyway.
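One minimal version of the "pipe derived data" pattern: the analytic side precomputes per-customer aggregates in batch, and the operational app serves them with a cheap keyed lookup rather than a big query. The table name and the scoring rule below are made up for illustration, and SQLite stands in for the industrial-strength operational DBMS.

```python
import sqlite3

# SQLite stands in for the operational DBMS; table and column names are hypothetical.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE customer_scores (customer_id TEXT PRIMARY KEY, churn_risk REAL)"
)

# Batch side: the analytic platform derives a score (here, a toy rule on
# support-ticket counts) and pipes the results into the operational store.
tickets = {"c1": 42, "c2": 3, "c3": 17}
derived = [(cid, min(1.0, n / 40)) for cid, n in tickets.items()]
db.executemany("INSERT INTO customer_scores VALUES (?, ?)", derived)
db.commit()

# Request side: a call-center or web app does a keyed lookup at interaction time.
def churn_risk(customer_id):
    row = db.execute(
        "SELECT churn_risk FROM customer_scores WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return row[0] if row else 0.0

print(churn_risk("c1"))  # 1.0 (42/40, capped)
print(churn_risk("c2"))  # 0.075
```

The trade-off is exactly the one named above: the served score is only as fresh as the last batch run, in exchange for operational-grade latency and reliability.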
Another option is to integrate a limited amount of analytics immediately into your short-request processing system. For example, as bad as they are at the kinds of queries that require joins, NoSQL systems are often fast at simple aggregations. As MapReduce/NoSQL integrations mature, that option may not require pumping the data anywhere else for deeper analytics; even if it does, at least you’re starting out with the data in a convenient bit bucket.
Streaming/CEP-centric architectures could come into play as well. And it goes on from there. The possibilities in this last category are just too varied to generalize about.
So did I get them all? Or are there yet other analytic data management use cases that don’t fit into my eight categories?