I’d like to survey a few related ideas:
- Enterprises should each have a variety of different analytic data stores.
- Vendors — especially but not only IBM and Teradata — are acknowledging and marketing around the point that enterprises should each have a number of different analytic data stores.
- In addition to having multiple analytic data management technology stacks, it is also desirable to have an agile way to spin out multiple virtual or physical relational data marts using a single RDBMS. Vendors are addressing that need.
- Some observers think that the real essence of analytic data management will be in data integration, not the actual data management.
The idea of an analytic data store separate from your transactional one has been around since before the relational era. Its approximate evolution was:
- The DBMS aspects of 4GLs such as Focus.
- IBM’s relational “Information Center”, back when IBM thought transactions should still be in IMS.
- Early days of Teradata.
- Early days of MOLAP, when Ted Codd argued the analytic store should not be relational.
- The growth of relational data warehousing.
- The rise of the big bit bucket.
In the past, large DBMS vendors liked to argue that enterprises should have a single analytic data store — commonly known as an enterprise data warehouse (EDW) — but that theory holds ever less water. A sample of my writing on that subject includes:
- Two July, 2011 posts on eight kinds of analytic database.
- An April, 2010 debunking of the myth of the enterprise data warehouse.
- Lots of Hadoop coverage.
Recently, the big vendors have capitulated. In particular:
- Teradata introduced “purpose-built” data warehousing appliance product lines with a variety of configurations and price points.
- Teradata bought Aster Data.
- Teradata deemphasized the term “EDW” in favor of “IDW” (Integrated Data Warehouse), which is technically like an EDW, but doesn’t have to hold absolutely all your analytic data.*
- IBM bought Netezza.
- IBM then put out a marketing concept of “Smart Consolidation”, which incorporates four different kinds of analytic data store or quasi-store: DB2, Netezza, Hadoop (“BigInsights”), and Streams.**
- Oracle introduced Exalytics and the Oracle Big Data Appliance, to go with Exadata.
- Teradata made a Barney announcement with Hortonworks, emphasizing its love for Hadoop as a companion to Teradata Classic and Aster technology.***
- MarkLogic made an announcement with Hortonworks even more Barney than Teradata’s.
* Teradata also uses the term “ADW”, for Active Data Warehouse, which in essence means “Low latency! High concurrency! Rah rah rah!”
**Calling that “Smart Consolidation” is like naming a swinger club “Smart Fidelity”. But terminology aside, I endorse the idea.
***Teradata definitely expects its Hortonworks relationship to ascend beyond the Barney level; Tasso Argyros gave enough NDA details to be convincing about that. But it’s not there yet.
So data marts should often be managed by different technology than your core IDW. But even if you want to use the same technology, there are good reasons to have separate data marts, including the desire to manage:
- Derived data based on other data already in the data warehouse.
- Data that had never been put into the data warehouse in the first place.
- Data you get from outside your enterprise.
In each case, the point is that:
- Your normal data governance bureaucracy is an obstacle — reasonable or otherwise — to analysis of a particular set of data, but …
- … a separate data mart can serve as a safe workaround.
I call this data mart spin-out, and am no longer sure where I first picked up that term. Oliver Ratzesberger popularized the concept when he was at eBay, and then Greenplum ran with it.
More precisely, Greenplum ran with it from a marketing standpoint. Delivery of what eventually became Chorus was more like a crawl.
Data mart spin-out can be either physical, in which case there’s real data (re)copying going on, or virtual, in which case the whole thing is being done as a trick in the core DBMS software, especially its workload management subsystem. Virtual spin-out is faster, more flexible, and less costly, all else being equal. But it does lead to a more complex mixed-workload scenario, which you’re relying on your workload management technology to sort out.
- Oliver, a star Teradata customer while at eBay, built his version of the idea, which is virtual, on Teradata gear, so virtual is naturally the way Teradata itself has gone.
- Sybase has also gone with virtual data mart spin-out in Sybase IQ.
- ParAccel’s approach is also on the virtual side, but assumes a SAN (Storage-Area Network).
- Greenplum, last I looked, was on the physical side, apparently because that was all they could pull off.
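To make the physical/virtual distinction concrete, here is a minimal sketch in Python. It uses SQLite purely as a stand-in; none of the vendors above work this way internally, and the table, view, and function names are all hypothetical. Physical spin-out is modeled as copying rows into a separately attached database, virtual spin-out as a view over the core table. Workload management, which is the hard part of the virtual approach, is not modeled at all.

```python
import sqlite3


def build_warehouse() -> sqlite3.Connection:
    """Create a tiny in-memory stand-in for the core warehouse (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("east", 100.0), ("west", 250.0), ("east", 50.0)],
    )
    conn.commit()  # ATTACH (used below) must run outside an open transaction
    return conn


def spin_out_physical(conn: sqlite3.Connection) -> None:
    """Physical spin-out: rows are really copied into a separate store."""
    conn.execute("ATTACH DATABASE ':memory:' AS mart")
    conn.execute(
        "CREATE TABLE mart.east_sales AS "
        "SELECT region, amount FROM sales WHERE region = 'east'"
    )
    conn.commit()


def spin_out_virtual(conn: sqlite3.Connection) -> None:
    """Virtual spin-out: no copy, just a named window onto the core data."""
    conn.execute(
        "CREATE VIEW east_sales_v AS "
        "SELECT region, amount FROM sales WHERE region = 'east'"
    )
    conn.commit()


def mart_totals(conn: sqlite3.Connection) -> tuple:
    """Return (physical_total, virtual_total) for the two marts."""
    physical = conn.execute("SELECT SUM(amount) FROM mart.east_sales").fetchone()[0]
    virtual = conn.execute("SELECT SUM(amount) FROM east_sales_v").fetchone()[0]
    return physical, virtual


def demo() -> tuple:
    conn = build_warehouse()
    spin_out_physical(conn)
    spin_out_virtual(conn)
    # New data lands in the core warehouse after both marts exist.
    conn.execute("INSERT INTO sales VALUES ('east', 25.0)")
    conn.commit()
    return mart_totals(conn)
```

In this toy setup, the row added after the spin-out shows up in the virtual mart but not in the physical copy. That is the flexibility argument in miniature; the cost is that every query against the view competes with the rest of the warehouse workload.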
So where is this all going? Mark Beyer of Gartner came up with the term “Logical Data Warehouse” three years ago, and evidently has been trying to refine its definition ever since. Forrester Research has been known to mention similar-sounding ideas. At this point, Gartner still seems to be trying to recreate the EDW fallacy at a higher level of abstraction, which is going to work even less well than EDWs did.
Informatica, which one might think would be the biggest fan of the idea, doesn’t seem to have embraced it yet. But then, the whole thing sounds somewhat like Oracle’s 1990s Project Sedona, which was one of the bigger fiascos in software history, and certainly was the greatest failure of Informatica CEO Sohaib Abbasi’s distinguished career.
My own opinion is:
- It’s good for data stores, data sources, and data sinks to be accessible in as consistent a way as possible, and …
- … cataloguing data stores, sources, and sinks in some sort of live way is a worthy endeavor, but …
- … universal data mediators will never work, because tighter coupling will often be needed for reasons of performance, reliability, security, privacy, and/or economic/legal relationship.
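As a rough illustration of what “live” cataloguing could mean, here is a minimal Python sketch. It is not modeled on any product named in this post; the class names, the three role labels, and the probe callables are all assumptions made for the example. The catalog only records what exists and whether it currently answers. It does not try to mediate queries, which is the part I argue won’t work universally.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

ROLES = ("store", "source", "sink")  # hypothetical role labels


@dataclass
class CatalogEntry:
    name: str
    role: str
    probe: Callable[[], bool]  # returns True if the system answers right now
    reachable: bool = False


@dataclass
class DataCatalog:
    entries: Dict[str, CatalogEntry] = field(default_factory=dict)

    def register(self, name: str, role: str, probe: Callable[[], bool]) -> None:
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        self.entries[name] = CatalogEntry(name, role, probe)

    def refresh(self) -> None:
        """Re-run every probe so the catalog reflects current reality."""
        for entry in self.entries.values():
            try:
                entry.reachable = bool(entry.probe())
            except Exception:
                # A probe that blows up counts as "not reachable".
                entry.reachable = False

    def reachable_by_role(self, role: str) -> List[str]:
        return sorted(
            e.name for e in self.entries.values() if e.role == role and e.reachable
        )


def broken_probe() -> bool:
    """Simulates a source that is currently down."""
    raise ConnectionError("source is down")


catalog = DataCatalog()
catalog.register("warehouse", "store", lambda: True)
catalog.register("clickstream", "source", broken_probe)
catalog.refresh()
```

After the refresh, the warehouse is listed as a reachable store and the clickstream feed is not listed as a reachable source. Anything tighter than that, such as performance guarantees or security policy, still has to be handled system by system.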
Of course, one can retreat to saying “OK, but how about partly-universal, in line with the quasi-EDWs many enterprises have”? On that basis, I think some of the ideas of the “Logical Data Warehouse” will hold up, for example the ones that amount to glorified MDM (Master Data Management), and probably some of the ILM (Information Lifecycle Management) ones as well. The kind of low-level “Let’s build a mini-Facebook to keep track of and talk about our data stores” collaboration that Oliver open-sourced on his way out of eBay — and that seems to be part of Greenplum Chorus too — could also succeed.
But if you’re looking for some kind of logical/virtual Grand Data Unification, well, that won’t work any better than any other Grand Data Unification idea has over the past 40 years.