Greenplum is making two product announcements this morning. Greenplum 4.0 is a revision of the core Greenplum database technology. In addition, Greenplum is announcing Greenplum Chorus, which is the first product release instantiating last year’s EDC (Enterprise Data Cloud) vision statement and marketing campaign.
Greenplum 4.0 highlights and related observations include:
- For the most part, Greenplum 4.0 is focused on general robustness catch-up and Bottleneck Whack-A-Mole, much like the latest releases from fellow analytic DBMS vendors Vertica and Aster Data.
- Greenplum has switched its replication approach from logical (execute transactions against two different disks) to block-level (just ship over the blocks that were changed by the original transaction). This seems to increase a Greenplum database’s robustness/performance/uptime in the face of disk/node failure. It also gives Greenplum an ongoing performance advantage, in that data only has to be compressed once for both disk writes, rather than separately for each copy.
- The Greenplum DBMS now has something called “tablespaces,” which sounds as if it extends Greenplum’s “polymorphic storage” to accommodate different kinds of storage device. Everybody has to do this, and for the most part everybody is doing it, e.g. Teradata and Sybase. At least for now, you need to have the same mix of storage technology at every Greenplum node. That said, while Greenplum’s customers will surely want solid-state storage in the future, that’s not yet a major issue.
- The timetable on Greenplum 4.0 is a salami-thin-slicer’s delight:
- Greenplum 4.0 has been used in POCs (Proofs of Concept) for a while.
- Greenplum 4.0 has been in early access for a few weeks.
- Greenplum 4.0 controlled availability is planned for the end of April.
- Greenplum 4.0 general availability is planned around the end of May or early June.
- (Note: Everything in Greenplum 4.0 has been built and is undergoing QA.)
- Greenplum has put together a nice list of big-name customers, including Fox/MySpace, eBay, Sears, and T-Mobile. While Fox/MySpace never got to the predicted 1-petabyte level of user data, T-Mobile is loosely projected to indeed get there. The same 1-petabyte projection is made more confidently about another Greenplum telecom customer (unnamed), which seems to be in the process of acquiring a 300-node Greenplum system.
The really interesting part of this announcement, however, is Greenplum Chorus. Greenplum agrees with my assertion that Greenplum Chorus is a new kind of data integration/ETL technology. In particular, Greenplum Chorus is designed around a stance I agree with, namely that it’s unrealistic to put everything into a single enterprise data warehouse (EDW); you need to manage data marts as well, preferably in a coordinated way. Mainstream data integration/ETL (Extract/Transform/Load) vendors such as Informatica or Ab Initio would surely say “That’s often quite true, and our technology can handle such scenarios just as it handles single-EDW-data-sink environments.” But Greenplum Chorus offers three capabilities not generally found in traditional data integration products (and offers only those three capabilities), namely:
- Spin out data marts, whether by recopying the data or by creating a virtual data mart inside another data warehouse/mart.
- Find/discover data in databases across your enterprise.
- Do social networking around databases/data marts.
Greenplum Chorus is heading into early access soon, with general availability slated around midyear. Also in the mix is a Greenplum “Hypervisor” that can somehow relate to an almost unlimited number of nodes or databases; however, I didn’t get a lot of details on the Greenplum Hypervisor technology or on the target dates for delivering and integrating the Hypervisor with other parts of Greenplum’s technology.
When Greenplum first talked about the enterprise data cloud (EDC) idea, it emphasized the spinning out of physical data marts in an easy way, as opposed to the virtual data marts pushed by Oliver Ratzesberger and Teradata. Greenplum Chorus, however, supports both kinds (as, at least directionally, does Teradata), specifically letting you choose between:
- “Independent sandboxes” – physical copies of the data, in a separate Greenplum database instance.
- “Satellite sandboxes” – virtual data marts, of course managed by the same Greenplum database instance.
Actually, if you want to recopy data in the same Greenplum database instance, you can do that too, via something called “data sets,” but that’s not the main focus. Either option, I presume, can be configured to provide either or both of the two main benefits of spun-out data marts, namely:
- Control over the performance and SLAs (Service-Level Agreements) of your analytic workload
- Ability to mix in new raw data and/or new aggregations
in either case without messing up the performance, SLAs, security, or “one truth-ness” of the existing database.
To provide those capabilities in an analytic DBMS, you need sufficiently robust parallel data movement (for the physical sandboxes) and workload management (for the virtual ones). Greenplum obviously believes it has both. Teradata makes the same claim. Other vendors would make similar assertions, and presumably will offer similar capabilities soon. You also want some kind of ability to ingest data from foreign databases, but that can be pretty routine stuff; e.g., in Release 1 of Chorus, Greenplum is content to offer ODBC access to Oracle, SQL Server, et al.
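That kind of routine ODBC ingestion can be sketched in a few lines. Below is a minimal, hypothetical Python illustration of the general pull-and-load pattern, using DB-API-style connections such as the pyodbc library provides; the DSN names, table names, and batch size are all invented for illustration, and nothing here is Chorus’s actual API:

```python
# Hypothetical sketch: stream rows from a foreign database over an
# ODBC-style connection and load them into a target warehouse in
# batches. Connection details and identifiers are illustrative only.
from itertools import islice

def batched(rows, batch_size):
    """Yield lists of up to batch_size rows from any row iterator."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def copy_table(src_conn, dst_conn, src_table, dst_table, batch_size=1000):
    """Stream src_table's rows into dst_table, committing per batch."""
    src_cur = src_conn.cursor()
    dst_cur = dst_conn.cursor()
    src_cur.execute(f"SELECT * FROM {src_table}")
    ncols = len(src_cur.description)
    # "?" is the qmark placeholder style used by pyodbc (and sqlite3).
    placeholders = ", ".join(["?"] * ncols)
    insert_sql = f"INSERT INTO {dst_table} VALUES ({placeholders})"
    for batch in batched(src_cur, batch_size):
        dst_cur.executemany(insert_sql, batch)
        dst_conn.commit()

# In practice the connections would come from an ODBC bridge, e.g.:
#   import pyodbc
#   src_conn = pyodbc.connect("DSN=OracleSource")   # hypothetical DSN
#   dst_conn = pyodbc.connect("DSN=GreenplumMart")  # hypothetical DSN
```

The point is simply that ODBC-level extraction is commodity plumbing: any DB-API-compatible source works unchanged, which is presumably why Greenplum is content with it for Chorus Release 1.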
The “data discovery” and “social networking” aspects of Greenplum Chorus seem to be quite Release 1 as well. Basically, Greenplum lets people post discussion threads about databases and data marts, discussing what value can be derived from them. I guess somebody could include links to web-technology reports based on those databases, but otherwise there’s no integration with business intelligence tools and their collaboration capabilities. Even so, Greenplum reports that business executives liked this capability in early access testing.
Greenplum Chorus is ETL without a lot of T, and without a lot of performance optimizations either. That may not be much of a problem in its paradigmatic use case, spinning out a data mart quickly for some analysis to see whether valuable conclusions can be drawn. Presumably, in the most successful cases, business and technical processes would emerge after the fact to pipe up-to-date versions of that analysis into operational systems, mooting any ETL deficiencies in the initial exploration. In a world where “data exploration” is becoming an increasingly important concept, something like Greenplum Chorus may suffice to provide significant customer value. But whether Greenplum Chorus’s capabilities are eventually co-opted by more fully featured data integration suites remains an open question.