If Mike Stonebraker is to be believed, the era of columnar data stores is upon us.
Whether or not you buy completely into Mike’s claims, there certainly are cool ideas in his latest columnar offering, from startup Vertica Systems. The Vertica corporate site offers little detail, but Mike tells me that the product’s architecture closely resembles that of C-Store, which is described in this November 2005 paper.
The core ideas behind Vertica’s product are as follows.
- Data warehouse queries only need to retrieve the data in certain columns from disk. Therefore, storing data in columns reduces I/O.
- But a pure column store is hard to update in real time, and data warehouses need real-time updates (both for “real-time” uses and simply for error correction). Hence, there is a small (1 gigabyte or so) conventional row store to receive updates; it lives in main memory, and hence is super-fast, and its contents are periodically bulk-moved to the column store. (That’s not how the paper says C-Store was architected, but it seems to be one of the things that got changed for the commercial Vertica implementation.)
- Timestamps are used for inserts and deletes; data is never changed in place, so an update amounts to a delete plus an insert. (Without that kind of approach, the update strategy in the previous point couldn’t be viable.) A big benefit to these timestamps is that you can assure integrity via “snapshot isolation”; i.e., by a virtual rollback to a recent point in time. Thus, Vertica can get away without any kind of locks or, for that matter, transaction/redo logs. Row-oriented Netezza uses a similar logless, lockless approach.
- Columnar data stores lend themselves to aggressive compression. After all, most sophisticated compression techniques depend upon deltas (or lack of delta!) vs. other values in the same column. And compression works a lot better when the column itself is sorted. Vertica’s compression is carried straight through into query processing. One benefit: it allows for greater use of the on-processor Level 2 cache. (Efficient use of Level 2 cache gets mentioned to me a lot these days …)
- Data is stored in overlapping projection views, each of which is sorted by at least one of the columns in the view. Presorting obviously helps with query performance. Of course, this redundancy carries a penalty at load or update time. But the same is true of conventional RDBMS’s indices and, yes, materialized views.
- Data is partitioned “horizontally,” in a shared-nothing environment. I.e., different “rows” go to different nodes. Queries are resolved on each node, and the result sets are combined centrally, with no attempt to ship intermediate results from node to node. Despite the experience of other shared-nothing data warehouse vendors that this approach leads to bottlenecks, Mike is confident it works fine in Vertica’s case.
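To make the first bullet concrete, here’s a toy sketch (in Python, with made-up table and column names; nothing here reflects Vertica’s actual storage format) of why a column layout cuts I/O for a typical warehouse query:

```python
# Hypothetical sketch of columnar vs. row storage. A row store must read every
# record whole; a column store reads only the arrays the query touches.

rows = [
    {"order_id": i, "customer": "c%02d" % (i % 100),
     "region": "east" if i % 2 else "west",
     "amount": float(i), "shipped": i % 3 == 0}
    for i in range(1000)
]

# Column store: one array per column. A query touching 2 of 5 columns
# scans 2 arrays instead of all 1000 five-field records.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# "SELECT SUM(amount) WHERE region = 'east'" touches only two columns:
total = sum(a for a, reg in zip(columns["amount"], columns["region"])
            if reg == "east")
```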
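The small-row-store-feeding-the-column-store idea from the second bullet can be sketched as follows; the class, the flush threshold, and the trigger-on-insert policy are all my own illustrative inventions, not Vertica’s actual tuple mover:

```python
# Toy sketch of a hybrid store: fast row-oriented inserts in memory,
# periodically bulk-moved into a read-optimized columnar form.

class HybridStore:
    def __init__(self, flush_threshold=3):
        self.write_buffer = []      # in-memory, row-oriented: cheap inserts
        self.column_store = {}      # read-optimized: one list per column
        self.flush_threshold = flush_threshold

    def insert(self, row):
        self.write_buffer.append(row)
        if len(self.write_buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # The periodic bulk move: rewrite buffered rows into columnar form.
        for row in self.write_buffer:
            for col, val in row.items():
                self.column_store.setdefault(col, []).append(val)
        self.write_buffer.clear()

store = HybridStore()
for i in range(7):
    store.insert({"id": i, "amount": i * 10})
# Six rows have been flushed into columns; one still sits in the write buffer.
```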
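The timestamp/snapshot-isolation bullet can also be sketched in a few lines. This is a generic illustration of timestamp-based visibility, not Vertica’s implementation; the class and field names are hypothetical:

```python
# Sketch of snapshot isolation via insert/delete timestamps. Rows are never
# modified in place; a delete just stamps a row, so readers need no locks.

INF = float("inf")

class VersionedTable:
    def __init__(self):
        self.rows = []    # [insert_ts, delete_ts, payload]
        self.clock = 0

    def insert(self, payload):
        self.clock += 1
        self.rows.append([self.clock, INF, payload])

    def delete(self, key):
        self.clock += 1
        for row in self.rows:
            if row[2]["key"] == key and row[1] == INF:
                row[1] = self.clock   # mark deleted; the bytes stay put

    def snapshot(self, ts):
        # A reader "as of" ts sees rows inserted at or before ts and
        # not yet deleted as of ts -- a virtual rollback to that moment.
        return [r[2] for r in self.rows if r[0] <= ts < r[1]]

t = VersionedTable()
t.insert({"key": "a", "qty": 1})   # ts 1
t.insert({"key": "b", "qty": 2})   # ts 2
t.delete("a")                      # ts 3
# A query as of ts 2 still sees row "a"; a query at ts 3 does not.
```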
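On the compression bullet: run-length encoding is the textbook example of why sorted columns compress so well, and of how a query can operate on the compressed form without ever decompressing it. A minimal sketch (my example, not Vertica’s actual encoding scheme):

```python
# Run-length encode a sorted column, then answer an aggregate straight from
# the runs -- the "compression carried into query processing" idea.

from itertools import groupby

def rle_encode(sorted_values):
    return [(v, len(list(g))) for v, g in groupby(sorted_values)]

column = sorted(["east"] * 4000 + ["west"] * 6000)
runs = rle_encode(column)        # two tuples stand in for 10,000 values

# A COUNT ... GROUP BY over this column never touches the raw values:
counts = {value: length for value, length in runs}
```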
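The overlapping-projections bullet boils down to: keep redundant copies of the data, each presorted on a different column, and let the query pick the copy whose sort order turns its predicate into a cheap range scan. A hypothetical sketch (column choices and the `range_scan` helper are mine):

```python
# Two redundant "projections" of the same rows, each sorted differently.
# Loads pay for the redundancy; queries cash it in via binary search.

import bisect

data = [("c%03d" % (i % 50), i % 7, float(i)) for i in range(1000)]

projections = {
    "customer": sorted(data, key=lambda r: r[0]),
    "weekday": sorted(data, key=lambda r: r[1]),
}

def range_scan(sort_col, col_index, lo, hi):
    # Use the projection presorted on the predicate's column.
    rows = projections[sort_col]
    keys = [r[col_index] for r in rows]
    return rows[bisect.bisect_left(keys, lo):bisect.bisect_right(keys, hi)]

# An equality predicate on weekday uses the weekday-sorted projection:
mondays = range_scan("weekday", 1, 1, 1)
```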
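Finally, the shared-nothing bullet: rows scatter across nodes, each node resolves the query on its own slice, and a coordinator combines the partial results, with no intermediate shipping between nodes. A toy single-process sketch, with lists standing in for nodes (my simplification, obviously not a distributed system):

```python
# Hash-partition rows across "nodes", aggregate locally, combine centrally.

N_NODES = 4
nodes = [[] for _ in range(N_NODES)]

for order_id in range(10000):
    row = {"order_id": order_id,
           "region": ["east", "west"][order_id % 2],
           "amount": 1.0}
    nodes[hash(order_id) % N_NODES].append(row)   # scatter

def local_query(rows):
    # Each node answers "SUM(amount) GROUP BY region" on its slice alone.
    out = {}
    for r in rows:
        out[r["region"]] = out.get(r["region"], 0.0) + r["amount"]
    return out

# The coordinator merges partial results -- no node-to-node traffic.
combined = {}
for partial in map(local_query, nodes):
    for region, subtotal in partial.items():
        combined[region] = combined.get(region, 0.0) + subtotal
```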
Obviously, my post title was exaggerated; nobody, including Mike, thinks row-oriented data stores are obsolete for OLTP. But what about data warehousing? Will an approach like Vertica’s eventually win versus, say, the shared-nothing row-oriented RDBMS leaders (that would be some combination of IBM, Teradata, Netezza, and DATAllegro, depending on what you mean by “leader”)? Well, apparently Vertica has a bunch of tests going on, at database sizes from the low 100s of gigabytes to the low 10s of terabytes. And of course they have those great-looking benchmark results, for which they swear they tuned competitors’ products with passionate care.
If I have to make an early guess, I’d say that the success of columnar systems will depend in no small part on what kind of data warehouse applications we’re talking about. Referencing a taxonomy I previously posted:
- Pinpoint data lookup doesn’t seem like a great fit for columnar systems. Indeed, traditional rows-and-B-trees would seem to be best.
- Constrained query and reporting would seem to be a sweet spot, even though it’s a sweet spot for some of the best competition as well.
- Cube-filling calculations involve big intermediate result sets. I’m not sure that’s a great fit for columnar systems.
- Hardcore tabular data crunching would seem in many cases to be another sweet spot, again against a lot of competition, at least in some of its sub-categories.
- Text and media search are best done by specialized systems that, at least in the case of text, wind up being quasi-columnar. The same goes for other specialty areas. Systems like Vertica’s have nothing to offer directly to these applications. However, it might be possible for Vertica to integrate with them fairly quickly, given that they’re starting from vaguely similar philosophical roots.