Mike Stonebraker has a post up on Vertica’s blog trying to differentiate “real” from “pretend” column stores. (Edit: That post seems to have come back down, but as of 1/19 it can be found in Google Cache.) In essence, Mike argues that the One Right Way to design a column store is Vertica’s, a position that Daniel Abadi used to share but has since retreated from.
There are some good things about that post, and some not-so-good. The worst paragraph is probably
Several row-store vendors (including Oracle, Greenplum and Aster Data) now claim to be selling a column store. Obviously, this would require a complete rewrite of a DBMS to move from Figure 1 to Figure 2. Hence, none of the “pretenders” have actually done this. Instead all have implemented some aspects of column stores, and then claim to be the real thing. This blog defines what the “real enchilada” looks like, and how to tell it from the pretenders.
which I question on two levels. First, the vendors cited don’t actually claim to be selling a column store; thus, the whole premise of Mike’s post is incorrect. Second, neither those vendors nor Mike is really correct. What Mike is really doing is differentiating, in his opinion,* good column stores from bad or mediocre ones.
*That Mike’s opinion in that regard is neither (wholly) unreasonable nor (wholly) unbiased should go pretty much without saying.
A lesser oopsie is Mike’s criterion “IO-1”, which is written so confusingly that it technically seems not to be met by any of the vendors cited, including Vertica, which introduced Vertica FlexStore in mid-2009. And while I’m at it: Aster Data nCluster definitely meets criterion IO-3; I confirmed that by asking Tasso Argyros. Mike’s “No” for Sybase IQ on his criterion CPU-5 is also pretty questionable, given that Sybase IQ operates on compressed data until “the last possible moment.”
With the minor stuff cleared away, let’s get to the heart of the matter. Mike in essence concedes that multiple competitors can get the I/O benefits of a column store, even “aggressive compression.” However, he asserts that a designed-from-the-ground-up column store also can and should have major CPU advantages over row stores or row/column hybrids, for three reasons (as I paraphrase them):
- CPU-5 Good column stores operate on compressed data, while other DBMS decompress first.
- CPU-6 Good column stores benefit from storing data in multiple sort orders on disk, while other DBMS don’t.
- CPU-4 Good column stores have column-oriented inner loops, while other DBMS don’t.
Actually, I have my doubts about the competitive-comparison aspect of CPU-5. I think multiple DBMS that have dictionary/token compression, for example, operate on tokenized data in memory. I’ll confess to not having a current list memorized as to who does or doesn’t, but anyhow it’s a solvable technical problem. Also, as Tasso points out, if you use a bitmapped index you’re surely operating on compressed data.
On the other hand, the goodness of CPU-5 functionality is beyond reasonable dispute. For many queries (albeit by no means all), operating on compressed data is a major advantage.
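To make "operating on compressed data" concrete, here is a minimal sketch of dictionary/token compression in which an equality predicate is evaluated against integer tokens rather than decompressed values. It is an illustration only, not how any vendor named here implements it; the function names `dictionary_encode` and `count_equal` and the sample data are invented for the example.

```python
from bisect import bisect_left

def dictionary_encode(values):
    """Return (sorted dictionary, list of integer tokens) for one column."""
    dictionary = sorted(set(values))
    token_of = {value: token for token, value in enumerate(dictionary)}
    return dictionary, [token_of[value] for value in values]

def count_equal(dictionary, tokens, wanted):
    """Count matching rows by comparing tokens; no row is ever decoded."""
    # Translate the predicate constant into a token once, up front.
    pos = bisect_left(dictionary, wanted)
    if pos == len(dictionary) or dictionary[pos] != wanted:
        return 0  # the value does not occur anywhere in the column
    # The per-row work is a plain integer comparison.
    return sum(1 for token in tokens if token == pos)

# Hypothetical sample column.
states = ["CA", "NY", "CA", "TX", "NY", "CA"]
dictionary, tokens = dictionary_encode(states)
matches = count_equal(dictionary, tokens, "CA")
```

The point of the sketch is that the string comparison happens once, when the constant is looked up in the dictionary, while the per-row loop touches only small integers. A query that needs the actual values back (a LIKE over a large dictionary, say) benefits much less, which is one reason the advantage applies to many queries but by no means all.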
For CPU-6, things are the other way around. Vertica is probably alone in the flexibility of how it orders columns on disk. Any other system I can think of is generally restricted to two storage orders at most — e.g., some kind of universal ID/row-ID, plus a sort on the actual values of the column. But is this a significant advantage at all?
Competitors like to argue that storing even in sort-by-value order is not advantageous at all, because of the overhead at data loading time, and the questionable number of queries that benefit. That extreme seems overstated. Why would the overhead be higher than that to, for example, maintain a b-tree index? And surely queries try to pick out specific values and/or value ranges, for a significant fraction of all columns.
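As a rough illustration of why a sort-by-value copy can help, the sketch below answers a range predicate with two binary searches over a sorted copy of a column, and compares that with a full scan of the row-ID-ordered copy. It is a toy under stated assumptions, not a description of any product's storage layout; `build_sorted_column`, `range_row_ids`, `scan_row_ids` and the sample data are all invented for the example.

```python
from bisect import bisect_left, bisect_right

def build_sorted_column(values):
    """Build a second copy of the column, ordered by value, with row IDs."""
    pairs = sorted((value, row_id) for row_id, value in enumerate(values))
    return [value for value, _ in pairs], [row_id for _, row_id in pairs]

def range_row_ids(sorted_values, row_ids, low, high):
    """Find rows with low <= value <= high using two binary searches."""
    start = bisect_left(sorted_values, low)
    stop = bisect_right(sorted_values, high)
    return sorted(row_ids[start:stop])

def scan_row_ids(values, low, high):
    """Same predicate against the row-ID-ordered copy: touches every row."""
    return [row_id for row_id, value in enumerate(values) if low <= value <= high]

# Hypothetical sample column.
prices = [30, 10, 50, 20, 40, 20]
sorted_values, row_ids = build_sorted_column(prices)
from_sorted = range_row_ids(sorted_values, row_ids, 20, 30)
from_scan = scan_row_ids(prices, 20, 30)
```

Both paths return the same row IDs; the sorted copy just finds them without visiting rows outside the range. The cost is the one the competitors point to: the sort in `build_sorted_column` has to be paid (or maintained incrementally) at load time, much as a b-tree index would be.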
On the other hand, total flexibility in storage sort order might require yet more overhead, and would also be of rarer benefit. And while Vertica claims to have fixed a prior drawback to the feature — administrative complexity — in Vertica 4.0, I don’t have hard facts as to how complete the fix really is.
As for the CPU-4 inner loop point — I must confess to not knowing much about it.