Column-store proponents tend to argue, in effect, that the only reason to implement an analytic DBMS with row-based storage is laziness. Their case generally runs along these lines:
- Analytic queries commonly return only a fraction of all possible columns.
- Reading only the columns needed:
  - Saves I/O
  - Saves cache space
  - Reduces processing
  - Facilitates compression
- Presumably all those row-based MPP vendors just went row-based because they had a fine row-based DBMS (usually but not always PostgreSQL) to build on.
Pushbacks to this argument from row-based vendors include:
- Yes, but it’s harder to update a column store
- Yes, but there are more steps to retrieving a bunch of columns than there are to retrieving the same information from row stores
plus generous dollops of:
- We’re doing just fine, thank you
- We’re not seeing column stores much in the marketplace
- Don’t believe all that academic hype
- Column stores reek of elderberries, and are powered by hamster wheels
(OK, I made that last one up, but I do hear the other claims frequently.)
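To put rough numbers on both sides of the argument, here is a back-of-the-envelope sketch. The table shape (10 columns of 8 bytes each, 1,000,000 rows) and all function names are made up for illustration, and the model deliberately ignores compression, caching, and block granularity, so treat it as a cartoon rather than a benchmark.

```python
# Hypothetical table shape, chosen only for illustration.
NUM_COLUMNS = 10
BYTES_PER_VALUE = 8
NUM_ROWS = 1_000_000


def row_store_scan_bytes(columns_needed: int) -> int:
    """Bytes read for a full scan when whole rows are stored together.

    The whole row comes off disk regardless of how many columns the
    query wants, so columns_needed does not affect the result.
    """
    return NUM_ROWS * NUM_COLUMNS * BYTES_PER_VALUE


def column_store_scan_bytes(columns_needed: int) -> int:
    """Bytes read for a full scan when each column is stored separately."""
    return NUM_ROWS * columns_needed * BYTES_PER_VALUE


def reads_to_fetch_one_row(columns_needed: int, columnar: bool) -> int:
    """Separate storage locations visited to assemble one row's values.

    This is the row-store vendors' pushback: a column store has to
    visit one location per column it needs.
    """
    return columns_needed if columnar else 1


if __name__ == "__main__":
    for needed in (1, 3, 10):
        print(
            needed,
            row_store_scan_bytes(needed),
            column_store_scan_bytes(needed),
            reads_to_fetch_one_row(needed, columnar=True),
        )
```

Under these assumptions the column store's scan advantage shrinks as a query touches more columns and disappears when it touches all of them, while its per-row retrieval cost grows.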
However, there are at least two ways in which row- and column-stores are beginning to come together. First, there are lots of rumors about row-store vendors bringing out column-store options, even beyond the recent Ingres/VectorWise announcement. (But anything I may know about same beyond noticing the rumors fly by is surely under NDA.) Second, column-store vendors Vertica and VectorWise are bringing out a kind of row/column hybrid storage option.
Vertica 3.5 introduces what Vertica calls “FlexStore.” A key part of FlexStore is the ability to store data not just in pure columnar format, but also to group columns together in what amounts to sub-rows. This is advantageous when the grouped columns are usually retrieved together and, I presume, when they are updated together. The tradeoff is giving up column stores’ compression advantages for the grouped columns, and use of this feature is not recommended for columns that are frequently retrieved independently. Vertica also notes that since it typically uses 1 megabyte block sizes, any table smaller than that shouldn’t be broken into columns at all.
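The column-grouping tradeoff can be sketched in a few lines. This is not Vertica's API or storage format; the layouts, column names, and helper functions below are all hypothetical, and the model counts only how many storage groups a query touches and how many values it drags in per row.

```python
from typing import List, Set

# A layout is a list of column groups; each group is stored together.
Layout = List[List[str]]

# Hypothetical layouts for a four-column table.
PURE_COLUMNAR: Layout = [["id"], ["price"], ["qty"], ["region"]]
GROUPED: Layout = [["id"], ["price", "qty"], ["region"]]


def groups_touched(layout: Layout, query_columns: Set[str]) -> int:
    """Number of separately stored groups the query has to read."""
    return sum(1 for group in layout if query_columns & set(group))


def values_read_per_row(layout: Layout, query_columns: Set[str]) -> int:
    """Values pulled in per row, including unwanted ones that share a group."""
    return sum(len(group) for group in layout if query_columns & set(group))


if __name__ == "__main__":
    together = {"price", "qty"}
    alone = {"price"}
    for name, layout in (("columnar", PURE_COLUMNAR), ("grouped", GROUPED)):
        print(
            name,
            groups_touched(layout, together),
            values_read_per_row(layout, together),
            values_read_per_row(layout, alone),
        )
```

In this toy model, grouping `price` with `qty` halves the number of groups touched when both are requested, but doubles the values read when `price` is requested on its own, which matches the advice not to group columns that are frequently retrieved independently.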
VectorWise, of course, doesn’t have a product right now, but has gotten a bunch of recent publicity around the column-store product it plans to ship via its partner Ingres in 2010. When I asked Peter Boncz about row/column hybridization inside VectorWise (not federating between Ingres and VectorWise, but rather truly within VectorWise), he said one of the storage options was PAX, and pointed me at a 2001 paper by a group of academics that includes the ubiquitous Dave DeWitt. PAX turns out to stand, in creative spelling, for Partition Attributes Across.
The PAX idea is to store as many rows of data as can fit into a block, but within the block store them in columns. This preserves some of the compression and cache-efficiency benefits of column stores, while also bringing back whole rows in a single step. (I think Vertica’s FlexStore does something similar to this, but I’m not sure.)
Further confusing things, Boncz also told me VectorWise can support “any hybrid” of columnar storage and PAX.
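For readers who want to see the PAX layout concretely, here is a minimal in-memory sketch. It is not VectorWise's implementation, and it is much simpler than the layout in the 2001 paper: the page is just a dictionary of per-column lists, the page capacity is counted in rows rather than bytes, and every name in it is hypothetical.

```python
from typing import Any, Dict, List, Sequence, Tuple

Row = Tuple[Any, ...]
Page = Dict[str, List[Any]]  # column name -> that column's values in this page


def build_pax_pages(
    rows: Sequence[Row], column_names: Sequence[str], rows_per_page: int
) -> List[Page]:
    """Split rows into pages, then store each page column by column."""
    pages: List[Page] = []
    for start in range(0, len(rows), rows_per_page):
        chunk = rows[start:start + rows_per_page]
        pages.append(
            {name: [row[i] for row in chunk] for i, name in enumerate(column_names)}
        )
    return pages


def read_row(
    pages: List[Page], column_names: Sequence[str], rows_per_page: int, row_index: int
) -> Row:
    """Rebuild a whole row from a single page (one block read)."""
    page = pages[row_index // rows_per_page]
    offset = row_index % rows_per_page
    return tuple(page[name][offset] for name in column_names)


def scan_column(pages: List[Page], name: str) -> List[Any]:
    """Read one column across all pages, touching only that column's values."""
    return [value for page in pages for value in page[name]]


if __name__ == "__main__":
    columns = ["id", "name", "score"]
    data = [(1, "a", 1.0), (2, "b", 2.0), (3, "c", 3.0)]
    pages = build_pax_pages(data, columns, rows_per_page=2)
    print(pages)
    print(read_row(pages, columns, 2, 2))
    print(scan_column(pages, "name"))
```

The point of the sketch is that every value of a given row lives in one page, so reassembling the row needs only that page, while values of the same column still sit next to each other inside the page.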
Bottom line: The distinction between row- and column-stores isn’t going to go away any time soon, but it is at least beginning to blur a bit.