A consensus has evolved that:
- Columnar compression (i.e., value-based compression) compresses better than block-level compression (i.e., compression of bit strings).
- Columnar compression can be done pretty well in row stores.
Still somewhat controversial is the claim that:
- Columnar compression can be done even better in column stores than in row-based systems.
A strong plausibility argument for the latter point is that new in-memory analytic data stores tend to be columnar — think HANA or Platfora; compression is commonly cited as a big reason for the choice. (Another reason is that I/O bandwidth matters even when the I/O is from RAM, and there are further reasons yet.)
One group that made the in-memory columnar choice is the Spark/Shark guys at UC Berkeley’s AMP Lab. So when I talked with them Thursday (more on that another time, but it sounds like cool stuff), I took some time to ask why columnar stores are better at compression. In essence, they gave two reasons — simplicity, and speed of decompression.
In each case, the main supporting argument seemed to be that finding the values in a column is easier when they’re all together in a column store. That makes sense. I imagine the difference would be smallest if the row store had strictly fixed-length fields, with anything like a VARCHAR being tokenized down to a known length. In that case the database could be treated as an array — but you’d also have to wonder whether any significant row-based benefits still remained.
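To make the contiguity point concrete, here is a toy sketch (hypothetical data, plain Python lists standing in for storage pages): in a columnar layout a column's values sit together, while in a row store you must stride across every record to gather them.

```python
# Hypothetical toy table, stored two ways.
# Row store: one tuple per record.
rows = [(1, "NY", 10.0), (2, "CA", 12.5), (3, "NY", 10.0)]

# Column store: each column's values are already contiguous.
columns = {
    "id":    [1, 2, 3],
    "state": ["NY", "CA", "NY"],
    "price": [10.0, 12.5, 10.0],
}

# In the row store, a compressor looking at the "state" column has to
# skip over every other field in every record to find its values.
state_from_rows = [r[1] for r in rows]

assert state_from_rows == columns["state"]
```

A compressor scanning `columns["state"]` sees nothing but states, which is exactly the situation where value-based techniques shine.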
Let’s draw a further distinction between:
- Columnar compression techniques that depend primarily upon individual values — tokenization, prefix compression, etc.
- Columnar compression techniques that depend upon sequences of values, such as run-length encoding (RLE) or delta compression. (Note: Some people think this is the only kind that should be called “columnar”, but I disagree.)
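A minimal sketch of one technique from each bucket may help, with dictionary encoding standing in for tokenization and run-length encoding for the sequence-dependent kind (toy code, not any vendor's actual implementation):

```python
from itertools import groupby

def dictionary_encode(values):
    """Individual-value compression ('tokenization'): map each distinct
    value to a small integer token, storing the dictionary once."""
    dictionary = {}
    tokens = []
    for v in values:
        tokens.append(dictionary.setdefault(v, len(dictionary)))
    return dictionary, tokens

def run_length_encode(values):
    """Sequence-of-values compression: collapse each run of repeated
    values into a (value, run_length) pair."""
    return [(v, sum(1 for _ in group)) for v, group in groupby(values)]

states = ["NY", "NY", "NY", "CA", "CA", "NY"]
d, tokens = dictionary_encode(states)   # d = {"NY": 0, "CA": 1}
rle = run_length_encode(states)         # [("NY", 3), ("CA", 2), ("NY", 1)]
```

Note that RLE only pays off when identical values are adjacent, which is why it depends on the column being stored (or sorted) as a contiguous sequence.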
Row stores seem to do OK with individual-value compression, especially tokenization, albeit with some awkwardness. (For example, if cardinality forces you to change token length, that’s a bigger deal in a row store than a columnar system — which may be why DB2 caps the number of tokens at 4096.) But row stores rarely tackle the sequence-of-values stuff; the only counterexample I can think of is Netezza’s delta compression, and Netezza has FPGAs (Field-Programmable Gate Arrays) to speed the whole thing up.
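For completeness, here is what delta compression looks like in miniature (a generic sketch, not Netezza's actual scheme): store a starting value plus successive differences, which stay small when the column is sorted or clustered.

```python
def delta_encode(values):
    """Sequence-of-values compression: keep the first value, then the
    difference between each value and its predecessor."""
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas

def delta_decode(deltas):
    """Rebuild the original column by running a cumulative sum."""
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out

ids = [1000, 1001, 1003, 1003, 1010]
encoded = delta_encode(ids)            # [1000, 1, 2, 0, 7]
assert delta_decode(encoded) == ids
```

The small deltas can then be packed into far fewer bits than the raw values, which is where the actual space savings come from.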
Summing up, I think it’s fair to say:
- Column stores will almost always have a compression and/or compression performance advantage over row stores.
- Sometimes that advantage will be small.
- Sometimes it won’t be.
Related links:
- To counteract some marketing confusion, I explained the difference between columnar compression and columnar storage. (February, 2011)
- I outlined the Netezza and DB2 approaches to row-store columnar compression. (June, 2010 — before IBM bought Netezza)
- I commented on Mike Stonebraker’s thoughts regarding column stores and compression. (March, 2007 — when Vertica was just a pup)
- Oracle’s Hybrid Columnar Compression (September, 2009) is indeed a columnar architecture from the standpoint of compression, even though it’s very row-based from the standpoint of I/O.