In my opinion, the key part of Mike Stonebraker’s fascinating note on data compression was (emphasis mine):
> The standard wisdom in most row stores is to use block compression. Hence, a storage block is compressed using a single technique (say Lempel-Ziv or dictionary). The technique chosen then compresses all the attributes in all the columns which occur on the block. In contrast, Vertica compresses a storage block that only contains one attribute. Hence, it can use a different compression scheme for each attribute. Obviously a compression scheme that is type-specific will beat an implementation that is “one size fits all”.
>
> It is possible for a row store to use a type-specific compression scheme. However, if there are 50 attributes in a record, then it must remember the state for 50 type-specific implementations, and complexity increases significantly.
>
> In addition, all row stores we are familiar with decompress each storage block on access, so that the query executor processes uncompressed tuples. In contrast, the Vertica executor processes compressed tuples. This results in better L2 cache locality, less main memory copying and generally much better performance.
>
> Of course, any row store implementation can rewrite their executor to run on compressed data. However, this is a rewrite – and a lot of work.
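To make the quoted argument concrete, here is a toy sketch (my illustration, not Vertica's actual algorithms; all data and encoding choices are made up for the example) of per-column, type-specific encodings versus one generic compressor run over row-major blocks. It also shows the second point: an executor can answer some queries directly from the encoded form without decompressing.

```python
# Toy comparison: one-size-fits-all block compression vs. per-column encodings.
import zlib

N = 10_000
# A table with three columns of very different character:
# a sorted surrogate key, a low-cardinality string, and a high-entropy value.
ids = list(range(N))                                           # sorted integers
states = [("NY", "CA", "TX", "MA")[i % 4] for i in range(N)]   # 4 distinct values
vals = [(i * 2654435761) % 100_000 for i in range(N)]          # "random-ish" ints

# --- Row store: serialize tuples, compress the whole block with one scheme ---
rows = "\n".join(f"{i}|{s}|{v}" for i, s, v in zip(ids, states, vals))
row_block = zlib.compress(rows.encode(), 9)

# --- Column store: pick an encoding per column ---
# 1) Sorted ints: delta encoding, then run-length encode the deltas.
deltas = [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]
rle = []                                   # list of (delta_value, run_length)
for d in deltas:
    if rle and rle[-1][0] == d:
        rle[-1] = (d, rle[-1][1] + 1)
    else:
        rle.append((d, 1))
# 2) Low-cardinality strings: dictionary encoding down to 1-byte codes.
dictionary = {s: i for i, s in enumerate(sorted(set(states)))}
state_codes = bytes(dictionary[s] for s in states)
# 3) High-entropy column: fall back to the generic compressor.
val_block = zlib.compress(b"".join(v.to_bytes(4, "little") for v in vals), 9)

# Assume 8 bytes per RLE pair for the size estimate (a sketch, not a format).
col_size = (len(rle) * 8) + len(zlib.compress(state_codes, 9)) + len(val_block)
print(f"row-store block:  {len(row_block):>7,} bytes")
print(f"column encodings: {col_size:>7,} bytes")

# Executing on compressed data: counting rows per state needs only the
# dictionary codes, never the decompressed strings.
counts = {s: state_codes.count(code) for s, code in dictionary.items()}
print(counts["NY"])  # 2500
```

The sorted key collapses to two RLE pairs, and the 10,000 state strings become 10,000 highly repetitive bytes; only the genuinely high-entropy column needs the generic compressor. That per-column choice is exactly what a single block-level Lempel-Ziv pass cannot make.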
That’s a pretty compelling argument. But in theory, I can think of a number of ways for a row store vendor to trump it, including:
1. We agree, and we’ve built a whole set of specialized indices to have the same benefits.
2. (Similarly) We agree, but fortunately we have the money and talent to pull off this very hard development task.
3. That’s nice, but column stores have a natural disadvantage in updates, which matters a lot.
4. Yes, but we have a lot less internode data movement than a column store does.
#1 and #2 surely are not true at this time, as Mike points out. #3 is an area of active debate, and should perhaps be evaluated on an application-by-application basis. #4 is just something I’m throwing out there; it might or might not prove valid. (The idea behind it, by the way, is that vertical partitioning comes at the partial expense of other kinds of partitioning, such as horizontal partitioning on a join key, which can sometimes ensure that both sides of a join condition are satisfied by the data on a single node.)
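The idea behind #4 can be sketched quickly. This is purely my illustration, not any vendor's implementation: if both tables are hash-partitioned horizontally on the join key, every matching pair of rows lands on the same node, so the join needs no internode data movement at all. All names and data below are invented for the example.

```python
# Toy sketch of a co-located join via horizontal hash partitioning.
NODES = 4

def node_for(key):
    # Both tables use the same partitioning function on the join key.
    return hash(key) % NODES

orders = [(oid, oid % 50) for oid in range(1000)]        # (order_id, customer_id)
customers = [(cid, f"name-{cid}") for cid in range(50)]  # (customer_id, name)

# Route each row to a node by hashing the join key (customer_id).
order_parts = {n: [] for n in range(NODES)}
cust_parts = {n: [] for n in range(NODES)}
for oid, cid in orders:
    order_parts[node_for(cid)].append((oid, cid))
for cid, name in customers:
    cust_parts[node_for(cid)].append((cid, name))

# Each node can now join its local partitions with zero cross-node traffic.
joined = []
for n in range(NODES):
    names = dict(cust_parts[n])
    joined += [(oid, names[cid]) for oid, cid in order_parts[n]]

print(len(joined))  # 1000: every order found its customer locally
```

The trade-off hinted at in the parenthetical: a column store that has already partitioned the data vertically has less freedom to also lay it out this way, so it may have to ship rows between nodes where a co-partitioned row store would not.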
But one thing seems sure – unless the row store vendors also come up with great compression stories, Vertica or some other columnar outfit will beat the pants off them. Compression has arrived, big time.