I’m getting the increasing impression that certain industry observers, such as Gartner, are really confused about columnar technology. (I further suspect that certain vendors are encouraging this confusion, as vendors commonly do.) So here are some basic points.
A simple way to think about the difference between columnar storage and columnar (or any other kind of) compression is this:
- Columnar storage is a reference to how data is grouped together on disk (or in solid-state memory).
- (Columnar) compression is a reference to whether the actual data is on disk, or whether you save space by storing some smaller substitute for the actual data.
Specifically, if data in a relational table is grouped together according to what row it’s in, then the database manager is called “row-based” or a “row store.” If it’s grouped together according to what column it’s in, then the database management system is called “columnar” or a “column store.” Increasingly, row-based and columnar storage are being hybridized.
There are two main kinds of compression — compression of bit strings and more intelligent compression of actual data values. Compression of actual data values can reasonably be called “columnar,” in that different columns of data can be compressed in different ways, often depending only on the data in that column.*
The most important form of data value compression is commonly called tokenization, although the term dictionary compression is also used. Frequently, a column of data will have many repeated values — person names, city names, product IDs, product prices, etc. If so, these can be replaced by “tokens,” one per value, where the length of the token is just enough to account for all the unique values, and may be much shorter than the original data field. E.g., even if there are 1 million distinct values in a name field, tokenization could theoretically reduce 256 bytes down to 4.* A particular benefit of tokenization is that its compression can often be carried through into RAM, with joins being executed on tokenized columns.
*Admittedly, that extreme case assumes 256-byte character fields are stored fixed-length, which is ridiculous. There’s a reason for VARCHAR fields.
Where these ideas come together is that columnar compression usually works (even) better in column stores than in row stores. For example, if you’re using tokenization, and you just do your tokenization block-by-block, it will do more for you if there are more total values from the same column in each block — and that will happen when all the data in a single block comes from the same column.
Vendors that truly offer some form of hybrid row/column storage include Vertica, EMC/Greenplum, and Aster Data. Oracle Exadata, in my opinion, does not, but I can see why people might get confused and think otherwise. The tangled Oracle columnar story goes something like this:
- As per a white paper on Oracle Exadata’s “hybrid columnar compression,” Oracle Exadata clearly offers columnar compression.
- A diagram in that white paper shows data being partly grouped by column, in a PAX-like way. Oracle marketing sometimes tries to take “columnar” credit for that.
- That same white paper, as Daniel Abadi noted, contains the phrase “entire rows typically being retrieved with a single I/O.” That is NOT how column stores behave. A DBMS doesn’t need to have all the features advocated by Mike Stonebraker or Barry Zane to be a column store; but if it doesn’t have the fundamental feature of “brings back only the columns you want,” calling it “columnar” is pretty silly.
- Kevin Closson of Oracle graciously confessed, in a blog post and comment thread, that Oracle Exadata is not much of a column store.
- Oracle had a true columnar storage project going in 2009, although I don’t know its status. Also, Oracle evidently came pretty close to buying Vertica at one point.