February 6, 2011

Columnar compression vs. column storage

I’m getting the increasing impression that certain industry observers, such as Gartner, are really confused about columnar technology. (I further suspect that certain vendors are encouraging this confusion, as vendors commonly do.) So here are some basic points.

A simple way to think about the difference between columnar storage and columnar (or any other kind of) compression is this:

Specifically, if data in a relational table is grouped together according to what row it’s in, then the database manager is called “row-based” or a “row store.” If it’s grouped together according to what column it’s in, then the database management system is called “columnar” or a “column store.” Increasingly, row-based and columnar storage are being hybridized.

There are two main kinds of compression — compression of bit strings and more intelligent compression of actual data values. Compression of actual data values can reasonably be called “columnar,” in that different columns of data can be compressed in different ways, often depending only on the data in that column.* 

*In a variant, a few columns can be compressed together, e.g. the various parts of an address. IBM DB2 and RainStor both do this.

The most important form of data value compression is commonly called tokenization, although the term dictionary compression is also used. Frequently, a column of data will have many repeated values — person names, city names, product IDs, product prices, etc. If so, these can be replaced by “tokens,” one per value, where the length of the token is just enough to account for all the unique values, and may be much shorter than the original data field. E.g., even if there are 1 million distinct values in a name field, tokenization could theoretically reduce 256 bytes down to 4.* A particular benefit of tokenization is that its compression can often be carried through into RAM, with joins being executed on tokenized columns.

*Admittedly, that extreme case assumes 256-byte character fields are stored fixed-length, which is ridiculous. There’s a reason for VARCHAR fields.

Where these ideas come together is that columnar compression usually works (even) better in column stores than in row stores. For example, if you’re using tokenization, and you just do your tokenization block-by-block, it will do more for you if there are more total values from the same column in each block — and that will happen when all the data in a single block comes from the same column.

Vendors that truly offer some form of hybrid row/column storage include Vertica, EMC/Greenplum, and Aster Data. Oracle Exadata, in my opinion, does not, but I can see why people might get confused and think otherwise. The tangled Oracle columnar story goes something like this:

Comments

21 Responses to “Columnar compression vs. column storage”

  1. Greg Rahn on February 6th, 2011 2:45 pm

    Curt-

    Would you share your criteria on what constitutes a “true” hybrid row/column storage solution, in your opinion, and how the listed vendors technologies satisfy those criteria? It seems difficult to say something is or is not something w/o having a clear definition to measure against.

  2. Curt Monash on February 6th, 2011 10:05 pm

    Greg,

    As you surely know, I’m not a big fan of precise category criteria. But if you’re not doing column-at-a-time I/O, you’re not a column store.

    Now, everybody has at least a kluge that lets you do something like that — materialized views if nothing else. But if, say, you’re schlepping along row IDs in every quasi-columnar I/O, that’s not much of a column store.

    And just to be clear — strong predicate pushdown doesn’t make something a column store. It’s just another way to try to capture one of columnar storage’s signature benefits. Exadata partisans might say “Hey, we have columnar compression plus predicate pushdown — not to mention bitmapped indices — so that meets many of the same needs and specs as column stores.” They might even be right, in which case the fact that Exadata isn’t a column store wouldn’t be very important. But that’s not the point I was addressing this time around.

    Now, when somebody like Greenplum or Aster puts out a true column store option and says “Well, we’ve checked off THAT box,” established column-store vendors are entitled to say “Bully for you — now just how mature is it?” But again, that’s not my topic this time around.

  3. Greg Rahn on February 6th, 2011 11:55 pm

    Curt-

    Thanks. To my knowledge I don’t think anyone is passing a feature named Exadata Hybrid Columnar Compression as a pure column store which it most certainly is not, however, even Daniel Abadi classifies Exadata Hybrid Columnar Compression as a hybrid in his writeup which specifically mentions PAX, so I’m wondering why you do not.

  4. Curt Monash on February 7th, 2011 12:47 am

    Greg,

    I’m not at liberty to go into full detail about everything I know about sales tactics, marketing tactics, and so on. So let’s just say:

    1. I think the “true columnar” meme is getting increased attention.

    2. Given that it is, I’d like people to at least make their categorizations CORRECTLY.

    If it were up to me, there would be objective third-party benchmarks that showed performance subject to various kinds of sensitivity analysis — database size, concurrent usage, and all sorts of schema kinds. One of the top distinctions IMO would then be whether or not the schema and workload were inherently column-friendly. It may not be necessary for a row-based system to be fully performance-competitive with a column-based one in the columnar sweet spot, but at least it should offer a respectable alternative.

    Absent such benchmarks, people use truly-columnar-I/O as a not entirely ridiculous indicator of fast query execution.

  5. Greg Rahn on February 7th, 2011 1:41 am

    Curt-

    It seems you missed my question in your response:
    Daniel Abadi classifies Exadata Hybrid Columnar Compression as a hybrid in his writeup which specifically mentions PAX. Why is it that you do not also agree with his classification?

  6. Curt Monash on February 7th, 2011 2:25 am

    Greg,

    Because I’m classifying for a different purpose than he is, making different ground rules appropriate.

    To be even blunter — people are pretending that “hybrid row-column store” in the Oracle sense is the same thing as “hybrid row-column store” in the Aster and Greenplum sense, and people are getting misled as a result.

    There’s a limit even to my bluntness, however, so I won’t spell out exactly who those “people” I’m referring to are.

  7. Greg Rahn on February 7th, 2011 12:13 pm

    Curt-

    I’m really uninterested in the non-technical side of this dialog, but I am still very interested for you to share what your “ground rules” are in this evaluation. For example, does a RDBMS need to have both pure row and pure columnar storage? Does it need to have both row and column operators? What exactly does it need, in your opinion?

  8. Curt Monash on February 7th, 2011 1:22 pm

    Greg,

    Different people have somewhat different interests. I imagine this observation is no surprise to you. ;)

  9. Greg Rahn on February 7th, 2011 3:43 pm

    Curt-

    Is that a refusal to share your evaluation criteria on which you have based your opinion? You have avoided responding to my direct request now in several comments. I’m ok with you stating you refuse to share your criteria, but then please do so.

  10. Curt Monash on February 7th, 2011 4:25 pm

    Greg,

    Based on the fact that you comment here, it is natural for me to think that you read this blog. If you do, it is natural for me to think that you know I don’t believe in precise definitions of product categories. Based on that, it is natural for me to think that you know there’s no answer to your question, and you’re just repeating it for the sake of being annoying.

    What am I missing?

  11. Greg Rahn on February 7th, 2011 4:58 pm

    Curt-

    There certainly is an answer to my question, it is not just for the sake of being annoying and my apologies for doing so. I’m looking for nothing more than some criteria which you used to come to your conclusion. Your conclusion is fine and all, and after all, it’s your opinion, but I think many people, especially me, find the details that were used to draw a conclusion more interesting than the conclusion itself. Also, sometimes people based conclusions on mis-information so including that makes for a more valuable discussion. It seems to me that open discussions are generally much more valuable than closed ones.

  12. Curt Monash on February 7th, 2011 10:38 pm

    Greg,

    All product category definitions, depending upon context, can become misleading in their application.

    If we want to assess the progress of database technology, in practice and academia alike, it is perfectly reasonable to refer to a PAX-like system as a hybrid row-column store, something Daniel Abadi and I both have done.

    But if we want to assess whether a product is likely to have columnar-like performance on column-friendly workloads, and if salespeople or marketers are implying that “hybrid” row-column systems have that benefit — then a stricter definition is called for.

    So there’s no one single definition.

    Yes I’m being vague. I gather that you would like me to be clearer. Would you and your Oracle colleagues also like me to be clearer about the second sentence of my blog post, in which I vaguely observed that vendors can deliberately encourage confusion? It’s something of a package deal.

  13. Anand Waishampayan on February 13th, 2011 9:43 pm

    I am surprised to see no mention of Sybase IQ a true columnar database. All storage and compression, and accesses are indeed at column level in this database.

    In a true columnar database rows are shredded into columns and each column is stored separately. A row is formed by association of column values for a row using a rowid. This is done even for composite index. I.e. Index data is stored by itself just like a column.

    Tokenization may be used in row-based as well as columnar databases. It may have nothing to do with database being columnar or not.

  14. Curt Monash on February 13th, 2011 10:14 pm

    Anand,

    Agreed both re Sybase IQ and tokenization.

  15. Mike Pilcher on February 14th, 2011 1:40 am

    SAND’s architecture is built on compression of bit strings. Our technique is to build and manipulate compressed and encoded bit strings (which we denote as BIT vectors), though the process to do this we don’t see as particularly hard, we believe doing so efficiently is a challenge and one we have overcome. How to leverage this technology, how often and in what way is a key part of compression. It is one thing to give someone the keys to a Ferrari, they need to know what to do once they get behind the wheel, how to use every last ounce of the power that is available, SAND does this automatically the database equivalence of intelligent predictive torque distribution. SAND’s architecture is designed using BIT vectors wherever possible. Used for much more than simple bitmap-indices.

    Regarding Tokenization SAND employs database-wide dictionaries or Domains that function in assigning a unique token to every value in the database. This tokenization delivers the compression you refer to. SAP refers to this as dictionary compression when describing the architecture of HANA. It is rewarding to the team at SAND to see the rest of the database world has finally recognized what we knew and developed back in the days of the DeLorean. Your comments about tokenized operations, such as join processing, is spot-on. SAND performs database operations using tokenized operations wherever possible, not just for joins. Using them for all forms of predicate evaluation, semi- joins, group-by/order-by, text search, and in the Domain structure itself.

  16. Some quick notes on HP-Vertica | DBMS 2 : DataBase Management System Services on February 14th, 2011 12:31 pm

    [...] arguably is the most mature of the modern column-store DBMS — i.e., the ones that don’t have their roots in bitmaps the way Sybase and SAND [...]

  17. Some quick notes on HP-Vertica | DBMS 2 : DataBase Management System Services on February 14th, 2011 12:31 pm

    [...] arguably is the most mature of the modern column-store DBMS — i.e., the ones that don’t have their roots in bitmaps the way Sybase and SAND [...]

  18. Comments on the analytic DBMS industry and Gartner’s Magic Quadrant for same : DBMS 2 : DataBase Management System Services on February 9th, 2012 4:18 am

    [...] row/columnar data management, a feature shared by Teradata and Vertica, among others, but not by Oracle, DB2, or [...]

  19. Do you need an analytic RDBMS? | DBMS 2 : DataBase Management System Services on November 9th, 2012 10:12 am

    [...] I/O bottlenecks via, for example, compression, columnar storage, and/or [...]

  20. Exadata Hybrid Columnar Compression (HCC) for (storage) dummies « Dirty Cache on February 13th, 2013 3:54 am
  21. IBM BLU | DBMS 2 : DataBase Management System Services on May 27th, 2013 1:13 am

    [...] And yes — that means Oracle is now the only major relational DBMS vendor left without a true columnar story. [...]

Leave a Reply




Feed: DBMS (database management system), DW (data warehousing), BI (business intelligence), and analytics technology Subscribe to the Monash Research feed via RSS or email:

Login

Search our blogs and white papers

Monash Research blogs

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.