December 2, 2012

Are column stores really better at compression?

A consensus has evolved that:

Still somewhat controversial is the claim that:

A strong plausibility argument for the latter point is that new in-memory analytic data stores tend to be columnar — think HANA or Platfora; compression is commonly cited as a big reason for the choice. (Another reason is that I/O bandwidth matters even when the I/O is from RAM, and there are further reasons yet.)

One group that made the in-memory columnar choice is the Spark/Shark guys at UC Berkeley’s AMP Lab. So when I talked with them Thursday (more on that another time, but it sounds like cool stuff), I took some time to ask why columnar stores are better at compression. In essence, they gave two reasons — simplicity, and speed of decompression.

In each case, the main supporting argument seemed to be that finding the values in a column is easier when they’re all together in a column store. That makes sense. I imagine the difference would be smallest if the row store had strictly fixed-length fields, with anything like a VARCHAR being tokenized down to a known length. In that case the database could be treated as an array — but you’d also have to wonder whether any significant row-based benefits still remained.
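The "database as an array" idea can be sketched concretely. In the toy layout below (all field names and values are invented for illustration), pulling one column out of a fixed-length row store is a strided scan, while in a column store the same values are already contiguous:

```python
# Hypothetical illustration: with strictly fixed-length fields, a row store
# can be treated as a flat array, so one column's values sit at a fixed stride.
ROW_WIDTH = 3  # fields per row (assumed layout)

# Row store: all columns interleaved, row after row.
row_major = [101, "ab", 5.0,
             102, "cd", 7.5,
             103, "ef", 9.9]

# Column store: each column's values stored contiguously.
col_major = {"id": [101, 102, 103],
             "code": ["ab", "cd", "ef"],
             "price": [5.0, 7.5, 9.9]}

# Extracting column 1 ("code") from the row store is a strided scan...
codes_from_rows = row_major[1::ROW_WIDTH]
# ...while in the column store it is already one contiguous run.
codes_from_cols = col_major["code"]

assert codes_from_rows == codes_from_cols == ["ab", "cd", "ef"]
```

Even when the strided scan is logically easy, it touches every row's bytes, which is part of why the row-based benefits become questionable in that scenario.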

Let’s draw a further distinction between:

Row stores seem to do OK with individual-value compression, especially tokenization, albeit with some awkwardness. (For example, if cardinality forces you to change token length, that’s a bigger deal in a row store than in a columnar system — which may be why DB2 caps the number of tokens at 4096.) But row stores rarely tackle the sequence-of-values stuff; the only counterexample I can think of is Netezza’s delta compression, and Netezza has FPGAs (Field-Programmable Gate Arrays) to speed the whole thing up.
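The two compression families can be sketched as follows (illustrative Python, not any vendor's actual implementation; the 4096-token budget mirrors the DB2 figure mentioned above):

```python
# Sketch of the two compression families: per-value tokenization
# (dictionary encoding) and sequence-of-values delta encoding.

def dictionary_encode(values, max_tokens=4096):
    """Replace each value with a small integer token; fail if cardinality
    exceeds the token budget (cf. the DB2-style cap discussed above)."""
    table = {}
    tokens = []
    for v in values:
        if v not in table:
            if len(table) >= max_tokens:
                raise ValueError("cardinality exceeds token budget")
            table[v] = len(table)
        tokens.append(table[v])
    return table, tokens

def delta_encode(numbers):
    """Store the first value plus successive differences; effective when
    adjacent values in a column are close together (e.g. sorted keys)."""
    return [numbers[0]] + [b - a for a, b in zip(numbers, numbers[1:])]

table, tokens = dictionary_encode(["NY", "CA", "NY", "NY", "CA"])
assert tokens == [0, 1, 0, 0, 1]
assert delta_encode([100, 101, 103, 103, 110]) == [100, 1, 2, 0, 7]
```

Tokenization works on one value at a time, so a row store can apply it in place; delta encoding only pays off when a run of values from the same column is physically adjacent, which is exactly what a columnar layout provides.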

Summing up, I think it’s fair to say:

10 Responses to “Are column stores really better at compression?”

  1. Reynold Xin on December 2nd, 2012 10:42 pm

    Excellent post, Curt. Columnar storage definitely makes compression easier and faster, for both on-disk data and in-memory data representations.

    In our project (Shark/Spark), another important reason we chose to go columnar is to avoid the in-memory object overhead in the Java virtual machine (JVM). This overhead takes two forms when very large datasets are stored in memory: object storage overhead and excessive garbage collection times.

    The JVM’s memory model dictates that every object carries 12 to 16 bytes of memory overhead. In a row-oriented format, each row is represented as one or more Java objects. When the rows themselves are small, this per-object overhead is substantial: using a row-based approach, 200 MB of uncompressed data can occupy up to 1 GB of space in memory, a 5X overhead!
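    A back-of-envelope check of these figures (all numbers below are assumptions for illustration; actual JVM header sizes and boxing patterns vary by JVM version and flags):

```python
# Illustrative arithmetic for per-object JVM overhead on small rows.
header_bytes = 16        # assumed per-object header cost
payload_bytes = 20       # assumed raw data in one small row
objects_per_row = 3      # assumed: the row object plus two boxed fields

raw = payload_bytes
heap = payload_bytes + objects_per_row * header_bytes
overhead_factor = heap / raw  # 3.4x with these numbers
```

With smaller payloads or more boxed fields per row, the factor climbs toward the 5X figure cited; a columnar format amortizes headers over whole arrays of values instead of paying them per row.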

    In addition to storage, there are processing implications due to garbage collection (GC), which increases linearly with respect to the number of objects in the heap. When many rows are being processed in a single heap, it can result in millions or even billions of Java objects. For a large heap (e.g. 32+GB), GC can take several minutes.

    In our column-oriented format, the space footprint is 1/3 to 1/5 of the row-oriented counterpart even without applying compression, and a full GC takes less than a second. Thus, a columnar approach lets us retain the benefits of the JVM while achieving compact and fast in-memory storage.

  2. nikhail on December 3rd, 2012 4:26 pm

    On the ‘speed of decompression’ point: a column store gives better L1/L2 CPU cache utilization than a row store when decompressing column data, and CPU data pre-fetches work better with a column store.

  3. Leigh Kenendy on December 3rd, 2012 5:49 pm

    I used to work for Sybase a long time ago. One of the IQ engineers explained to me that the compression was a side effect of the aim for better performance; it was an added bonus rather than the goal. As far as I know, Sybase IQ was first to market with this technology, and when I last used it a few years ago, no technology under $1 million came close. I assume that’s changed by now; all vendors seem to have adopted a similar approach.

  4. Kevin Closson on December 4th, 2012 4:31 pm

    >[HCC] is indeed a columnar architecture from the standpoint of compression, even though it’s very row-based from the standpoint of I/O.

    …I’m not sure I understand this point. Both column- and row-oriented data end up stuffed into blocks. Scanning pure columnar block data and hybrid columnar block data has, in my experience, the same physical I/O profile, albeit pure columnar suffers no wasted payload (unneeded columns). But, like I said, I don’t think I understand the (perhaps subtle) point you’re making, Curt. Can you clarify?

    …Also, the way I like to help people understand Oracle’s Hybrid Columnar Compression is to point out that Oracle’s engine is a row engine. Oracle groups adjacent data blocks into logical units called Compression Units. During bulk loading, Oracle more or less treats columns as “single-column rows” which get stuffed into Compression Units. It’s a simple way to grasp the concept.

  5. Curt Monash on December 4th, 2012 6:16 pm


    It is my understanding that, in Oracle HCC, if you retrieve the columns you want, you also have to retrieve the columns you don’t want. Is that accurate?

    If so, that’s more I/O than only retrieving the columns you want — potentially 10-100X more. And it could be similarly more use of cache, memory bandwidth, and so on.

  6. Kevin Closson on December 4th, 2012 6:38 pm

    Hi Curt,


    A read of a Compression Unit (generally 32KB) will have a number of rows and all of the columns of those rows, albeit re-oriented in the single-column-row manner I described. So, yes, every physical I/O includes unneeded data, unless the SQL statement is “select * from tab” or the user’s table has only a single column.

    To make it easier to picture, consider a row of 32KB with one of the columns being of type char(1). A select length(c) from the table will suffer just shy of 32KB of wasted physical I/O and memory buffering overhead. If that same compression unit were used in Oracle’s In-Memory Parallel Query, it would have to flow over the Exadata iDB interconnect (about 32KB wasted) and be installed into the SGA (about 32KB wasted).
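    That worst case can be put in numbers (the row count below is an assumption purely for illustration; the 32KB unit size is from the comment above):

```python
# Worked version of the worst case: a 32KB Compression Unit read to serve
# a query that needs only one char(1) column from each row.
cu_bytes = 32 * 1024           # Compression Unit size (~32KB)
rows_per_cu = 100              # assumed row count in the unit
needed_bytes = rows_per_cu * 1 # one char(1) value per row

wasted = cu_bytes - needed_bytes
wasted_fraction = wasted / cu_bytes  # about 0.997: just shy of 32KB wasted
```

Every read, interconnect transfer, and buffer-cache install of that unit pays roughly the same ~99.7% waste, which is the "just shy of 32KB" overhead per stage described above.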

    So with that I drew a worst-case scenario to help folks understand the functionality.

    The moral of the story (as you know) is that reducing effort is not the same as eliminating effort.

