March 24, 2007

Mike Stonebraker on database compression — comments

In my opinion, the key part of Mike Stonebraker’s fascinating note on data compression was (emphasis mine):

The standard wisdom in most row stores is to use block compression. Hence, a storage block is compressed using a single technique (say Lempel-Ziv or dictionary). The technique chosen then compresses all the attributes in all the columns which occur on the block. In contrast, Vertica compresses a storage block that only contains one attribute. Hence, it can use a different compression scheme for each attribute. Obviously a compression scheme that is type-specific will beat an implementation that is “one size fits all”.

It is possible for a row store to use a type-specific compression scheme. However, if there are 50 attributes in a record, then it must remember the state for 50 type-specific implementations, and complexity increases significantly.

In addition, all row stores we are familiar with decompress each storage block on access, so that the query executor processes uncompressed tuples. In contrast, the Vertica executor processes compressed tuples. This results in better L2 cache locality, less main memory copying and generally much better performance.

Of course, any row store implementation can rewrite their executor to run on compressed data. However, this is a rewrite – and a lot of work.
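To make Mike's two points concrete, here is a small illustrative sketch of my own (not Vertica's actual implementation): a type-specific compression scheme — run-length encoding, which works well on a sorted or clustered column — plus an executor-style aggregate that runs directly on the compressed runs, never materializing uncompressed tuples:

```python
# Illustrative sketch: run-length encode a sorted/clustered integer column,
# then compute an aggregate directly on the compressed representation.

def rle_encode(column):
    """Compress a column into [value, run_length] pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return runs

def sum_compressed(runs):
    """Aggregate without decompressing: each run contributes value * length."""
    return sum(v * n for v, n in runs)

column = [3, 3, 3, 3, 7, 7, 9, 9, 9]  # e.g. a sorted attribute
runs = rle_encode(column)             # [[3, 4], [7, 2], [9, 3]]
assert sum_compressed(runs) == sum(column)
```

The aggregate touches three runs instead of nine values, which is the cache-locality and copying win Mike describes; a generic block compressor (Lempel-Ziv over whole rows) would have to decompress before the executor could do anything.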

That’s a pretty compelling argument. But in theory, I can think of a number of ways for a row store vendor to trump it, including:

  1. We agree, and we’ve built a whole set of specialized indices to have the same benefits.
  2. (Similarly) We agree, but fortunately we have the money and talent to pull off this very hard development task.
  3. That’s nice, but column stores have a natural disadvantage in updates, which matters a lot.
  4. Yes, but we have a lot less internode data movement than a column store does.

#1 and #2 surely are not true at this time, as Mike points out. #3 is an area of active debate, and should perhaps be evaluated on an application-by-application basis. #4 is just something I’m throwing out there, which might or might not prove to be valid. (The idea behind it, by the way, is that vertical partitioning comes at the partial expense of other kinds of partitioning — kinds that can sometimes be used to ensure that both sides of a join are satisfied by data residing on the same node.)
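The idea behind #4 can be illustrated with a toy sketch (my own hypothetical, with made-up table names): if both tables in a join are hash-partitioned on the join key, matching rows land on the same node, so each node can join its partitions locally with no internode data movement:

```python
# Toy sketch of a co-located join: hash-partition both tables on the join
# key so that matching rows land on the same node.

NODES = 4

def partition(rows, key_index):
    """Assign each row to a node by hashing its join key."""
    parts = [[] for _ in range(NODES)]
    for row in rows:
        parts[hash(row[key_index]) % NODES].append(row)
    return parts

orders    = [(1, "widget"), (2, "gadget"), (3, "sprocket")]
customers = [(1, "Alice"), (2, "Bob"), (3, "Carol")]

order_parts    = partition(orders, 0)     # partition on customer id
customer_parts = partition(customers, 0)  # same key, same hash, same node

# Each node joins only its own partitions -- no data crosses nodes.
joined = []
for node in range(NODES):
    lookup = dict(customer_parts[node])
    for cust_id, item in order_parts[node]:
        if cust_id in lookup:
            joined.append((cust_id, item, lookup[cust_id]))

assert sorted(joined) == [(1, "widget", "Alice"),
                          (2, "gadget", "Bob"),
                          (3, "sprocket", "Carol")]
```

The open question in #4 is whether a column store, having already partitioned vertically, retains the same freedom to partition horizontally on join keys this way.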

But one thing seems sure – unless the row store vendors also come up with great compression stories, Vertica or some other columnar outfit will beat the pants off of them. Compression has arrived, big time.

Comments

8 Responses to “Mike Stonebraker on database compression — comments”

  1. DBMS2 — DataBase Management System Services»Blog Archive » Mike Stonebraker explains column-store data compression on March 24th, 2007 1:20 am

    [...] The following is by Mike Stonebraker, CTO of Vertica Systems, copyright 2007, as part of our ongoing discussion of data compression. My comments are in a separate post. [...]

  2. DBMS2 — DataBase Management System Services»Blog Archive » Will data compression change the hardware game? on March 24th, 2007 3:04 am

    [...] I’ve recently made a lot of posts about database compression. 3X or more compression is rapidly becoming standard; 5X+ is coming soon as processor power increases; 10X or more is not unrealistic. True, this applies mainly to data warehouses, but that’s where the big database growth is happening. And new kinds of data — geospatial, telemetry, document, video, whatever — are highly compressible as well. [...]

  3. Chuck on March 24th, 2007 1:13 pm

    #4 isn’t likely. Internode data movement has to do with what data is on which node, not how the data is organized within the node.

  4. Curt Monash on March 24th, 2007 3:07 pm

    Chuck,

    So Vertica does horizontal partitioning (random or range) to the same extent row stores do?

    CAM

  5. Peter Thawley on March 29th, 2007 6:59 pm

    Looks like Mike’s paid attention to what Sybase IQ (aka Expressway Technologies) developed in the early ’90s. Just look at the TPC-H FDRs for IQ versus Oracle if you want tangible proof of the financial impact of column-level compression, or the lack thereof.

    Peter Thawley

  6. Curt Monash on March 29th, 2007 7:31 pm

    Yeah. It’s a pity that you guys never parallelized IQ properly. It might be a fearsome competitor if you had.

    Do you have any thoughts on the implications for updates, loads, and while-compressed processing on the choices of bitmaps vs. other dictionary compression vs. delta compression vs. whatever?

    Thanks,

    CAM

  7. The core of the Vertica story still seems to be compression | DBMS2 -- DataBase Management System Services on July 4th, 2008 2:51 am

    [...] in March, I suggested that compression was a central and compelling aspect of Vertica’s story. Well, in their new blog, the Vertica guys now strongly reinforce that [...]

