June 21, 2010

The Netezza and IBM DB2 approaches to compression

Thursday, I spent 3 ½ hours talking with 10 of Netezza’s more senior engineers. Friday, I talked for 1 ½ hours with IBM Fellow and DB2 Chief Architect Tim Vincent, and we agreed we needed at least 2 hours more. In both cases, the compression part of the discussion seems like a good candidate to split out into a separate post. So here goes.

When you sell a row-based DBMS, as Netezza and IBM do, there are a couple of approaches you can take to compression. First, you can compress the blocks of rows that your DBMS naturally stores. Second, you can compress the data in a column-aware way. Both Netezza and IBM have chosen completely column-oriented compression, with no block-based techniques entering the picture to my knowledge. But that’s about as far as the similarity between Netezza and IBM compression goes.

IBM’s basic DB2 compression strategy is remarkably simple. In every table (not column) – or in each range partition in a range-partitioned table — the 4096 most common* values are identified; these are all encoded into 12-bit strings. And that’s that. This has been happening since DB2 9.1, released 4 ½ years ago. DB2’s compression persists through logs, buffer pools (i.e., RAM cache), and so on. In DB2 9.7, the most recent release, IBM extended the use of the compression to a few areas it hadn’t stretched before, such as log-based replication, native XML, or CLOBs (Character Large OBjects) that happen not to be too big.

*Actually, I’d presume it’s not exactly the “most common”; there surely is some minimum length of a value to be encoded, or some bias toward length. Also, the determination of what to encode is probably a little imprecise. E.g., I forgot to ask whether the choice of values ever changes as data got updated.

The sophisticated part of DB2’s simple compression strategy is its breadth of applicability; DB2 compression can apply to:

Values in columns (numeric, character, whatever)
Substrings of values in columns
Groups of columns (e.g., city/state/zip code)

Except for the 4096 values limit, that sounds at least as flexible as the Rainstor/Clearpace compression approach.

Netezza, unlike IBM, takes a grab-bag approach to compression – try out a bunch of techniques, see which work best, and incorporate those in the product. Netezza first introduced compression a couple of years ago, for numeric columns only, especially integer. Techniques used in Netezza numeric compression include but are not limited to:

Delta compression, wherein you store the increment between a value and its predecessor rather than a whole new value.
Ways of indicating that a value or increment was just the same as in the row before.

This was via something called Compress Engine,* now being renamed to Compress Engine 1. Netezza’s new Compress Engine 2 improves on what Netezza did in Compress Engine 1 for numeric data, most notably by trimming away excess field length. (Netezza says it got 28% better compression on a test data set with almost no character strings, primarily from that enhancement.) Further, Netezza Compress Engine 2 adds new compression techniques, allowing it to handle VARCHAR – i.e. character strings — as well.

*Fortunately, the original name or at least description of “Compiled Tables” is retreating ever more from view.

Netezza’s Compress Engine 2 has two ways to compress character fields/text strings – prefix compression and Huffman coding. By way of contrast, Netezza tested suffix compression and decided it wasn’t beneficial enough to bother messing with.

The idea behind prefix compression is that if two strings start with the same characters, for the second one you only have to record the part that’s different. Prefix compression has a lot of the same merits as delta compression; like delta compression, it works best on sorted columns. (An example of where prefix compression makes obvious sense is URLs, which tend to all start in similar ways.)
In Netezza’s version of Huffman coding, the alphabet is encoded symbol-by-symbol, with more common characters getting codes of shorter length. These codes are chosen on a column-by-column basis. (I presume the “/” character gets shorter code in a URL column than it would, for example, in one that stored addresses.)

While I didn’t ask explicitly, it seems pretty obvious that Compress Engine 2’s functionality is a strict superset of Compress Engine 1’s. Netezza is going to run Compress Engines 1 and 2 side by side, but expects pages to move from Compress Engine 1’s purview to Compress Engine 2’s as part of the new “table grooming” process.

Related links

IBM kindly permitted me to post some of its slides in the area of compression
Microsoft SQL Server seems to rely on prefix and dictionary compression

Categories: Data warehousing, Database compression, IBM and DB2, Microsoft and SQL*Server, Netezza

Subscribe to our complete feed!

Comments

17 Responses to “The Netezza and IBM DB2 approaches to compression”

Ivan Novick on June 21st, 2010 12:56 pm

That’s interesting… did you get any numbers as to what % of compression can be expected from these approaches with “typical” customer data?
Curt Monash on June 21st, 2010 1:17 pm

@Ivan,

Well, twice as good as the current slightly better than 2X. But nothing more detailed than that. I.e., nothing giving different figures for different kinds of data.

Other than the time Omer Trajman, then of Vertica, provided some numbers based on direct field experience, I’ve found that kind of number pretty hard to come by.
Vlad Rodionov on June 21st, 2010 4:05 pm

Is not Huffman coding bad for real-time compression? From the performance point of view?

“Ways of indicating that a value or increment was just the same as in the row before.”

This can be done using simple RLE.
Serge Rielau on June 22nd, 2010 7:14 am

@Ivan:
We (DB2) have seen compression >70% on some SAP BW fact tables. Typically we claim a more conservative number though.

@All
Note that the index compression in DB2 uses prefix compression for the index key and delta-compression for the row-ids.
The higher the number of unique values the more savings you get from prefix compression and the lower the more savings you get from the delta compression on the rowids (more rowids per index value)

@Curt:
The dictionary is static, so updates do not affect it. If you have a table with significant “drift” such as a snowshoe in winter and jetski in summer and it matters (i.e. a big table) then you typically use range-partitioning anyway. So the partition level dictionary will cope with the drift.
Garth Dickie on June 24th, 2010 10:42 am

@Vlad:
All of our (Netezza) compression happens near the disk, so it parallelizes nicely across nodes. Huffman coding does use some CPU, but it is fast enough to keep up with the disk during writes.

Decompression happens in the FPGA. Huffman decoding quickly isn’t trivial, but we found ways to parallelize and pipeline within the FPGA and again, it is fast enough. There are techniques we can use in hardware that don’t apply easily to an Intel-only implementation.
Notes on a spate of Netezza-related blog posts | DBMS2 -- DataBase Management System Services on June 28th, 2010 3:02 pm

[…] A discussion of Netezza’s and IBM’s compression strategies […]
EMC Buys Greenplum – Big Data Realignment Continues « Merv's Market Strategies for IT Suppliers on July 6th, 2010 10:46 pm

[…] has had it for a while too, and IBM has been pushing its database compression as well. See this useful post by Curt Monash on the topic for […]
EMC Buys Greenplum – Big Data Realignment Continues on July 7th, 2010 10:10 am

[…] has had it for a while too, and IBM has been pushing its database compression as well. See this useful post by Curt Monash on the topic for […]
M-A-O-L » Netezza Updates on July 16th, 2010 3:14 am

[…] The Netezza and IBM DB2 approaches to compression […]
Some thoughts on the announcement that IBM is buying Netezza | DBMS 2 : DataBase Management System Services on September 20th, 2010 4:40 pm

[…] been getting some DB2 briefings, which is why I’ve blogged about some specialized technical points from same. But I can’t yet say why the theoretically great-sounding data […]
Notes on data warehouse appliance prices | DBMS 2 : DataBase Management System Services on October 23rd, 2010 5:58 am

[…] was based on 2.25X compression. Since then, Netezza has upgraded its compression. Netezza now quotes 4X compression. Accordingly, Netezza’s list price is now around […]
Martin Rusoff on November 24th, 2010 3:24 pm

One of the interesting and fundamental differences beginning to show up in databases is whetehr they actually work with compressed data or simply store it compressed. Some of the research on column-stores (Vertica, MonetDB etc.) xseems to indicate that significant performance gains can be found by performing some operations on compressed data… It would be interesting to know which vendors have that in the works.
Curt Monash on November 25th, 2010 12:31 am

Vertica is probably the most aggressive at operating on compressed data, but a number of others do it partially.

One catch is that for some compression types (e.g. dictionary/token) it is straightforward to operate on compressed data, whereas for others (e.g. anything block-level) it is well-nigh impossible. Generally, “columnar” types of compression allow the possibility of operating on compressed data, while others don’t. Dictionary/token is the easiest case of all; it would be hard to understand why somebody would NOT operate on compressed data if it was compressed only that way.
More on temp space, compression, and “random” I/O | DBMS 2 : DataBase Management System Services on April 10th, 2011 2:18 am

[…] the plus side, I was reminded of something I should have noted when I wrote about DB2 compression […]
Compression in Sybase ASE 15.7 | DBMS 2 : DataBase Management System Services on October 13th, 2011 11:29 pm

[…] the first release of Sybase ASE with data compression. Sybase fondly believes that it is matching DB2 and leapfrogging Oracle in compression rate with a single compression scheme, namely page-level […]
Paul on March 26th, 2014 8:07 am

Is there somebody who is willing to share information about a plan of approach for a test with the NETEZA box or a POC?

Kind regards, Paul
Columnar compression vs. column storage | DBMS 2 : DataBase Management System Services on February 12th, 2015 2:06 pm

[…] a variant, a few columns can be compressed together, e.g. the various parts of an address. IBM DB2 and RainStor both do […]

Leave a Reply

Search our blogs and white papers

Monash Research blogs

DBMS 2 covers database management, analytics, and related technologies.
Text Technologies covers text mining, search, and social software.
Strategic Messaging analyzes marketing and messaging strategy.
The Monash Report examines technology and public policy issues.
Software Memories recounts the history of the software industry.

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.

Links
- Monash Research
- White Papers
Admin
- Log in

The Netezza and IBM DB2 approaches to compression

Comments

Search our blogs and white papers

Monash Research blogs

User consulting

Vendor advisory

Monash Research highlights

Recent posts

Categories

Date archives

Admin