Thursday, I spent 3 ½ hours talking with 10 of Netezza’s more senior engineers. Friday, I talked for 1 ½ hours with IBM Fellow and DB2 Chief Architect Tim Vincent, and we agreed we needed at least 2 hours more. In both cases, the compression part of the discussion seems like a good candidate to split out into a separate post. So here goes.
When you sell a row-based DBMS, as Netezza and IBM do, there are a couple of approaches you can take to compression. First, you can compress the blocks of rows that your DBMS naturally stores. Second, you can compress the data in a column-aware way. Both Netezza and IBM have chosen completely column-oriented compression, with no block-based techniques entering the picture to my knowledge. But that’s about as far as the similarity between Netezza and IBM compression goes.
IBM’s basic DB2 compression strategy is remarkably simple. In every table (not column) – or in each range partition in a range-partitioned table — the 4096 most common* values are identified; these are all encoded into 12-bit strings. And that’s that. This has been happening since DB2 9.1, released 4 ½ years ago. DB2′s compression persists through logs, buffer pools (i.e., RAM cache), and so on. In DB2 9.7, the most recent release, IBM extended the use of the compression to a few areas it hadn’t stretched before, such as log-based replication, native XML, or CLOBs (Character Large OBjects) that happen not to be too big.
*Actually, I’d presume it’s not exactly the “most common”; there surely is some minimum length of a value to be encoded, or some bias toward length. Also, the determination of what to encode is probably a little imprecise. E.g., I forgot to ask whether the choice of values ever changes as data got updated.
The sophisticated part of DB2′s simple compression strategy is its breadth of applicability; DB2 compression can apply to:
- Values in columns (numeric, character, whatever)
- Substrings of values in columns
- Groups of columns (e.g., city/state/zip code)
Except for the 4096 values limit, that sounds at least as flexible as the Rainstor/Clearpace compression approach.
Netezza, unlike IBM, takes a grab-bag approach to compression – try out a bunch of techniques, see which work best, and incorporate those in the product. Netezza first introduced compression a couple of years ago, for numeric columns only, especially integer. Techniques used in Netezza numeric compression include but are not limited to:
- Delta compression, wherein you store the increment between a value and its predecessor rather than a whole new value.
- Ways of indicating that a value or increment was just the same as in the row before.
This was via something called Compress Engine,* now being renamed to Compress Engine 1. Netezza’s new Compress Engine 2 improves on what Netezza did in Compress Engine 1 for numeric data, most notably by trimming away excess field length. (Netezza says it got 28% better compression on a test data set with almost no character strings, primarily from that enhancement.) Further, Netezza Compress Engine 2 adds new compression techniques, allowing it to handle VARCHAR – i.e. character strings — as well.
*Fortunately, the original name or at least description of “Compiled Tables” is retreating ever more from view.
Netezza’s Compress Engine 2 has two ways to compress character fields/text strings – prefix compression and Huffman coding. By way of contrast, Netezza tested suffix compression and decided it wasn’t beneficial enough to bother messing with.
- The idea behind prefix compression is that if two strings start with the same characters, for the second one you only have to record the part that’s different. Prefix compression has a lot of the same merits as delta compression; like delta compression, it works best on sorted columns. (An example of where prefix compression makes obvious sense is URLs, which tend to all start in similar ways.)
- In Netezza’s version of Huffman coding, the alphabet is encoded symbol-by-symbol, with more common characters getting codes of shorter length. These codes are chosen on a column-by-column basis. (I presume the “/” character gets shorter code in a URL column than it would, for example, in one that stored addresses.)
While I didn’t ask explicitly, it seems pretty obvious that Compress Engine 2′s functionality is a strict superset of Compress Engine 1′s. Netezza is going to run Compress Engines 1 and 2 side by side, but expects pages to move from Compress Engine 1′s purview to Compress Engine 2′s as part of the new “table grooming” process.
- IBM kindly permitted me to post some of its slides in the area of compression
- Microsoft SQL Server seems to rely on prefix and dictionary compression