I’ve posted extensively about data-warehouse-focused DBMSs’ compression, which can be a major part of their value proposition. Most notable, perhaps, is a short paper Mike Stonebraker wrote for this blog — before he and his fellow researchers started their own blog — on column stores’ advantages in compression over row stores. Compression has long been a big part of the DATAllegro story, while Netezza got into the compression game only recently. Part of Teradata’s pricing disadvantage may stem from weak compression results. And so on.
Well, the general-purpose DBMS vendors are working busily at compression too. Microsoft SQL Server 2008 exploits compression in several ways (basic data storage, replication/log shipping, backup). And Oracle offers compression too, as per this extensive writeup by Don Burleson.
If I had to sum up what we do and don’t know about database compression, I guess I’d start with this:
- Columnar DBMSs really do get substantially better compression than row-based database systems. The most likely reasons are:
  - More elements of a column fit into a single block, so all compression schemes work better.
  - More compression schemes wind up getting used (e.g., delta compression as well as the token/dictionary compression that row-based systems use too).
- Data-warehouse-focused row stores seem to do better at compression than general-purpose DBMSs. The most likely reasons are some combination of:
  - They’re trying harder.
  - They use larger block sizes.
- Notwithstanding these reasonable-sounding generalities, there’s a lot of variation in compression success among otherwise comparable products.
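To make the two scheme families above concrete, here is a minimal sketch of delta compression and token/dictionary compression applied to single columns. The function names and sample data are purely illustrative — this is not any vendor’s actual implementation, just a toy showing why sorted or low-cardinality columns compress so well.

```python
def dictionary_encode(column):
    """Token/dictionary compression: replace repeated values with small integer codes."""
    dictionary = {}
    codes = []
    for value in column:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
        codes.append(dictionary[value])
    return dictionary, codes

def delta_encode(column):
    """Delta compression: store differences between adjacent values.
    Works best on sorted or slowly-changing data, which columnar layouts
    (and larger blocks of same-column values) make common."""
    deltas = [column[0]]
    for prev, cur in zip(column, column[1:]):
        deltas.append(cur - prev)
    return deltas

# A sorted column of IDs shrinks to one base value plus small deltas...
order_ids = [10010, 10011, 10011, 10013, 10020]
print(delta_encode(order_ids))        # [10010, 1, 0, 2, 7]

# ...while a low-cardinality column shrinks to a tiny dictionary plus codes.
states = ["CA", "NY", "CA", "CA", "TX", "NY"]
print(dictionary_encode(states))      # ({'CA': 0, 'NY': 1, 'TX': 2}, [0, 1, 0, 0, 2, 1])
```

The point of the sketch: when a whole block holds values from one column, runs of similar or repeated values are long, so both the deltas and the dictionary codes stay small — which is exactly the columnar advantage the list above describes.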
Compression is one of the most important features a database management system can have, since it creates large savings in storage and sometimes non-trivial gains in performance as well. Hence, it should be a key item in any DBMS purchase decision.