August 8, 2008

Database compression coming to the fore

I’ve posted extensively about data-warehouse-focused DBMS’ compression, which can be a major part of their value proposition. Most notable, perhaps, is a short paper Mike Stonebraker wrote for this blog — before he and his fellow researchers started their own blog — on column-stores’ advantages in compression over row stores. Compression has long been a big part of the DATAllegro story, while Netezza got into the compression game just recently. Part of Teradata’s pricing disadvantage may stem from weak compression results. And so on.

Well, the general-purpose DBMS vendors are working busily at compression too. Microsoft SQL Server 2008 exploits compression in several ways (basic data storage, replication/log shipping, backup). And Oracle offers compression too, as per this extensive writeup by Don Burleson.

If I had to sum up what we do and don’t know about database compression, I guess I’d start with this:

Compression is one of the most important features a database management system can have, since it creates large savings in storage and sometimes non-trivial gains in performance as well. Hence, it should be a key item in any DBMS purchase decision.

Comments

7 Responses to “Database compression coming to the fore”

  1. Dominik Slezak on August 11th, 2008 10:59 am

    Indeed, there are numerous approaches to data compression in databases, and they can be categorized in many different ways. I would like to draw attention to the trade-off between compression ratio and (de)compression speed. Many database vendors use relatively light compression algorithms, which yield worse compression ratios but allow the engine to work on data that is not fully decompressed. Other vendors achieve better compression ratios by applying more advanced algorithms, which, however, add a more time-consuming decompression phase to the whole solution. It is an interesting dilemma which strategy is better, for row and column stores, for general-purpose and data-warehouse-focused products, etc. I think the answer lies in the database engine's ability (or inability) to precisely identify, or heuristically predict, which pieces of data should be accessed, decompressed, and processed, and in which order. To summarize, some ability to manipulate compressed data is great. But, on top of that, the better the database can isolate and organize the data a query actually needs, minimizing the need for decompression, the more sophisticated the compression techniques it can afford to apply.
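
    A minimal Python sketch of that "work on compressed data" idea may help; it illustrates dictionary encoding in general, not any particular vendor's engine, and the function names are made up for illustration:

        from typing import Dict, List, Tuple

        def dictionary_encode(values: List[str]) -> Tuple[Dict[str, int], List[int]]:
            """Replace each distinct string with a small integer code."""
            dictionary: Dict[str, int] = {}
            codes: List[int] = []
            for v in values:
                if v not in dictionary:
                    dictionary[v] = len(dictionary)
                codes.append(dictionary[v])
            return dictionary, codes

        def filter_equals(dictionary: Dict[str, int], codes: List[int], literal: str) -> List[int]:
            """Return row positions where the column equals literal, touching only the codes."""
            code = dictionary.get(literal)
            if code is None:   # the literal never occurs, so nothing needs decompressing at all
                return []
            return [i for i, c in enumerate(codes) if c == code]

        column = ["MA", "CA", "MA", "NY", "CA", "MA"]
        dictionary, codes = dictionary_encode(column)
        print(filter_equals(dictionary, codes, "MA"))   # -> [0, 2, 5]

    The equality test runs entirely against the integer codes; the original strings are never rebuilt, which is the sense in which a light encoding lets an engine query compressed data.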

  2. Curt Monash on August 11th, 2008 11:17 am

    Excellent point, Dominik.

    Vertica’s the most famous for making the claim, but I’ve heard a few times now: “We query on compressed data all the way through; we don’t have to decompress it before query execution.”

    Best,

    CAM

  3. Jason on August 13th, 2008 4:05 pm

    From what I understand, if a database is encrypted, compression is very difficult. If so, good compression would come at the expense of decreased database file security. Or do data warehouse users mostly rely on the physical security of the database server hardware (i.e., a locked-down data center) to protect against data theft?

  4. Curt Monash on August 13th, 2008 6:46 pm

    Jason, that’s a good question, and offhand I don’t know the answer. But speculating:

    1. Dictionary/token compression, which is the main form, shouldn’t be much affected by encryption. An exception would be if you’re SO paranoid that you don’t want to reveal anything about your distribution of values even if the values are protected, but outside of a few national security applications I don’t see why that would be the case.

    2. Delta compression would be problematic. Compression after encryption would seem not to work, and compressing before encryption might rule out what are otherwise some shortcuts in getting reasonable write speed. (A sketch of this point appears below.)

    CAM
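
    A minimal Python sketch of the delta-compression point in item 2; the names are invented for illustration. Delta coding stores a starting value plus the small gaps between neighboring values, which only works when consecutive values are close together, and that is exactly the structure that encrypting the raw values first would scramble:

        from typing import List, Tuple

        def delta_encode(sorted_keys: List[int]) -> Tuple[int, List[int]]:
            """Store the first value plus the gaps between consecutive values."""
            base = sorted_keys[0]
            deltas = [b - a for a, b in zip(sorted_keys, sorted_keys[1:])]
            return base, deltas

        def delta_decode(base: int, deltas: List[int]) -> List[int]:
            """Rebuild the original values by summing the gaps back up."""
            out = [base]
            for d in deltas:
                out.append(out[-1] + d)
            return out

        keys = [100000, 100003, 100004, 100010, 100011]
        base, deltas = delta_encode(keys)
        print(base, deltas)                      # 100000 [3, 1, 6, 1]  (small gaps, cheap to store)
        assert delta_decode(base, deltas) == keys

    Run the same values through a cipher first and the gaps become essentially random large numbers, so there is nothing small left to store; that is why delta compression has to happen before encryption.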

  5. Stuart Frost on August 14th, 2008 10:08 pm

    In the DATAllegro architecture, compression helps us to balance CPU and I/O better. The basic problem is that CPU power is increasing much faster than I/O bandwidth. By using some of the ‘excess’ CPU power to decompress data after it’s read from disk, we effectively get more I/O bandwidth without having to add extra disks, switches and fiber channel cards, thereby keeping costs lower.

    To maintain these benefits with an encrypted database, we encrypt after we compress. Otherwise, the compression ratio would fall dramatically, since encrypted data doesn’t compress as well.

    Stuart Frost
    CEO, DATAllegro
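
    A back-of-the-envelope Python sketch of that CPU-for-I/O trade; the numbers are illustrative assumptions, not DATAllegro figures:

        def effective_scan_rate(disk_mb_per_s: float,
                                compression_ratio: float,
                                decompress_mb_per_s: float) -> float:
            """Logical MB/s of table data scanned when pages are stored compressed.

            Reading compressed pages turns disk_mb_per_s of physical I/O into
            disk_mb_per_s * compression_ratio of logical data, but only if the
            CPU can decompress that fast, so the slower of the two is the cap.
            """
            return min(disk_mb_per_s * compression_ratio, decompress_mb_per_s)

        # e.g. 400 MB/s of raw I/O, 3:1 compression, 2,000 MB/s of decompression throughput
        print(effective_scan_rate(400, 3.0, 2000))   # -> 1200.0 logical MB/s from 400 MB/s of disk

    As long as spare CPU cycles keep decompression off the critical path, compression behaves like extra I/O bandwidth, which is the effect described above.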

  6. Robert Potter on September 4th, 2008 4:33 pm

    Per information theory and encoding practice, compressing plain text before encrypting it yields ciphertext that is harder to decipher, because compression reduces the variance in value frequencies and thereby obscures the frequency-based patterns a cryptanalyst would look for. Furthermore, compressing first can yield better space savings than the reverse order. To my knowledge, in most business settings, for efficiency reasons, compression/decompression is done in memory, while encryption/decryption is performed by the disk controller during disk I/O, using a faster encryption algorithm such as DES that trades off encryption strength for speed.
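
    A rough Python demonstration of the ordering point made in the two comments above, using zlib and a toy XOR keystream as a stand-in for a real cipher (the stand-in is only meant to produce high-entropy output, not to be secure):

        import random
        import zlib

        def toy_encrypt(data: bytes, seed: int = 42) -> bytes:
            """XOR with a pseudorandom keystream, so the output looks like random bytes."""
            keystream = random.Random(seed).randbytes(len(data))   # Random.randbytes needs Python 3.9+
            return bytes(a ^ b for a, b in zip(data, keystream))

        # Repetitive, low-entropy data, loosely like a denormalized warehouse column.
        plaintext = b"2008-08-08,compression,warehouse,42\n" * 1000

        compress_then_encrypt = toy_encrypt(zlib.compress(plaintext))
        encrypt_then_compress = zlib.compress(toy_encrypt(plaintext))

        print(len(plaintext))               # 36000 bytes of raw data
        print(len(compress_then_encrypt))   # a few hundred bytes; compression still pays off
        print(len(encrypt_then_compress))   # roughly the full 36000; ciphertext barely compresses

    Encrypting first destroys the repetition the compressor relies on, which is why both comments land on compress-then-encrypt.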

  7. Infology.Ru » Blog Archive » Data compression in DBMS comes to the fore on October 5th, 2008 5:06 pm

    […] Author: Curt Monash Original publication date: 2008-08-08 Translation: Oleg Kuzmenko Source: Curt Monash's blog […]
