Comments on: Database compression coming to the fore

By: Infology.Ru » Blog Archive » Data compression in DBMSs comes to the fore

Sun, 05 Oct 2008 21:06:32 +0000

[…] Author: Curt Monash Original publication date: 2008-08-08 Translation: Oleg Kuzmenko Source: Curt Monash's blog […]

By: Robert Potter

Robert Potter — Thu, 04 Sep 2008 20:33:16 +0000

Per information theory and encoding practice, compressing plaintext before encrypting it makes the ciphertext harder to decipher: compression reduces the variance in value frequencies, which obscures the patterns that frequency analysis would otherwise identify. Furthermore, compressing first yields better space savings than the reverse order. To my knowledge, in most business cases, for efficiency reasons, compression/decompression is done in main memory, while encryption/decryption is performed on the disk controller during disk I/O, using a faster encryption algorithm such as DES that trades encryption strength for speed.

By: Stuart Frost

Stuart Frost — Fri, 15 Aug 2008 02:08:18 +0000

In the DATAllegro architecture, compression helps us to balance CPU and I/O better. The basic problem is that CPU power is increasing much faster than I/O bandwidth. By using some of the ‘excess’ CPU power to decompress data after it’s read from disk, we effectively get more I/O bandwidth without having to add extra disks, switches and fiber channel cards, thereby keeping costs lower.
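The trade described above can be put into a back-of-the-envelope model. This is only a sketch under simplifying assumptions, not a description of DATAllegro's internals: the function name, its parameters, and the figures in the usage lines are all hypothetical, and the model ignores caching, seek time, and CPU contention.

```python
def effective_scan_rate(disk_mb_per_s, compression_ratio, decompress_mb_per_s):
    """Uncompressed MB/s that reach the query engine (simplified model).

    disk_mb_per_s       -- physical read rate of compressed bytes (assumed)
    compression_ratio   -- uncompressed size / compressed size (assumed)
    decompress_mb_per_s -- CPU decompression rate, in uncompressed MB/s (assumed)
    """
    # Each compressed MB read expands to `compression_ratio` MB of data,
    # but the CPU caps how fast that expansion can happen.
    return min(disk_mb_per_s * compression_ratio, decompress_mb_per_s)


# Hypothetical figures: with spare CPU, the disk appears 3x faster.
io_bound = effective_scan_rate(100, 3, 1000)
# With a slow decompressor, the CPU becomes the bottleneck instead.
cpu_bound = effective_scan_rate(100, 3, 200)
print(io_bound, cpu_bound)
```

As long as decompression keeps up, the effective rate scales with the compression ratio without any extra disks or controllers.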

To maintain these benefits with an encrypted database, we encrypt after we compress. Otherwise, the compression ratio would fall dramatically, since encrypted data doesn’t compress as well.
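The ordering effect is easy to reproduce with a small experiment. The sketch below uses Python's `zlib` for compression and, as a loudly labeled stand-in for a real cipher, a toy XOR keystream derived from SHA-256. The helper `toy_stream_encrypt`, the sample rows, and the key are all hypothetical, and the toy cipher is not secure; it is only there to produce random-looking bytes.

```python
import hashlib
import zlib


def toy_stream_encrypt(data: bytes, key: bytes) -> bytes:
    """XOR data with a SHA-256 counter-mode keystream.

    Illustration only: a stand-in for a real cipher, NOT secure.
    Applying it twice with the same key returns the original data.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        keystream = hashlib.sha256(key + counter).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, keystream))
    return bytes(out)


# Hypothetical, highly repetitive warehouse-style rows.
rows = b"".join(
    b"region=EMEA;status=SHIPPED;qty=%04d\n" % (i % 50) for i in range(5000)
)
key = b"example-key"

compress_then_encrypt = toy_stream_encrypt(zlib.compress(rows), key)
encrypt_then_compress = zlib.compress(toy_stream_encrypt(rows, key))

print("raw bytes:            ", len(rows))
print("compress then encrypt:", len(compress_then_encrypt))
print("encrypt then compress:", len(encrypt_then_compress))
```

Compressing first keeps the size reduction, because encryption does not change the length of its input here. Encrypting first leaves the compressor with random-looking bytes that contain no repeats to exploit, so the output stays about as large as the raw data.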

Stuart Frost
CEO, DATAllegro

By: Curt Monash

Curt Monash — Wed, 13 Aug 2008 22:46:28 +0000

Jason, that’s a good question, and offhand I don’t know the answer. But speculating:

1. Dictionary/token compression, which is the main form, shouldn’t be much affected by encryption. An exception would be if you’re SO paranoid that you don’t want to reveal anything about your distribution of values even if the values are protected, but outside of a few national security applications I don’t see why that would be the case.

2. Delta compression would be problematic. Compression after encryption would seem not to work, and compressing before encryption might rule out what are otherwise some shortcuts in getting reasonable write speed.
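For readers unfamiliar with the two schemes named above, here is a minimal sketch of each. The function names and sample values are hypothetical, and this is not any vendor's implementation; real engines pack tokens and deltas into far fewer bits.

```python
def dictionary_encode(values):
    """Replace each distinct value with a small integer token."""
    dictionary = {}
    tokens = []
    for value in values:
        # First occurrence of a value gets the next unused token.
        token = dictionary.setdefault(value, len(dictionary))
        tokens.append(token)
    return dictionary, tokens


def delta_encode(numbers):
    """Store the first number, then each difference from its predecessor."""
    if not numbers:
        return []
    deltas = [numbers[0]]
    for previous, current in zip(numbers, numbers[1:]):
        deltas.append(current - previous)
    return deltas


def delta_decode(deltas):
    """Rebuild the original numbers with a running sum."""
    numbers = []
    total = 0
    for delta in deltas:
        total += delta
        numbers.append(total)
    return numbers


dictionary, tokens = dictionary_encode(["DE", "FR", "DE", "US", "FR"])
print(dictionary, tokens)   # {'DE': 0, 'FR': 1, 'US': 2} [0, 1, 0, 2, 1]

deltas = delta_encode([1000, 1003, 1004, 1010])
print(deltas)               # [1000, 3, 1, 6]
print(delta_decode(deltas)) # [1000, 1003, 1004, 1010]
```

The sketch also hints at why the two behave differently under encryption. Dictionary tokens depend only on which values are equal, whereas deltas are small only when neighboring plaintext values are numerically close, a property that encrypted values would not be expected to keep.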

CAM

By: Jason

Jason — Wed, 13 Aug 2008 20:05:38 +0000

From what I understand, if a database is encrypted, compression is very difficult. If so, good compression would come at the expense of decreased database file security. Or do data warehouse users mostly rely on physical security of the database server hardware (i.e., in a locked-down data center) to protect against theft of data?

By: Curt Monash

Curt Monash — Mon, 11 Aug 2008 15:17:24 +0000

Excellent point, Dominik.

Vertica’s the most famous for making the claim, but I’ve heard a few times now “We query on compressed data all the way through; we don’t have to decompress it before query execution.”

Best,

CAM

By: Dominik Slezak

Dominik Slezak — Mon, 11 Aug 2008 14:59:45 +0000

Indeed, there are numerous approaches to data compression in databases, and they can be categorized in many different ways. I would like to draw attention to the trade-off between compression ratio and (de)compression speed. Many database vendors use relatively light compression algorithms, which give relatively worse compression ratios but can work on data that is not fully decompressed. Other vendors achieve better compression ratios by applying more advanced compression algorithms, which, however, add a more time-consuming decompression phase to the whole solution. Which strategy is better, for row and column stores, for general-purpose and data-warehouse-focused products, etc., is an interesting dilemma. I think the answer lies in the database engine's ability (or inability) to precisely identify, or heuristically predict, which pieces of data should be accessed, decompressed, and processed, and in which order. To summarize, some ability to manipulate compressed data is great. But, on top of that, the better the database can isolate and organize the data required for the query, minimizing the need for decompression, the more sophisticated the data compression techniques that may be applied.
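As one illustration of working on data that is not fully decompressed, the sketch below answers two simple queries directly on a run-length-encoded column. Run-length encoding is chosen here only because it is a common light scheme; the comment above does not name a specific algorithm, and the function names and sample column are hypothetical.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def count_equal(runs, target):
    """COUNT(*) WHERE col = target, computed on the runs themselves."""
    return sum(length for value, length in runs if value == target)


def sum_column(runs):
    """SUM(col), computed with one multiplication per run."""
    return sum(value * length for value, length in runs)


column = [5, 5, 5, 7, 7, 5, 9, 9, 9, 9]
runs = rle_encode(column)

print(runs)                  # [(5, 3), (7, 2), (5, 1), (9, 4)]
print(count_equal(runs, 5))  # 4
print(sum_column(runs))      # 70
```

Both queries touch four runs instead of ten values. A heavier general-purpose compressor would typically shrink the column further, but it would have to be fully decompressed before either query could run, which is the trade-off at issue.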