Comments on: The secret sauce to Clearpace’s compression

By: The Netezza and IBM DB2 approaches to compression | DBMS 2 : DataBase Management System Services

Tue, 11 Jan 2011 18:20:30 +0000

[…] Except for the 4096 values limit, that sounds at least as flexible as the Rainstor/Clearpace compression approach. […]

By: Sneakernet to the cloud | DBMS2 -- DataBase Management System Services

Sat, 30 May 2009 03:06:07 +0000

[…] sending data to the cloud, you probably want to compress it to the max before sending. Clearpace’s new RainStor structured-data archiving service emphasizes that idea. RainStor marketing says cloud, […]

By: Andy Ben-Dyke

Wed, 20 May 2009 19:31:13 +0000

To clarify, Clearpace’s underlying technology uses a tree-based rather than columnar structure, with deduplication at both the field level and the pattern level. When source data is loaded into NParchive, each record is stored as a series of pointers to the location of a single instance of a data value, or of a pattern of data values. The NParchive data store comprises a tree-based structure that links the various instances of the patterns together to establish the data records. Each record is essentially an independent tree, but records can share leaves and branches. Combined with the additional algorithmic and byte-level compression techniques employed by NParchive, this approach typically delivers 40:1 compression, and the original data records can still be reconstituted at any time.

NParchive’s tree-based approach provides the advantages of the columnar structure (column-level access and compression) but also allows additional compression to be applied (based upon “patterns” between columns). Furthermore, the tree structure is used for in-memory querying, so the memory footprint is also significantly reduced.
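As a rough illustration of the idea described above (this is not NParchive's actual implementation, whose internals are not given here), the following Python sketch stores each distinct field value and each distinct pair of adjacent sub-patterns exactly once, and keeps one root pointer per record. The class name `DedupStore`, the pairwise way of building branches, and the sample log-style records are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

Node = Tuple  # ("leaf", value) or ("branch", left_id, right_id)


class DedupStore:
    """Toy store: every distinct leaf and branch is kept exactly once."""

    def __init__(self) -> None:
        self.nodes: List[Node] = []       # node id -> node
        self.index: Dict[Node, int] = {}  # node -> node id
        self.roots: List[int] = []        # one root pointer per record

    def _intern(self, node: Node) -> int:
        # Return the id of an existing identical node, or store a new one.
        if node not in self.index:
            self.index[node] = len(self.nodes)
            self.nodes.append(node)
        return self.index[node]

    def add_record(self, fields: List[str]) -> None:
        if not fields:
            raise ValueError("a record needs at least one field")
        # Field-level deduplication: one leaf per distinct value.
        level = [self._intern(("leaf", f)) for f in fields]
        # Pattern-level deduplication: pair up neighbours until one root remains.
        while len(level) > 1:
            nxt = [
                self._intern(("branch", level[i], level[i + 1]))
                for i in range(0, len(level) - 1, 2)
            ]
            if len(level) % 2:
                nxt.append(level[-1])  # odd node is carried up unchanged
            level = nxt
        self.roots.append(level[0])

    def get_record(self, n: int) -> List[str]:
        # Walk the tree left to right to reconstitute the original fields.
        out: List[str] = []
        stack = [self.roots[n]]
        while stack:
            node = self.nodes[stack.pop()]
            if node[0] == "leaf":
                out.append(node[1])
            else:
                stack.append(node[2])
                stack.append(node[1])
        return out


store = DedupStore()
store.add_record(["GET", "/index.html", "200", "UK"])
store.add_record(["GET", "/index.html", "404", "UK"])
store.add_record(["GET", "/index.html", "200", "UK"])
print(len(store.nodes), store.roots, store.get_record(1))
```

In this toy, the third record adds no new nodes at all and the second shares the "GET", "/index.html" branch with the first, which is the sense in which independent record trees can share leaves and branches.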

Take a look at this post http://tinyurl.com/qraffr if you want more information on NParchive’s compression techniques.

By: NParchive data compression - the secret sauce | Clearpace Blog

Wed, 20 May 2009 18:54:24 +0000

[…] patents and hundreds of man years of development. However, following some commentary in a post by Curt Monash this week, I thought I’d shed some light on Clearpace’s “secret sauce”. I’ve tried to […]

By: Joydeep Sen Sarma

Sun, 17 May 2009 08:08:45 +0000

This looks pretty interesting – log files are often heavily denormalized (since joins in warehouses are so expensive) – and the multi-column idea could work very well.

I have been playing around with S3 and EC2 lately – one of the things that struck me was that the cost of uploading data can also be non-trivial. Besides, if data is not uploaded in an optimally compressed form, the user has to rent CPU cycles in the cloud to compress it.

I think it would be very interesting if highly efficient compression could be applied right from the moment data originates, all the way to its final long-term store.

By: Curt Monash

Thu, 14 May 2009 08:04:12 +0000

That looks like good stuff, Joe. Thanks!

By: Joe Hellerstein

Thu, 14 May 2009 07:03:58 +0000

For a good technical discussion of how to trade off row- and column-wise schemes for maximum compression, have a look at the IBM work on Blink. In many cases they show you can approach the optimal compression rate (entropy) this way. I like how they cut through marketing fog on columns vs. rows and focus on the technical meat of compression, and the costs of coding/decoding vs. I/O. See this paper on Blink and their study/survey of various compression methods.
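To make the entropy remark concrete, here is a small Python sketch. It does not reproduce anything from the Blink work. It only computes the Shannon entropy of a column in bits per value, and compares the sum of two per-column entropies with the entropy of the columns taken together as pairs. The function names and the sample city/country data are assumptions made for this example, and the figures treat values as independent draws from the observed frequencies.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy_bits(values: Sequence[Hashable]) -> float:
    """Shannon entropy of the observed value distribution, in bits per value."""
    total = len(values)
    if total == 0:
        return 0.0
    counts = Counter(values)
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h + 0.0  # turns a possible -0.0 into 0.0


def joint_entropy_bits(col_a: Sequence[Hashable], col_b: Sequence[Hashable]) -> float:
    """Entropy of the two columns taken together, as (a, b) pairs per row."""
    return entropy_bits(list(zip(col_a, col_b)))


city = ["London", "Paris", "London", "Paris"]
country = ["UK", "FR", "UK", "FR"]

separate = entropy_bits(city) + entropy_bits(country)  # columns coded independently
together = joint_entropy_bits(city, country)           # correlated columns coded as pairs
print(separate, together)
```

When columns are correlated, as city and country are here, the joint entropy is lower than the sum of the per-column entropies, which is one way to see why coding across columns can beat coding each column in isolation.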

By: Introduction to Clearpace | DBMS2 -- DataBase Management System Services

Thu, 14 May 2009 05:52:18 +0000

[…] and deduping them. I’m still fuzzy on how that all works. (Edit: I subsequently posted an explanation of that […]