In an introduction to archiving vendor Clearpace last December, I noted that Clearpace claimed huge compression successes for its NParchive product (Clearpace likes to use a figure of 40X), but didn’t give much reason that NParchive could compress a lot more effectively than other columnar DBMS. Let me now follow up on that.
To the extent there’s a Clearpace secret sauce, it seems to lie in NParchive’s unusual data access method. NParchive doesn’t just tokenize the values in individual columns; it tokenizes multi-column fragments of rows. Which particular columns to group together in that way seems to be decided automagically; the obvious guess is that this is based on estimates of the cardinality of their Cartesian products.
Of the top of my head, examples for which this strategy might be particularly successful include:
- Denormalized databases
- Message stores with lots of header information