July 6, 2011

Hadoop hardware and compression

A month ago, I posted about typical Hadoop hardware. After talking today with Eric Baldeschwieler of Hortonworks, I have an update. I also learned some things from Eric and from Brian Christian of Zettaset about Hadoop compression.

First the compression part. Eric thinks 6-10X compression is common for “curated” Hadoop data, i.e., the data that actually gets used a lot. Brian used an overall figure of 6-8X, and told of a specific customer who had 6X or a little more. By way of comparison, it sounds as if the kinds of data involved are like those for which Vertica claimed 10-60X compression almost three years ago.

Eric also made an excellent point about low-value machine-generated data. I was suggesting that as Moore’s Law made sensor networks ever more affordable: 

Eric retorted that such data compresses extremely well. He was, of course, correct. If you have a long sequence or other large amount of identical data, and the right compression algorithms* — yeah, that compresses really well.

*Think run-length encoding (RLE), delta, or tokenization with variable-length tokens.
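
To make the footnote concrete, here is a minimal Python sketch of two of those techniques, run-length encoding and delta encoding. It is only an illustration, not how Hadoop or any particular codec implements them; the function names and the sample sensor readings and timestamps are hypothetical.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encode a sequence into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


def delta_encode(values):
    """Keep the first value, then store each difference from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values with a running sum over the deltas."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


# Hypothetical sensor that reports an unchanged reading for long stretches.
readings = [21] * 1000 + [22] * 500
encoded = rle_encode(readings)

# Hypothetical timestamps arriving exactly once per second.
timestamps = list(range(1_309_910_400, 1_309_910_400 + 1000))
deltas = delta_encode(timestamps)

print(len(readings), "readings ->", len(encoded), "runs")
print(len(timestamps), "timestamps ->", len(rle_encode(deltas)), "runs after delta + RLE")
```

The two steps compose: regular timestamps turn into a long run of identical deltas, which run-length encoding then collapses to a couple of pairs. That is the sense in which long stretches of identical or regular machine-generated data compress extremely well.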

While I was at it, I asked Eric what might be typical for Hadoop temp/working space. He said at Yahoo it was getting down to 1/4 of the disk, from a previous range of 1/3.

Anyhow, Yahoo’s most recent standard Hadoop nodes feature:

If you divide 12 by 3 for standard Hadoop redundancy, and take off 1/4, then you have 6-9 TB/node. Multiply that by a compression factor of 6-10X, at least for the “curated data,” and you get to 36-90 TB of user data per node.

As an alternative, suppose we take a point figure from Cloudera’s ranges of 16 TB of spinning disk per node (8 spindles, 2 TB/disk). Divide by 3 for redundancy, as before. Lop off 1/3 for temp space. Go with the 6X compression figure. That more conservative calculation leaves us a bit over 20 TB/node, which is probably a more typical figure among today’s Hadoop users.
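
The arithmetic in the two estimates above can be sketched in a few lines of Python. The function name and parameters are hypothetical. The 24-36 TB of raw disk per Yahoo node is an assumption, chosen because it reproduces the 6-9 TB intermediate figure and matches the range quoted in a trackback to this post. Three-way replication is also assumed for the Cloudera-based case, since that is what yields a result a bit over 20 TB.

```python
def user_data_tb(raw_disk_tb, compression, temp_fraction, replication=3):
    """Estimate uncompressed user data that fits on one node, in TB.

    raw_disk_tb   -- raw spinning disk per node
    compression   -- compression factor (6 means 6X)
    temp_fraction -- share of disk reserved for temp/working space
    replication   -- copies of each block kept (assumed to be 3)
    """
    after_replication = raw_disk_tb / replication
    after_temp = after_replication * (1 - temp_fraction)
    return after_temp * compression


# Yahoo-style node: raw capacity of 24-36 TB is an assumption (see above).
yahoo_low = user_data_tb(24, compression=6, temp_fraction=1 / 4)
yahoo_high = user_data_tb(36, compression=10, temp_fraction=1 / 4)

# Cloudera-based point figure: 8 spindles x 2 TB, 6X compression, 1/3 temp space.
cloudera = user_data_tb(16, compression=6, temp_fraction=1 / 3)

print(f"Yahoo-style node: {yahoo_low:.0f}-{yahoo_high:.0f} TB of user data")
print(f"Cloudera-based estimate: {cloudera:.1f} TB of user data")
```

The order of the three adjustments does not matter, since they are all multiplications. What moves the result most is the compression factor, which is also the least certain input.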


10 Responses to “Hadoop hardware and compression”

  1. Petabyte-scale Hadoop clusters (dozens of them) | DBMS 2 : DataBase Management System Services on July 6th, 2011 12:15 am

    […] works out near the low end of the range I came up with for Yahoo’s newest gear, namely 36-90 TB/node. Yahoo’s biggest clusters are little over 4,000 nodes (a limitation that’s getting […]

  2. unholyguy on July 6th, 2011 9:19 am

    Are there customer references for any of it? Vendor claims about compression should always be taken with a grain of salt.

  3. Curt Monash on July 6th, 2011 12:34 pm

    Hortonworks and Zettaset gave me one customer name each with specific figures. Indeed, Hortonworks didn’t give me data for anybody except Yahoo.

    One thing: As Merv Adrian points out privately, I neglected to check just where and by whom this compression tends to be implemented. I’ll look into it.

  4. Eric Baldeschwieler on July 12th, 2011 12:23 am

    The 6-10x number was from Yahoo!, where we store a lot of log data on Hadoop. In general we are simply using text or block compressed sequence files. In both cases the data is simply gzipped.

    Much higher compression ratios are possible for some data types or using more advanced techniques. With HCatalog we intend to support such techniques on columnar and row based data representations.

  5. Curt Monash on July 12th, 2011 2:08 pm


    Do I understand correctly that compression is not now baked into Hadoop, but will be with HCatalog?

  6. Eric Baldeschwieler on July 13th, 2011 8:28 am

    You can easily use compression today from map-reduce by specifying your choice of codec to various input / output formats. Yahoo compresses all of its big data sets via this option.

    Compression is not implemented in the HDFS layer and we’ve no immediate plans to put it there.

    HCatalog will separate application code from data location and representation choices (those input / output formats), making it much easier to roll out new data layouts and compression codecs. I think this will spur rapid improvements in the average level of compression in Hadoop.

  7. Curt Monash on July 13th, 2011 3:20 pm


    So you’re saying compression happens on the application tier rather than the storage tier? Makes sense to me. In particular, it should be AT LEAST as friendly to late decompression as the alternative.



  8. State of Data #57 « Dr Data's Blog on July 14th, 2011 3:26 pm

    […] – What range is Hadoop compression factor? “6-10X compression is common for “curated” Hadoop […]

  9. Introduction to Zettaset | DBMS 2 : DataBase Management System Services on July 27th, 2011 7:27 am

    […] told me of one big customer — with an almost-petabyte Hadoop cluster before compression — namely Zions […]

  10. Stripping out the Hype from the Realities of Big Data in Health Care on January 9th, 2014 12:19 pm

    […] deployment, with about 42,000 nodes with a total of between 180 and 200 petabytes of storage. Their standard node has between 8 and 12 cores, 48 GB of memory, and 24 to 36 terabytes of […]
