A month ago, I posted about typical Hadoop hardware. After talking today with Eric Baldeschwieler of Hortonworks, I have an update. I also learned some things from Eric and from Brian Christian of Zettaset about Hadoop compression.
First the compression part. Eric thinks 6-10X compression is common for “curated” Hadoop data — i.e., the data that actually gets used a lot. Brian used an overall figure of 6-8X, and told of a specific customer who had 6X or a little more. By way of comparison, the kinds of data involved sound similar to those for which Vertica claimed 10-60X compression almost three years ago.
Eric also made an excellent point about low-value machine-generated data. I was suggesting that as Moore’s Law made sensor networks ever more affordable:
- There would be lots more data thrown off.
- A lot of it would be repetitive “I’m fine; nothing to report” kinds of events.
- It would be a good idea to filter this low-value information out rather than permanently storing it.
Eric retorted that such data compresses extremely well. He was, of course, correct. If you have a long sequence or other large amount of identical data, and the right compression algorithms* — yeah, that compresses really well.
*Think run-length encoding (RLE), delta, or tokenization with variable-length tokens.
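To make the point concrete, here’s a tiny illustration of run-length encoding on a stream of repetitive sensor heartbeats. This is my own sketch, not Eric’s example and not what Hadoop’s codecs actually do internally; the event names are made up.

```python
# A minimal sketch of run-length encoding (RLE) on a hypothetical stream of
# "I'm fine; nothing to report" sensor events. Hypothetical event names.
def rle_encode(events):
    """Collapse runs of identical events into (event, count) pairs."""
    runs = []
    for e in events:
        if runs and runs[-1][0] == e:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([e, 1])       # start a new run
    return runs

# 10,000 heartbeats, with one real alert buried in the middle.
stream = ["OK"] * 5000 + ["TEMP_HIGH"] + ["OK"] * 4999
encoded = rle_encode(stream)

print(len(stream), "raw events ->", len(encoded), "runs")
# 10000 raw events -> 3 runs
```

The more repetitive the stream, the closer the encoded form gets to a constant size, which is why “nothing to report” data compresses so dramatically.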
While I was at it, I asked Eric what might be typical for Hadoop temp/working space. He said that at Yahoo it was getting down to 1/4 of the disk, from a previous figure of 1/3.
Anyhow, Yahoo’s most recent standard Hadoop nodes feature:
- 8-12 cores
- 48 gigabytes of RAM
- 12 disks of 2 or 3 TB each
If you divide the 24-36 TB of raw disk by 3 for standard Hadoop redundancy, and take off 1/4 for temp space, then you have 6-9 TB/node. Multiply that by a compression factor of 6-10X, at least for the “curated” data, and you get to 36-90 TB of user data per node.
As an alternative, suppose we take a point figure from Cloudera’s ranges: 16 TB of spinning disk per node (8 spindles, 2 TB/disk). Go with the 6X compression figure. Lop off 1/3 for temp space. That more conservative calculation leaves us with a bit over 20 TB/node, which is probably a more typical figure among today’s Hadoop users.
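For anyone who wants to check the arithmetic, here’s a quick back-of-envelope sketch of both calculations. The figures are the ones quoted above, not measurements, and the helper function is just my own way of organizing the math.

```python
# Back-of-envelope: usable (compressed) user data per Hadoop node, in TB.
def user_data_per_node(raw_tb, temp_fraction, replication, compression):
    """Raw disk, minus temp space, divided by replication, times compression."""
    return raw_tb * (1 - temp_fraction) / replication * compression

# Yahoo-style node: 12 disks x 2-3 TB, 1/4 temp space, 3x replication, 6-10X compression.
low  = user_data_per_node(12 * 2, 0.25, 3, 6)     # ~36 TB
high = user_data_per_node(12 * 3, 0.25, 3, 10)    # ~90 TB

# More conservative node: 8 disks x 2 TB, 1/3 temp space, 3x replication, 6X compression.
conservative = user_data_per_node(8 * 2, 1/3, 3, 6)   # ~21.3 TB

print(f"Yahoo-style node: {low:.0f}-{high:.0f} TB of user data")
print(f"Conservative node: {conservative:.1f} TB of user data")
```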