After suggesting that there’s little point to Hadoop appliances, it occurred to me to look into what kinds of hardware actually are used with Hadoop. So far as I can tell:
- Hadoop nodes today tend to run on fairly standard boxes.
- Hadoop nodes in the past have tended to run on boxes that were light with respect to RAM.
- The number of spindles per core on Hadoop node boxes is going up even as disks get bigger.
A key input comes from Cloudera, who to my joy delegated the questions to Omer Trajman, who wrote:
Most Hadoop deployments today use systems with dual socket and quad or hex cores (8 or 12 cores total, 16 or 24 hyper-threaded). Storage has increased as well with 6-8 spindles being common and some deployments going to 12 spindles. These are SATA disks with between 1TB and 2TB capacity. The amount of RAM varies depending on the application. 24GB is common as is 36GB – all ECC RAM. HBase clusters may have more RAM so they can cache more data. Some customers put Hadoop on their “standard box” which may not be perfectly balanced (e.g. more RAM, less disk) and needs to be altered slightly to meet the above specs. The new Dell C2100 series and the HP SL170 series are both popular server lines for Hadoop.
For a year-ago perspective, see this post: http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/
Bullet points from that year-ago link include:
- 4 1TB hard disks in a JBOD (Just a Bunch Of Disks) configuration
- 2 quad core CPUs, running at least 2-2.5GHz
- 16-24GBs of RAM (24-32GBs if you’re considering HBase)
- Gigabit Ethernet
So basically we’re talking in the range of 2-3 GB of RAM per core — and 1 spindle per core, up from perhaps half a spindle per core a year ago.
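As a rough check on those ratios, here is a minimal Python sketch that divides RAM and spindles by core count for a few configurations. The `NodeSpec` class and the specific pairings of cores, RAM, and disks are my own illustrative assumptions; the quoted figures give ranges, not exact combinations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Hypothetical container for one node configuration (not from any Hadoop API)."""
    label: str
    cores: int      # physical cores, ignoring hyper-threading
    ram_gb: float
    spindles: int

    @property
    def ram_per_core(self) -> float:
        return self.ram_gb / self.cores

    @property
    def spindles_per_core(self) -> float:
        return self.spindles / self.cores


# Assumed pairings drawn from the ranges quoted above.
specs = [
    NodeSpec("year-ago, 2 quad-core, 16GB, 4 disks", cores=8, ram_gb=16, spindles=4),
    NodeSpec("year-ago, 2 quad-core, 24GB, 4 disks", cores=8, ram_gb=24, spindles=4),
    NodeSpec("today, 2 quad-core, 24GB, 8 disks", cores=8, ram_gb=24, spindles=8),
    NodeSpec("today, 2 hex-core, 36GB, 12 disks", cores=12, ram_gb=36, spindles=12),
]

for s in specs:
    print(f"{s.label}: {s.ram_per_core:.2f} GB RAM/core, "
          f"{s.spindles_per_core:.2f} spindles/core")
```

Under these assumed pairings, the year-ago boxes come out at half a spindle per core and today's at one, with RAM at 2-3 GB per core throughout. A 12-core box with only 6 spindles would still sit at half a spindle per core, so the trend depends on which combinations people actually buy.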
Meanwhile, a 2009 Yahoo slide deck refers to “500 nodes, 4000 cores, 3TB RAM, 1.5PB disk”; that divides out to 8 cores, 6 GB of RAM, and 3 TB of disk per node, all on “commodity hardware.” By 2010 Yahoo was evidently up to 2 GB of RAM per core.
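The division behind those Yahoo per-node figures can be sketched in a few lines of Python. The `per_node` helper is a hypothetical name of my own, and I am assuming decimal units (1 TB = 1000 GB, 1 PB = 1000 TB), which the slide deck does not specify.

```python
def per_node(nodes: int, cores: int, ram_tb: float, disk_pb: float) -> dict:
    """Divide cluster-wide totals evenly across nodes (assumes decimal units)."""
    return {
        "cores": cores / nodes,
        "ram_gb": ram_tb * 1000 / nodes,
        "disk_tb": disk_pb * 1000 / nodes,
    }


# Totals as quoted from the 2009 Yahoo slide deck.
yahoo_2009 = per_node(nodes=500, cores=4000, ram_tb=3, disk_pb=1.5)
ram_per_core = yahoo_2009["ram_gb"] / yahoo_2009["cores"]

print(yahoo_2009)
print(f"RAM per core: {ram_per_core} GB")
```

That works out to 0.75 GB of RAM per core in 2009, which is the baseline for the later 2 GB per core figure.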
There are lots of data points on the Apache Hadoop wiki, but many seem a few years old, and I don’t immediately see how to time-stamp them. Overall, they seem consistent with the trends I noted at the top of the post.
One thing I haven’t done is attempt to price any of these systems.
Contributions in the comment thread would be warmly appreciated.