I talked with Cloudera yesterday about an unannounced technology, and took the opportunity to ask some non-embargoed questions as well. In particular, I requested an update to what I wrote last year about typical Hadoop hardware.
Cloudera thinks the picture now is:
- 2-socket servers, with 4- or 6-core chips.
- An increasing number of spindles, with twelve 2-TB drives being common.
- 48 gigs of RAM is most common, with 64-96 gigs fairly frequent.
- A couple of 1GigE networking ports.
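A quick, purely back-of-envelope way to read that spec (the 3x replication factor is HDFS's default; the 25% reserve for temp space, logs, and OS overhead is my assumption, not anything Cloudera said):

```python
# Rough per-node capacity math for the "typical" configuration above.
spindles = 12
tb_per_spindle = 2.0      # 2-TB drives
replication = 3           # HDFS's default replication factor
reserve = 0.25            # assumed allowance for temp space, logs, and OS

raw_tb = spindles * tb_per_spindle                  # 24 TB raw per node
usable_tb = raw_tb * (1 - reserve) / replication    # ~6 TB of user data per node

print(f"Raw per node:    {raw_tb:.0f} TB")
print(f"Usable per node: {usable_tb:.1f} TB after replication and reserve")
```

On those numbers, a "typical" node holds roughly 6 TB of unique user data.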
Discussion around that included:
- Enterprises had been running out of storage space; hence the increased amount of storage.
- Even more storage can be stuffed onto a node, and at times is. But at a certain point there’s so much data on a node that recovery from node failure takes forbiddingly long. (I sketch the arithmetic just after this list.)
- There are some experiments with 10 GigE.
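Here's a rough sketch of that recovery arithmetic. The cluster size, the share of bandwidth devoted to re-replication, and the effective GigE payload rate are all my assumptions; the point is only how the numbers scale:

```python
# Rough model of re-replicating a dead node's data: the surviving nodes
# copy its blocks in parallel, each spending some fraction of its network
# bandwidth on recovery traffic.

node_data_tb = 24            # 12 x 2-TB spindles, node nearly full
cluster_nodes = 40           # assumed cluster size
gige_ports = 2               # 2 x 1 GigE per node, per the spec above
mb_per_s_per_port = 100      # roughly wire-speed 1 GigE payload
recovery_share = 0.25        # assumed bandwidth fraction devoted to recovery

aggregate_mb_s = (cluster_nodes - 1) * gige_ports * mb_per_s_per_port * recovery_share
hours = node_data_tb * 1e6 / aggregate_mb_s / 3600
print(f"Re-replication window: ~{hours:.1f} hours")  # ~3.4 hours on these numbers
```

The window grows linearly with data per node, so doubling the storage stuffed onto each node doubles the exposure; presumably that's the "certain point" referred to above.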
The foregoing applies to software-only Hadoop, specifically Cloudera’s distribution. The Hadoop appliances Cloudera is familiar with tend to have higher-end hardware: more CPUs, “fancier” drives, and/or InfiniBand. If I understood correctly, the same is somewhat true of hardware vendors’ pseudo-appliance recommended configurations.
My hunches about all that include:
- Footprint can matter. Not every enterprise has a cheap data center drawing cheap power by the Columbia River.
- As Cray and SAS both teach us, some analytic techniques do require high-speed interconnects.
- There’s nothing wrong with having 2 or more Hadoop clusters. One can have cheap gear, and be the ultimate big bit bucket. The other could have more expensive gear, and perhaps additional software as well. That’s even before you start thinking about cloud vs. on-premises alternatives.
And finally: as long as MapReduce persists intermediate result sets after every computational step, I wonder whether a solid-state cache could be useful. An analogy would be the way analytic RDBMS can use flash for temp space, although I must admit I can’t think of many RDBMS installations configured to take advantage of that possibility.
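To make the temp-space analogy concrete, here's a toy comparison of persisting one step's intermediate output when it contends with HDFS traffic on the same spindles versus when a dedicated solid-state cache absorbs it. Every number in it (intermediate data volume, per-spindle throughput, contention share, SSD figures) is my assumption, not anything Cloudera said:

```python
# Toy model: time to persist one MapReduce step's intermediate results,
# (a) spilling to spindles that also serve HDFS traffic, vs.
# (b) spilling to a dedicated solid-state cache.

intermediate_gb = 200     # assumed intermediate data per step
spindles = 12
hdd_mb_s = 60             # assumed sequential throughput per 2-TB drive
contention = 0.3          # assumed share of disk bandwidth left for temp I/O
ssd_mb_s = 2 * 250        # assumed: two SATA-era SSDs dedicated to temp space

options = {
    "spindles shared with HDFS": spindles * hdd_mb_s * contention,
    "dedicated SSD cache": ssd_mb_s,
}
for name, bw in options.items():
    print(f"{name}: {intermediate_gb * 1000 / bw / 60:.1f} minutes per step")
```

On those (made-up) numbers, the spill step shrinks from roughly 15 minutes to under 7, not because SSDs are so much faster at sequential writes, but because the temp I/O stops fighting HDFS for the same spindles.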