October 17, 2012

Notes on Hadoop hardware

I talked with Cloudera yesterday about an unannounced technology, and took the opportunity to ask some non-embargoed questions as well. In particular, I requested an update to what I wrote last year about typical Hadoop hardware.

Cloudera thinks the picture now is:

Discussion around that included:

The foregoing applies to software-only Hadoop, specifically Cloudera’s distribution. The Hadoop appliances that Cloudera is familiar with tend to have higher-end hardware — more CPUs, “fancier” drives, and/or InfiniBand. If I understood correctly, the same is somewhat true of hardware vendors’ pseudo-appliance recommended configurations.

My hunches about all that include:

And finally — as long as MapReduce persists intermediate result sets after every computational step, I wonder whether solid-state cache could be useful. An analogy could be the way analytic RDBMS can use flash for temp space, although I must admit that I can’t think of a lot of RDBMS installations configured to take advantage of that possibility.


7 Responses to “Notes on Hadoop hardware”

  1. Notes on analytic hardware | DBMS 2 : DataBase Management System Services on October 17th, 2012 8:10 am

    […] to my conjecture that if MapReduce insists on writing to persistent storage at every step, you might want to have flash cache just for […]

  2. Charles Zedlewski on October 17th, 2012 11:01 am

    Hi Curt,

    I think it’s fair to explore all these possibilities, just so long as the thought process is:

    ($ per node from adding a new part (e.g. flash for temp) × total number of nodes in the cluster) ÷ cost per node.

    In other words, how many nodes could each fancy part have bought me? And then would I be better, faster, cheaper with the fancy part or the additional nodes? That’s the “hurdle rate” that new node components or component upgrades have to get over.
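    The hurdle-rate arithmetic above can be sketched in a few lines. The figures here are purely illustrative assumptions, not numbers from the discussion:

    ```python
    # Hypothetical figures, for illustration only -- not from the post.
    cost_per_node = 4000.0      # $ for a plain commodity node
    upgrade_per_node = 800.0    # $ extra per node for the "fancy part" (e.g. flash)
    num_nodes = 50

    # Total spend on the upgrade across the cluster...
    upgrade_total = upgrade_per_node * num_nodes

    # ...expressed as the number of extra plain nodes that money could have bought.
    hurdle_nodes = upgrade_total / cost_per_node

    print(f"The upgrade must outperform {hurdle_nodes:.0f} additional plain nodes")
    ```

    In this sketch, the cluster-wide upgrade spend equals ten plain nodes, so the fancy part has to beat ten extra nodes' worth of capacity to pay off.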

  3. Curt Monash on October 17th, 2012 11:37 am


    Thanks for chiming in!

    You’re right of course, except that those nodes aren’t just capex; they’re opex as well.

    If I take the total cost of my hardware purchases up 20% — to pick a number rather at random — while taking power consumption and floor space up 0%, that’s not really a 20% cost increase at all.
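    Curt’s point can be illustrated with an assumed capex/opex split (the numbers below are made up for the sake of the arithmetic):

    ```python
    # Illustrative total-cost-of-ownership split -- assumed, not from the discussion.
    capex = 100.0   # hardware purchases
    opex = 150.0    # power, floor space, support over the same period

    tco_before = capex + opex          # baseline total cost
    tco_after = capex * 1.20 + opex    # hardware up 20%, opex held flat

    increase = (tco_after - tco_before) / tco_before
    print(f"Overall cost increase: {increase:.0%}")
    ```

    With opex larger than capex, a 20% rise in hardware spend works out to only an 8% rise in total cost under these assumed figures.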



  4. Ken Farmer on October 18th, 2012 12:31 pm


    As you point out, organizations without a Google or Facebook data center — i.e., 99% of them — can pay anywhere from $3–15K/year to support each OS image. Increasing node speed and price can dramatically drop total costs for many customers.

    And is the changing picture of an ideal server really due to changing requirements or a better understanding of the problem? Weren’t IO constraints always a problem? And are we seeing these slowly evolving to resemble parallel database nodes?

  5. Quick notes on Impala | DBMS 2 : DataBase Management System Services on October 24th, 2012 10:52 am

    […] 2-hour conversation, and that’s exactly what I had with Cloudera last week. We touched on hardware and general adoption, but much of the conversation was about Cloudera Impala, announced today. Like […]

  6. Paul Johnson on October 31st, 2012 5:22 am

    The amount of storage per node is a key determinant of system performance. Less storage per node = more nodes required and vice versa. The number of nodes is obviously central to overall performance in any clustered system.

    High storage per node can deliver a relatively cheap ‘high capacity low throughput’ system. Low storage per node can deliver a relatively expensive ‘low capacity high throughput’ system. A balanced system will be somewhere in the middle.
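    The storage-per-node tradeoff above amounts to simple division; a quick sketch with assumed figures (not from the comment) shows how node count, and hence aggregate throughput and cost, falls as storage per node rises:

    ```python
    # Assumed cluster sizing figures, for illustration only.
    total_data_tb = 480.0

    for storage_per_node_tb in (12.0, 24.0, 48.0):
        nodes = total_data_tb / storage_per_node_tb
        print(f"{storage_per_node_tb:>4.0f} TB/node -> {nodes:.0f} nodes")
    ```

    Quadrupling storage per node cuts the node count to a quarter, cheap on capacity but with a quarter of the CPUs and network ports serving the same data.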

    Teradata’s patented BYNET interconnect demonstrates the value of a high-speed interconnect in a clustered system. Lack of interconnect scalability can quickly become the bottleneck once the number of nodes is high enough to saturate the network, assuming data is routinely moved between compute nodes.

  7. Hortonworks, Hadoop, Stinger and Hive | DBMS 2 : DataBase Management System Services on August 7th, 2013 2:53 am

    […] or default Hadoop node these days. Happily, the answers seemed like straightforward upgrades to what Cloudera said in October, 2012. Specifics […]
