August 21, 2011

Hadoop evolution

I wanted to learn more about Hadoop and its future, so I talked Friday with Arun Murthy of Hortonworks.* Most of what we talked about was the scaling of NameNode and JobTracker.

Arun previously addressed these issues and more in a June slide deck.

*Arun has been working on Hadoop full time for over 6 years, and leading the Yahoo/Hortonworks MapReduce team for at least the last 2.

For both NameNode and JobTracker, Arun’s take was along the lines of:

*My word, not Arun’s.

In the case of both NameNode and JobTracker, the problem starts with an essential component of the Hadoop system being on a single server. That single server can be mirrored for high availability, but it’s a bottleneck. The current fix for NameNode is to federate it across multiple servers, each owning a portion of the namespace. A variety of further possibilities come to mind:

The Hadoop guys evidently don’t know which of those approaches will be tried next, in which combination. But if they choose wrong — well, as Arun points out, you’re welcome to implement your own alternative.
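
To make the federation idea a bit more concrete: below is a minimal sketch — not anything Arun showed me — of how a client-side mount table splits one namespace across two NameNodes, assuming the ViewFS-style configuration that ships with the federation work. The host names and mount points are made up, and in practice these settings would live in core-site.xml rather than in code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FederatedNamespaceSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Client-side mount table: each NameNode owns a slice of the namespace.
        // Hypothetical hosts and mount points, for illustration only.
        conf.set("fs.defaultFS", "viewfs:///");
        conf.set("fs.viewfs.mounttable.default.link./ads",
                 "hdfs://nn-ads.example.com:8020/ads");
        conf.set("fs.viewfs.mounttable.default.link./logs",
                 "hdfs://nn-logs.example.com:8020/logs");

        FileSystem fs = FileSystem.get(conf);

        // Paths under /ads are served by one NameNode, /logs by the other;
        // client code is unchanged either way.
        for (FileStatus status : fs.listStatus(new Path("/ads"))) {
            System.out.println(status.getPath());
        }
    }
}
```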

Meanwhile, Arun says the current improvement will take Hadoop’s capacity up to half a billion files or so,* or 60-100 petabytes of “storage” (presumably before compression, replication, and so on are taken into account). I didn’t ask Arun to walk me through the arithmetic of that, but I did ask why there were so many files in the first place. I’ve heard in the past of what amounted to “a new file for every update” kinds of scenarios, but that’s not the example he gave. Rather, he spoke of such slicing and dicing of Yahoo advertising data that there are huge numbers of calculated quasi-cubes. For example, advertiser-specific aggregates are run on 5-minute slices of web event data, and stored for 15-minute slices; the raw data is of course kept as well.

*Depending on who you ask, the current figure seems to be in the 70-100 million range, special cases perhaps aside.
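
As for the storage arithmetic, Arun spells it out in the comments below; roughly, using his assumed figures of 36 TB per node, 6,000 nodes per cluster, and 3x replication:

36 TB/node × 6,000 nodes ÷ 3 replicas ≈ 72 PB of effective storage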

Arun also addressed the debate over the fact that MapReduce uses the node-specific file system to create temporary files, rather than HDFS. One reason for this is surely the file-count limitation, but another is performance. Yes, this choice means you lose intermediate results if there’s a node failure; but Hadoop is designed to deal with nodes that fail. And by the way — even if you did use HDFS for intermediate result sets, you might well be setting its replication factor to 1.
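
To illustrate that last point, here is a minimal sketch — not something from our conversation — of writing a scratch file to HDFS with a replication factor of 1 via the standard FileSystem API. The path and block size are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ScratchFileSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical intermediate-output path. With a replication factor
        // of 1 there are no extra copies: a node failure loses the file,
        // but (like MapReduce's local temporary files) it can simply be
        // recomputed.
        Path scratch = new Path("/tmp/job-scratch/part-00000");
        FSDataOutputStream out = fs.create(
                scratch,
                true,               // overwrite if present
                4096,               // I/O buffer size
                (short) 1,          // replication factor
                64L * 1024 * 1024   // block size (illustrative)
        );
        out.writeBytes("intermediate results\n");
        out.close();
    }
}
```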

As for JobTracker — that’s getting split into at least two pieces, to separate its two main functions:

- Cluster-wide resource management, which goes to a single Global Resource Manager.
- Per-job scheduling and monitoring, which goes to a separate Application Master for each job.

Apparently JobTracker has been a key barrier to scalability, being limited to 40-50,000 tasks per cluster, which is roughly 10-15 tasks/node multiplied by the limit-to-date of 4,000 or so nodes/cluster. The new architecture, in which multiple Application Masters compete for resources dispensed by a single Global Resource Manager, is supposed to allow at least 20-50 tasks/node, and a greater number of nodes as well. This capability was checked into the main Hadoop code line just this week.

The Global Resource Manager also runs on a single node, high-availability mirroring aside. Discussion of opportunities for further scaling was much like our discussion of scaling NameNode — there’s one option that could work, another option that could also work, and in the meantime one can substitute one’s own version if one likes. Indeed, different Hadoop users already plug different schedulers into JobTracker today.
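
As a concrete example of that pluggability: which scheduler the JobTracker uses is just a configuration property. A minimal sketch follows, naming the stock Fair Scheduler and Capacity Scheduler classes that ship with Hadoop; in a real deployment the property would be set in mapred-site.xml on the JobTracker node rather than in code.

```java
import org.apache.hadoop.conf.Configuration;

public class SchedulerChoiceSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Point the JobTracker at a different TaskScheduler implementation.
        // Both classes below ship in Hadoop's contrib tree.
        conf.set("mapred.jobtracker.taskScheduler",
                 "org.apache.hadoop.mapred.FairScheduler");

        // Alternatively, the Capacity Scheduler:
        // conf.set("mapred.jobtracker.taskScheduler",
        //          "org.apache.hadoop.mapred.CapacityTaskScheduler");

        System.out.println(conf.get("mapred.jobtracker.taskScheduler"));
    }
}
```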

Another benefit of refactoring JobTracker is that non-MapReduce processing frameworks can be included in Hadoop. The one that gets mentioned repeatedly is MPI (Message Passing Interface). Like the SAS folks, the Hadoop folks seem to think that not all important algorithms can be parallelized via MapReduce, and that MPI is a great place to look for alternative strategies.
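
For the curious, the way a non-MapReduce framework is expected to plug in (per Arun’s comment below) is by supplying its own Application Master, which asks the Global Resource Manager for containers. Here is a heavily simplified sketch of that request loop. It uses the AMRMClient helper API from later Hadoop 2.x releases for readability, and the container counts and sizes are illustrative assumptions, not anything from the actual MPI work.

```java
import java.util.List;

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ToyApplicationMaster {
    public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
        rm.init(new YarnConfiguration());
        rm.start();

        // Register with the Resource Manager, then ask it for containers.
        rm.registerApplicationMaster("", 0, "");
        Resource capability = Resource.newInstance(1024, 1); // 1 GB, 1 core (illustrative)
        for (int i = 0; i < 4; i++) {
            rm.addContainerRequest(
                    new ContainerRequest(capability, null, null, Priority.newInstance(0)));
        }

        // Heartbeat until the scheduler hands containers back. A real
        // Application Master would now launch its framework-specific work
        // (MPI ranks, MapReduce tasks, ...) in those containers.
        int granted = 0;
        while (granted < 4) {
            AllocateResponse response = rm.allocate(0.1f);
            List<Container> containers = response.getAllocatedContainers();
            granted += containers.size();
            Thread.sleep(1000);
        }

        rm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "done", "");
        rm.stop();
    }
}
```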

I’m still confused about issues such as:

But in any case, this all seems to be an important step in Hadoop’s evolution toward being (more of) a flexible, industrial-strength application execution environment.

Comments

7 Responses to “Hadoop evolution”

  1. unholyguy on August 21st, 2011 2:03 pm

    According to this article
    http://developer.yahoo.com/blogs/hadoop/posts/2011/03/mapreduce-nextgen-scheduler/

    The vision is for Global Resource Manager to manage “CPU, memory, disk and network bandwidth, local-storage etc”

    How much of that gets implemented in initial releases and how flexible it is seems unclear at the moment

    I think both Giraph and Hama already implement an MPI on top of Hadoop, though I am sure there is a lot that still needs to be done on top of that

    http://www.slideshare.net/averyching/20110628giraph-hadoop-summit

  2. Arun C Murthy on August 21st, 2011 6:55 pm

    Great post Curt – enjoyed our conversation, thanks.

    Here are some clarifications for your queries:

    # The ResourceManager has a generic resource model to incorporate memory, cpu, disk/network i/o etc. The initial version of the Schedulers we have use RAM only; we expect to use CPU and i/o very soon – particularly as the underlying OS evolves to provide more fine grained controls (e.g. cgroups in RHEL6 etc.). We strive to rely on low-level features available in the OS and not reinvent them as far as possible. More details here: http://developer.yahoo.com/blogs/hadoop/posts/2011/03/mapreduce-nextgen-scheduler/

    # Yes, the pluggable Scheduler in the ResourceManager is customizable for workload specific tasks – we strive to provide a very generic, multi-tenant Scheduler out of the box. Our first scheduler is geared for capacity-based, multi organization, large scale deployments – an older post on the older ‘capacity scheduler’ is here: http://developer.yahoo.com/blogs/hadoop/posts/2011/02/capacity-scheduler/

    # To get MPI to work in NextGen Hadoop someone has to implement an MPI ApplicationMaster (AM) using the available interfaces. These MPI AMs will compete with other AMs (MapReduce AMs, Spark AMs etc.) for resources from the Scheduler and then, possibly, use OpenMPI on the resources they obtain.

    —-

    Some other clarifications:

    Currently there are production deployments of HDFS which support storage of ~30PB (effective, and ~100PB raw) and nearly 200M files. With HDFS Federation coming up in hadoop-0.23 we are confident of scaling #files by at least 5x via 5-6 NameNodes per cluster. Thus, we are looking at ~100PB of effective storage and 500M-1B files per HDFS install – I’m using 36TB/node and 6000 nodes per cluster with default replication factor of 3.

    36TB * 6000-nodes / 3-replication -> 72PB effective

    If necessary, we have headroom available to run clusters larger than 6000 nodes – something we like to have! *smile*

    I’d like to emphasize that these are real-world, production deployments based on our testing and operational experience; these are not theoretical possibilities or hypothetical scenarios.

    —-

    HDFS in hadoop-0.23 has also undergone a re-factoring to separate the storage block-management layer from the namespace-management layer. This allows us to use the block-management directly for systems such as HBase which would greatly benefit from such an arrangement – similarly it allows for custom namespace implementations.

    Conceptually, and philosophically, this is similar to the new architecture of the compute framework (NextGen MapReduce).

    —-

    Overall, Apache Hadoop continues to evolve, rapidly, to meet the needs of its users.

    Scalability and multi-tenancy have been areas of focus in the current stable release (hadoop-0.20) – we have spent a lot of time ensuring we can run shared, large-scale, multi-tenant systems while helping non-expert users get the best out of their organizational investment in Hadoop. We’ve strived to incorporate several best-practices (http://developer.yahoo.com/blogs/hadoop/posts/2010/08/apache_hadoop_best_practices_a/) while at the same time ensuring a single job, user, organization cannot adversely harm the Hadoop deployment either inadvertently (JobTracker/NameNode limits) or maliciously (strong security).

    Now, with the upcoming hadoop-0.23 release, Apache Hadoop is adapting again to meet a whole new set of demands with HDFS Federation, NextGen MapReduce and is on the path for high availability (HDFS & MR).

  3. Peng Tao on August 25th, 2011 4:49 am

    W.R.T. MPI within Hadoop, is there any discussion/investigation about whether the current API semantics are sufficient for OpenMPI to run?

  4. Arun C Murthy on September 18th, 2011 12:27 am

    Peng – there is active development on running openMPI on Hadoop: https://issues.apache.org/jira/browse/MAPREDUCE-2911. Thanks.

  5. Steve Loughran on September 18th, 2011 3:51 am

    FYI, the Apache Incubator projects Hama and Giraph implement the Bulk Synchronous Parallel model [Valiant1990], not MPI. MPI can be fast over a low-latency/higher-bandwidth/higher-cost interconnect like InfiniBand, but (in C++) the code is pretty unappealing, and it has limited resilience to failures except checkpoint + resume, which is exactly what BSP can do too; with its sync points, BSP has good barriers to safely checkpoint. What the new Resource Manager can do is allow implementations of BSP and MPI to share cluster time, along with things like a KVS and other high-resource-usage Hadoop-ecosystem apps. It should also offer more efficient use of an existing cluster than just “slots”, though you need some approximate estimates of the resource needs of future job stages for that, which is a tachyon-hard problem: you need to know what the future holds before you can schedule the work.

    Regarding Global resource management, one limitation of the Java-based design is that it (currently) strives to be x-platform: Linux, Windows, Solaris, and Arun’s MacOS laptop. Life would be simpler if we not only said “Unix only”, but “Linux only”, giving us a standard set of OS operations to use. The problem there is many developers do use Windows and OS/X for their desktop dev, and abandoning that desktop support would alienate them.

  6. Oliver Ratzesberger on September 18th, 2011 10:48 pm

    A few comments on the pre-0.23 stats mentioned here. There are theoretical limits, and then there is reality. Today’s dual 6-core nodes with e.g. 12 x 2TB drives pretty much have Hadoop running out of scale at or past the 1000-node mark.

    Why is that so? What nobody is talking about are real utilization numbers for large Hadoop deployments, or, god forbid, Parallel Efficiency studies (we are about to publish a white paper on that).

    Reality is that 1000-node clusters suffer from utilization problems that keep CPUs at or below 10-15% busy on average. What nobody tells you in public is that 4000-node clusters run as low as 4% CPU busy on average with real-life workloads.

    If you deploy 100 or 200 nodes you don’t have to deal with many of the scalability issues; going beyond 1000 nodes requires some serious engineering, unless you are OK with using less than 1/20th of the CPUs you bought.

    The other issue seen in real-world large-scale deployments at 1000 nodes and e.g. 25PB is restart times. If you happen to lose the name node and have to restart, be prepared to not see the system back for 45+ minutes. Not only is the name node in pre-0.23 a single point of failure, it also limits scalability, and at times (e.g. reboots) this will become painfully obvious.

    There is a ton of engineering that multiple companies are putting into these areas, but it will take a while to solve some fundamental scalability issues and to get 0.23 to a viable and stable next version.

    As for JobTracker, real-world efficiency starts to take a non-linear dive when you get past the mid-30,000s in tasks. As with some of the theoretical limits, don’t expect linear scalability when you get past that.

    Having said that – these are still very large clusters. A 1000-node cluster of today’s configurations with 10G interconnect will deliver a lot of workload. And 25PB of spinning disk is by no means a small system. With Cloudera, Hortonworks and others in the mix, we should see some interesting progress in the next 12-18 months.

    We have more than 20 hard-core engineers and committers on our own team; it also helps to have some other large-scale MPP platforms at our disposal for comparison.

  7. Cloudera Hadoop usage notes | DBMS 2 : DataBase Management System Services on August 25th, 2013 11:40 am

    […] MPI used to be a higher Hadoop priority (August, 2011). That’s why I’ve kept bringing it up. […]
