A lot of confusion seems to have built up around these facts:
- Hadoop MapReduce is being opened up into something called MapReduce 2 (MRv2).
- Something called YARN (Yet Another Resource Negotiator) is involved.
- One purpose of the whole thing is to make MapReduce optional rather than required for Hadoop.
- MPI (Message Passing Interface) was mentioned as a paradigmatic example of a MapReduce alternative, yet the MPI/YARN/Hadoop effort is somehow troubled.
- Cloudera shipped YARN in June, yet simultaneously warned people away from actually using it.
Here’s my best effort to make sense of all that, helped by a number of conversations with various Hadoop companies, but most importantly a chat Friday with Arun Murthy and other Hortonworks folks.
- YARN, as an aspect of Hadoop, has two major kinds of benefits:
- The ability to use programming frameworks other than MapReduce.
- Scalability, no matter what programming framework you use.
- The YARN availability story goes:
- YARN is in alpha.
- YARN is expected to be in production at year-end, give or take.
- Cloudera made the marketing decision to include YARN in its June Hadoop distribution release anyway, but advised that it was for experimentation rather than production.
- Hortonworks, in its own June release, only shipped code it advised putting into production.
- My take on the YARN/MPI story goes something like this:
- Numerous people have told me of YARN/MPI delays.
- One person suggested that Greenplum is taking the lead in YARN/MPI integration, but has gotten slow and reclusive, apparently due to some big company-itis.
- I find that credible because of the Greenplum/SAS/MPI connection.
- If I understood Arun correctly, the latency story on Hadoop MapReduce is approximately:
- Arun says that Hadoop’s reputation for taking tens of seconds to start a job is old news. Job startup now takes a low single-digit number of seconds.
- However, starting all that Java does take hundreds of milliseconds at best: 200 milliseconds in an ideal case, 500 milliseconds more realistically, and that’s just on a single server.
- Thus, if you want human real-time interaction, Hadoop MapReduce is not and likely never will be the way to go. Getting Hadoop MapReduce latencies under a few seconds is likely to be more trouble than it’s worth — because of MapReduce, not because of Hadoop.
- In particular — instead of incurring the overhead of starting processes up, Arun thinks low-latency needs should be met in a different way, namely by serving them from already-running processes. The examples he kept mentioning were the event processing projects Storm (out of Twitter, via an acquisition) and S4 (out of Yahoo).
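The pattern Arun describes can be sketched in a few lines. This is illustrative only, not Hadoop, Storm, or S4 code: the point is simply that workers are started once, up front, so each request pays only dispatch cost rather than process-startup (JVM launch) cost.

```python
# Illustrative sketch of "serve low-latency requests from already-running
# processes" -- not real Hadoop/Storm/S4 code. A pool of workers is started
# once; each request is merely dispatched to them.
from concurrent.futures import ThreadPoolExecutor

def handle(request: str) -> str:
    # Stand-in for real work; in Storm this would be a bolt, in S4 a PE.
    return request.upper()

# Startup cost is paid here, once, before any request arrives.
pool = ThreadPoolExecutor(max_workers=2)

# Each request now pays only dispatch cost, not worker-startup cost.
results = list(pool.map(handle, ["query-1", "query-2", "query-3"]))
print(results)  # ['QUERY-1', 'QUERY-2', 'QUERY-3']
pool.shutdown()
```

(Real event-processing systems keep long-lived OS processes on many machines; a thread pool is just the smallest self-contained stand-in for the same idea.)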
The central goal of YARN is to cleanly separate two things that are unfortunately smushed together in current Hadoop, mainly in the JobTracker:
- Monitoring the status of the cluster with respect to which nodes have which resources available. Under YARN, this will be global.
- Managing the parallel execution of any specific job. Under YARN, this will be done separately for each job.
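As a toy sketch of that separation (the class and method names here are mine for illustration, not the real YARN API): one global object tracks which nodes have free resources, while each job gets its own object that requests resources from the global one and manages only its own execution.

```python
# Toy sketch of YARN's split between global resource monitoring and
# per-job execution management. Names are illustrative, not the YARN API.

class ResourceManager:
    """Global: tracks which nodes have how many slots free."""
    def __init__(self, nodes):
        self.free = dict(nodes)  # node name -> free slots

    def allocate(self, node, slots):
        # Grant the request only if the node has capacity.
        if self.free.get(node, 0) >= slots:
            self.free[node] -= slots
            return True
        return False

class ApplicationMaster:
    """Per-job: asks the global manager for resources, runs its own tasks."""
    def __init__(self, job_id, rm):
        self.job_id, self.rm = job_id, rm
        self.containers = []

    def request(self, node, slots):
        if self.rm.allocate(node, slots):
            self.containers.append((node, slots))

rm = ResourceManager({"node1": 4, "node2": 4})
am_a = ApplicationMaster("job-a", rm)  # one master per job
am_b = ApplicationMaster("job-b", rm)
am_a.request("node1", 3)   # granted: node1 had 4 slots free
am_b.request("node1", 3)   # denied: node1 now has only 1 slot free
print(rm.free)  # {'node1': 1, 'node2': 4}
```

In pre-YARN Hadoop, the JobTracker played both roles at once; pulling the per-job half out is what removes it as a central bottleneck.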
The current Hadoop MapReduce system is fairly scalable; Yahoo runs 5,000 Hadoop jobs, truly concurrently, on a single cluster, for a total of 1.5–2 million jobs/cluster/month. Still, YARN will remove scalability bottlenecks.
At my current level of understanding, I don’t think it would be productive for me to try to explain things in a lot more detail than that.
After we talked, Arun sent over a list of links that I’ll just quote verbatim:
Alternate programming paradigms to MapReduce:
# UCB Spark: http://www.spark-project.org/
– YARN port: https://github.com/mesos/
# OpenMPI – http://www.open-mpi.org/
– YARN port: https://issues.apache.org/
# Giraph (graph processing based on Google Pregel) – http://giraph.apache.org/
– YARN port: https://issues.apache.org/
I’ll add that a September, 2011 post on Twitter Storm by David Bienvenido III was extremely helpful, as was a GitHub page on Storm concepts.
A couple more notes on all this:
- I finally understand how speculative execution works, in the context of Hadoop. Namely, if the resource scheduler perceives a risk that a subtask will finish late, bottlenecking the overall job, the system launches a second copy of that subtask; whichever copy finishes first wins.
- Apache Zookeeper is pretty central to Hadoop high availability, and is expected to stay that way even when YARN comes around.
Finally, if you’re coming from an RDBMS background, it’s natural to think of YARN as a workload management system. In that context, I’d observe:
- YARN has heretofore only managed RAM. However, …
- … Arun said he planned to check in some form of CPU management within the next week.
- I think the YARN folks need to talk with some workload management experts at the RDBMS companies to better understand the workload management state of the art.