July 23, 2012

Hadoop YARN — beyond MapReduce

A lot of confusion seems to have built around the facts:

Hadoop MapReduce is being opened up into something called MapReduce 2 (MRv2).
Something called YARN (Yet Another Resource Negotiator) is involved.
One purpose of the whole thing is to make MapReduce not be required for Hadoop.
MPI (Message Passing Interface) was mentioned as a paradigmatic example of a MapReduce alternative, yet the MPI/YARN/Hadoop effort is somehow troubled.
Cloudera shipped YARN in June, yet simultaneously warned people away from actually using it.

Here’s my best effort to make sense of all that, helped by a number of conversations with various Hadoop companies, but most importantly a chat Friday with Arun Murthy and other Hortonworks folks.

YARN, as an aspect of Hadoop, has two major kinds of benefits:
- The ability to use programming frameworks other than MapReduce.
- Scalability, no matter what programming framework you use.
The YARN availability story goes:
- YARN is in alpha.
- YARN is expected to be in production at year-end, give or take.
- Cloudera made the marketing decision to include YARN in its June Hadoop distribution release anyway, but advised that it was for experimentation rather than production.
- Hortonworks, in its own June release, only shipped code it advised putting into production.
My take on the YARN/MPI story goes something like this:
- Numerous people have told me of YARN/MPI delays.
- One person suggested that Greenplum is taking the lead in YARN/MPI integration, but has gotten slow and reclusive, apparently due to some big company-itis.
- I find that credible because of the Greenplum/SAS/MPI connection.
If I understood Arun correctly, the latency story on Hadoop MapReduce is approximately:
- Arun says that Hadoop’s reputation for taking 10s of seconds to start a Hadoop job is old news. It takes a low single-digit number of seconds.
- However, starting all that Java does take 100s of milliseconds at best — 200 milliseconds in an ideal case, 500 milliseconds more realistically, and that’s just on a single server.
- Thus, if you want human real-time interaction, Hadoop MapReduce is not and likely never will be the way to go. Getting Hadoop MapReduce latencies under a few seconds is likely to be more trouble than it’s worth — because of MapReduce, not because of Hadoop.
In particular — instead of incurring the overhead of starting processes up, Arun thinks low-latency needs should be met in a different way, namely by serving them from already-running processes. The examples he kept mentioning were the event processing projects Storm (out of Twitter, via an acquisition) and S4 (out of Yahoo).

The central goal of YARN is to clearly separate two things that are unfortunately smushed together in current Hadoop, specifically in (mainly) JobTracker:

Monitoring the status of the cluster with respect to which nodes have which resources available. Under YARN, this will be global.
Managing the parallelization execution of any specific job. Under YARN, this will be done separately for each job.

The current Hadoop MapReduce system is fairly scalable — Yahoo runs 5000 Hadoop jobs, truly concurrently, on a single cluster, for a total 1.5 – 2 millions jobs/cluster/month. Still, YARN will remove scalability bottlenecks.

At my current level of understanding, I don’t think it would be productive for me to try to explain things in a lot more detail than that. 🙂

After we talked, Arun sent over a list of links that I’ll just quote verbatim:

Real-time processing:
# Twitter Storm – https://github.com/nathanmarz/storm/wiki
# Apache S4 – http://incubator.apache.org/s4/
– YARN port: https://issues.apache.org/jira/browse/S4-25

Alternate programming paradigms to MapReduce:
# UCB Spark: http://www.spark-project.org/
– YARN port: https://github.com/mesos/spark-yarn/
# OpenMPI – http://www.open-mpi.org/
# HAMA
– YARN port: https://issues.apache.org/jira/browse/HAMA-431
# Giraph (graph processing based on Google Pregel) – http://giraph.apache.org/
– YARN port: https://issues.apache.org/jira/browse/GIRAPH-13

I’ll add that a September, 2011 post on Twitter Storm by David Bienvenido III was extremely helpful, as is a GitHub page on Storm concepts.

A couple more notes on all this:

I finally understand how speculative execution works, in the context of Hadoop. Namely, if the resource scheduler perceives a risk that a subtask will finish late, bottlenecking the overall job, the system will clone the process and run a second copy. Whichever finishes first wins.
Apache Zookeeper is pretty central to Hadoop high availability, and is expected to stay that way even when YARN comes around.

Finally, if you’re coming from an RDBMS background, it’s natural to think of YARN as a workload management system. In that context, I’d observe:

YARN has heretofore only managed RAM. However, …
… Arun said he planned to check in some form of CPU management within the next week.
I think the YARN folks need to talk with some workload management experts at the RDBMS companies to better understand the workload management state of the art.

Categories: Cloudera, Hadoop, Hortonworks, MapReduce, Workload management, Yahoo

Subscribe to our complete feed!

Comments

7 Responses to “Hadoop YARN — beyond MapReduce”

Charles Zedlewski on July 24th, 2012 4:43 am

Hi Curt,

One correction:

We (Cloudera) didn’t ship YARN for marketing purposes. We shipped it because it was part of Hadoop 2.0 and our policy is not to strip out whole sections of code from Apache releases but rather to use a “known issues and workarounds” process for features or components that we don’t think are ready for production. There’s a more extensive list of those here: https://ccp.cloudera.com/display/CDH4DOC/Known+Issues+and+Work+Arounds+in+CDH4. You can see this isn’t limited to YARN.

When you build a platform out of 14+ open source projects it is inevitable that there will be features that are not production ready that you need to document since there are non-Cloudera employees who also commit features to these projects. Declaring limitations is also a standard practice in proprietary software.

An added benefit of shipping features as “tech preview” is you get developers to try them out and find early issues which is good QA (a common practice with Linux distributions). We’ve already seen this feedback on CDH-users.
M-A-O-L » PaaS on Hadoop Yarn on July 25th, 2012 2:48 am

[…] And Monash also recently looked at Yarn. […]
Introduction to Continuuity | DBMS 2 : DataBase Management System Services on November 1st, 2012 7:14 am

[…] Aggressive use of Hadoop, including newer capabilities such as YARN and MapReduce 2. […]
Hadoop distributions | DBMS 2 : DataBase Management System Services on February 27th, 2013 6:41 am

[…] is still focused on Hadoop 1 (without YARN and so on), because that’s what’s regarded as production-ready. But Hortonworks does […]
Hadoop execution enhancements | DBMS 2 : DataBase Management System Services on March 11th, 2013 6:21 am

[…] Hadoop 2.0/YARN is the first big step in evolving Hadoop beyond a strict Map/Reduce paradigm, in that it at least allows for the possibility of non- or beyond-MapReduce processing engines. While YARN didn’t meet its target of general availability around year-end 2012, Arun Murthy of Hortonworks told me recently that: […]
Syncsort extends Hadoop MapReduce | DBMS 2 : DataBase Management System Services on May 29th, 2013 7:25 am

[…] primarily designed for Hadoop 2, and was adopted into Hadoop 2.03 in […]
Hortonworks, Hadoop, Stinger and Hive | DBMS 2 : DataBase Management System Services on August 6th, 2013 6:51 pm

[…] of workloads on a single cluster. (See, for example, what I previously posted about Tez and YARN.) Timing notes for Hadoop 2.0 […]

Leave a Reply

Search our blogs and white papers

Monash Research blogs

DBMS 2 covers database management, analytics, and related technologies.
Text Technologies covers text mining, search, and social software.
Strategic Messaging analyzes marketing and messaging strategy.
The Monash Report examines technology and public policy issues.
Software Memories recounts the history of the software industry.

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.

Links
- Monash Research
- White Papers
Admin
- Log in

Hadoop YARN — beyond MapReduce

Comments

Search our blogs and white papers

Monash Research blogs

User consulting

Vendor advisory

Monash Research highlights

Recent posts

Categories

Date archives

Admin