July 23, 2012

Hadoop YARN — beyond MapReduce

A lot of confusion seems to have built around the facts:

Here’s my best effort to make sense of all that, helped by a number of conversations with various Hadoop companies, but most importantly a chat Friday with Arun Murthy and other Hortonworks folks.

The central goal of YARN is to clearly separate two things that are unfortunately smushed together in current Hadoop, specifically in (mainly) JobTracker:

The current Hadoop MapReduce system is fairly scalable: Yahoo runs 5000 Hadoop jobs, truly concurrently, on a single cluster, for a total of 1.5 to 2 million jobs/cluster/month. Still, YARN will remove scalability bottlenecks.

At my current level of understanding, I don’t think it would be productive for me to try to explain things in a lot more detail than that. :)

After we talked, Arun sent over a list of links that I’ll just quote verbatim:

Real-time processing:
# Twitter Storm - https://github.com/nathanmarz/storm/wiki
# Apache S4 - http://incubator.apache.org/s4/
- YARN port: https://issues.apache.org/jira/browse/S4-25

Alternate programming paradigms to MapReduce:
# UCB Spark: http://www.spark-project.org/
- YARN port: https://github.com/mesos/spark-yarn/
# OpenMPI - http://www.open-mpi.org/
# HAMA
- YARN port: https://issues.apache.org/jira/browse/HAMA-431
# Giraph (graph processing based on Google Pregel) - http://giraph.apache.org/
- YARN port: https://issues.apache.org/jira/browse/GIRAPH-13

I’ll add that a September, 2011 post on Twitter Storm by David Bienvenido III was extremely helpful, as is a GitHub page on Storm concepts.

A couple more notes on all this:

Finally, if you’re coming from an RDBMS background, it’s natural to think of YARN as a workload management system. In that context, I’d observe:

Comments

7 Responses to “Hadoop YARN — beyond MapReduce”

  1. Charles Zedlewski on July 24th, 2012 4:43 am

    Hi Curt,

    One correction:

    We (Cloudera) didn’t ship YARN for marketing purposes. We shipped it because it was part of Hadoop 2.0 and our policy is not to strip out whole sections of code from Apache releases but rather to use a “known issues and workarounds” process for features or components that we don’t think are ready for production. There’s a more extensive list of those here: https://ccp.cloudera.com/display/CDH4DOC/Known+Issues+and+Work+Arounds+in+CDH4. You can see this isn’t limited to YARN.

    When you build a platform out of 14+ open source projects it is inevitable that there will be features that are not production ready that you need to document since there are non-Cloudera employees who also commit features to these projects. Declaring limitations is also a standard practice in proprietary software.

    An added benefit of shipping features as “tech preview” is you get developers to try them out and find early issues which is good QA (a common practice with Linux distributions). We’ve already seen this feedback on CDH-users.

  2. M-A-O-L » PaaS on Hadoop Yarn on July 25th, 2012 2:48 am

    [...] And Monash also recently looked at Yarn. [...]

  3. Introduction to Continuuity | DBMS 2 : DataBase Management System Services on November 1st, 2012 7:14 am

    [...] Aggressive use of Hadoop, including newer capabilities such as YARN and MapReduce 2. [...]

  4. Hadoop distributions | DBMS 2 : DataBase Management System Services on February 27th, 2013 6:41 am

    [...] is still focused on Hadoop 1 (without YARN and so on), because that’s what’s regarded as production-ready. But Hortonworks does [...]

  5. Hadoop execution enhancements | DBMS 2 : DataBase Management System Services on March 11th, 2013 6:21 am

    [...] Hadoop 2.0/YARN is the first big step in evolving Hadoop beyond a strict Map/Reduce paradigm, in that it at least allows for the possibility of non- or beyond-MapReduce processing engines. While YARN didn’t meet its target of general availability around year-end 2012, Arun Murthy of Hortonworks told me recently that: [...]

  6. Syncsort extends Hadoop MapReduce | DBMS 2 : DataBase Management System Services on May 29th, 2013 7:25 am

    [...] primarily designed for Hadoop 2, and was adopted into Hadoop 2.03 in [...]

  7. Hortonworks, Hadoop, Stinger and Hive | DBMS 2 : DataBase Management System Services on August 6th, 2013 6:51 pm

    [...] of workloads on a single cluster. (See, for example, what I previously posted about Tez and YARN.) Timing notes for Hadoop 2.0 [...]
