December 30, 2009

Clearing up MapReduce confusion, yet again

I’m frustrated by a constant need — or at least urge 🙂 — to correct myths and errors about MapReduce. Let’s try one more time:

Comments

8 Responses to “Clearing up MapReduce confusion, yet again”

  1. Manuel Simoni on December 30th, 2009 10:05 am

    Re “MapReduce was named and popularized — but not invented — by Google.”

    Can you point to any projects that used MapReduce before it was popularized by Google?

    I can imagine that there were projects that used something similar to (subsets of) MapReduce before the MR paper was published, but I am not aware of any that are as general and well-specified as Google MR.

  2. Curt Monash on December 30th, 2009 11:47 am

    I don’t know that anybody abstracted it exactly the way Google did before Google. But it also wasn’t a conceptual breakthrough on par with, say, Codd’s idea for a relational DBMS. The predecessor ideas were floating around pretty thickly.

  3. UnHolyGuy on December 30th, 2009 5:38 pm

    I think this is the cleanest definition I’ve seen yet

    http://scienceblogs.com/goodmath/2008/01/databases_are_hammers_mapreduc.php

    “MapReduce is a library that lets you adopt a particular, stylized way of programming that’s easy to split among a bunch of machines”

    There was a ton of scatter/gather distributed/grid computing work prior to MapReduce. The conceptual breakthrough, I think, was that enforcing that everything has to be a key/value pair makes the code easy to write, and distributable without an optimizer.

    You build the enforcement of the distributability into the syntax, so to speak. That was pretty smart.

    The name itself comes from the Map and Reduce primitives that you find in a lot of languages I think?
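
The key/value point in the comment above can be illustrated with a minimal single-process sketch in Python. This is not Google's or Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the word-count task are assumptions chosen for illustration. The idea it tries to show is that each map call sees only one record and each reduce call sees only one key, so both could in principle be spread across machines without an optimizer.

```python
from collections import defaultdict
from functools import reduce


def map_phase(record):
    """Emit (key, value) pairs for one record; it never looks at other records."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group values by key. On a cluster, this is the step that moves data between nodes."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key; it never looks at other keys."""
    return key, reduce(lambda a, b: a + b, values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence (hypothetical single-process driver)."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = run_job(["map reduce map", "reduce shuffle"])
print(counts)
```

Here `map_phase` plays the role of the Map primitive and `functools.reduce` inside `reduce_phase` plays the role of the Reduce primitive; a real framework would add partitioning, fault tolerance, and storage access around the same shape.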

  4. Omer Trajman on December 31st, 2009 12:31 am

    Curt – it’s great to see these posts and I’d expect you’ll find the need to write more as MR gains adoption.

    @UnHolyGuy cites a good source for defining MR. Note that MapReduce is distinctly different from distributed file systems. As Jeff and Sanjay write in a recent CACM article: “MapReduce is storage-system independent”[1]. Tom White has an entire chapter in his Hadoop book where he discusses how “Hadoop has an abstract notion of filesystem, of which HDFS is just one implementation.”[2]

    As an example, you can run Hadoop/MapReduce on top of Amazon S3 (http://wiki.apache.org/hadoop/AmazonS3) or using Vertica without HDFS (http://www.vertica.com/Hadoop).

    [1] http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext
    [2] http://books.google.com/books?id=bKPEwR-Pt6EC&lpg=PP1&ots=kOcy-DcbHh&pg=PA49#v=onepage&q=&f=false

    @Manuel – as UnHolyGuy also points out – prior to Google’s paper, distributed MapReduce-type operations might have been called Vectored or Scatter/Gather and were generally run on very large shared subsystem (SMP) machines such as the Cray X-MP. Google’s breakthrough was running on shared-nothing (MPP) clusters and simplifying the model to use key/value records.

  5. UnHolyGuy on December 31st, 2009 6:36 pm

    @Omer, actually I’m pretty sure the US DoE labs (Sandia, LLNL and Los Alamos) were running distributed computing jobs on shared-nothing Linux MPP clusters before Google, mostly simulating nuclear explosions and solving big physics problems.

  6. Neil Raden on January 2nd, 2010 12:35 am

    UnHolyGuy,

    I can confirm what you said about the national labs. We also used these techniques for the nuclear waste repository program, doing massive stochastic processes on 3-D geology, etc.

  7. Six quite intriguing ScalOut-related articles « Agile Cat — Azure & Hadoop — Talking Book on January 4th, 2010 7:43 pm

    […] up MapReduce confusion, yet again http://www.dbms2.com/2009/12/30/clearing-up-mapreduce-confusion-yet-again/ •MapReduce was named and popularized — but not invented — by Google. •“MapReduce” […]

  8. Invention – Overloaded… « Dudefrommangalore's Weblog on November 6th, 2011 11:36 pm

    […] Good article on DBMS2 clear this confusion. MapReduce was named and popularized — but not invented — by Google. […]
