October 15, 2008

eBay doesn’t love MapReduce

The first time I ever heard from Oliver Ratzesberger of eBay, the subject line of his email mentioned MapReduce.  That was early this year.  Subsequently, however, eBay seems to have become a MapReduce non-fan.  The reason is simple: eBay’s parallel efficiency tests show that MapReduce leaves most processors idle most of the time.  The specific figure they mentioned was parallel efficiency of 18%.
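The post doesn't spell out how eBay computed that figure, but the standard definition of parallel efficiency is speedup divided by processor count. A minimal sketch of that arithmetic (the numbers below are illustrative only, not eBay's measurements):

```python
# Parallel efficiency = speedup / number of processors,
# where speedup = serial_time / parallel_time.
# All figures here are illustrative, not eBay's actual data.

def parallel_efficiency(serial_time: float, parallel_time: float, n_procs: int) -> float:
    """Fraction of the cluster's aggregate CPU time spent doing useful work."""
    speedup = serial_time / parallel_time
    return speedup / n_procs

# Example: a job that would take 100 hours on one node finishes in
# about 5.56 hours on a 100-node cluster.
eff = parallel_efficiency(serial_time=100.0, parallel_time=5.56, n_procs=100)
print(f"{eff:.0%}")  # -> 18%
```

On this definition, an 18% figure means the cluster's processors were, in aggregate, idle or blocked on I/O roughly four-fifths of the time.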

Comments

7 Responses to “eBay doesn’t love MapReduce”

  1. Tony Bain on October 15th, 2008 4:52 pm

    I can imagine that this is a difficult thing to measure with Map/Reduce. I know Google’s implementation runs the same part of the query on several nodes to protect against any one node having a performance issue that slows the return of the result. So while this part is parallel, it is parallelism for redundancy; whether to include it in parallel-efficiency determinations is debatable.

  2. Curt Monash on October 15th, 2008 8:01 pm

    Hmm. I would imagine eBay wasn’t including 2-4X the redundancy they think they really need to get the work done.

    CAM

  3. Neil Conway on October 15th, 2008 11:03 pm

    Tony: At least in the public Map/Reduce paper, redundant tasks are only started toward the very end of the entire Map/Reduce job, so they shouldn’t represent a very significant percentage of the total work required by the job (so the paper argues, anyway).

  4. Daniel Weinreb on October 16th, 2008 6:00 am

    I think this is going to be heavily dependent on exactly what you use MapReduce for. Some things are in its sweet spot more than others, depending on how much work the map part is and how much work the reduce part is. I think it’s a bit premature for anyone to draw general conclusions about MapReduce from this one anecdote. (Note: I have never used MapReduce and am not in any way an expert; but I have read about it and believe I get the idea.)

  5. Ajeet Singh on October 17th, 2008 12:42 pm

    Hi Curt,

    This is an interesting topic and I would like to share my thoughts here. MapReduce is a parallelization paradigm, and hardware utilization is heavily implementation-dependent. When MapReduce is used just to process text files, as is the case with some popular implementations, the text files are randomly split across a large number of nodes and there is no notion of a “schema” as it exists in a relational database. The lack of a schema requires brute-force reading of all the data and a lot of shuffling over the network during MapReduce execution. This, in turn, puts a heavy load on disk I/O and the network, leaving processors waiting for the majority of the time.

    In contrast, Aster’s implementation of MapReduce inside a database achieves much higher utilization. Aster provides tight integration between MapReduce and SQL: SQL/MR functions (Aster’s In-Database MapReduce) are seamlessly invoked as part of SQL, and the query planner plans for SQL/MR functions just as it does for other SQL operations. This means that Aster’s MapReduce can take advantage of database functionality (schema-awareness, globally optimal query planning, partition pruning, and indexes), so a much smaller amount of data is read from the disks, reducing disk I/O. When data does have to be shuffled, Aster nCluster’s Optimized Transport capabilities kick in, making network transport more efficient. Hence, Aster nCluster gets data to the processors more efficiently and keeps them working at much higher levels of utilization. More details on our In-Database MapReduce can be found at http://www.asterdata.com/product/mapreduce.php

    Thanks,
    Ajeet
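    [Editor's note: the contrast Ajeet describes, brute-force scanning of unstructured files versus schema-aware partition pruning, can be sketched in a few lines. This is a toy illustration with hypothetical names and data, not Aster's or Hadoop's actual implementation.]

```python
# Toy contrast: without a schema, every record must be read and filtered
# (brute-force scan); with partition metadata, irrelevant partitions are
# skipped entirely. All names and data below are hypothetical.
from collections import defaultdict

records = [
    {"day": "2008-10-01", "user": "a", "amount": 10},
    {"day": "2008-10-01", "user": "b", "amount": 20},
    {"day": "2008-10-15", "user": "a", "amount": 5},
]

def scan_all(recs, day):
    """Brute-force scan: touch every record, filter afterward."""
    touched = 0
    out = []
    for r in recs:
        touched += 1
        if r["day"] == day:
            out.append(r)
    return out, touched

# Schema-aware layout: data pre-partitioned by day at load time.
partitions = defaultdict(list)
for r in records:
    partitions[r["day"]].append(r)

def scan_pruned(parts, day):
    """Partition-pruned read: only the matching partition is touched."""
    part = parts.get(day, [])
    return list(part), len(part)

hits_a, touched_a = scan_all(records, "2008-10-15")
hits_b, touched_b = scan_pruned(partitions, "2008-10-15")
print(touched_a, touched_b)  # brute force touches 3 records, pruned touches 1
```

    The same answer comes back either way; the difference is how much data had to move off disk to produce it, which is the I/O gap the comment above is pointing at.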

  6. Notes from a visit to Teradata | DBMS 2 : DataBase Management System Services on September 1st, 2014 6:22 pm

    […] Speaking of not being CPU-constrained — I heard 7-10% as an estimate for typical Hadoop utilization, and also 10-15%. While I didn’t ask, I presume these figures assume traditional MapReduce types of Hadoop workloads. I’m not sure why these figures are yet lower than eBay’s long-ago estimates of Hadoop “parallel efficiency”. […]

  7. The Future Of Hadoop Is Cloudy, With A Chance Of Growing Ecosystem - THE SENTIENT ENTERPRISE on July 15th, 2015 9:26 am

    […] the time. Back then I exchanged emails with industry watcher Curt Monash who wrote at the time, “eBay doesn’t love MapReduce.” At eBay, we thoroughly evaluated Hadoop and came to the clear conclusion that MapReduce is […]
