October 15, 2008
eBay doesn’t love MapReduce
The first time I ever heard from Oliver Ratzesberger of eBay, the subject line of his email mentioned MapReduce. That was early this year. Subsequently, however, eBay seems to have become a MapReduce non-fan. The reason is simple: eBay’s parallel efficiency tests show that MapReduce leaves most processors idle most of the time. The specific figure they mentioned was parallel efficiency of 18%.
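For readers who want the arithmetic behind a figure like that: eBay didn't spell out how they measured it, so what follows is only a minimal sketch, assuming the common textbook definition of parallel efficiency as speedup divided by processor count. The function name and all the timings are hypothetical and chosen purely to reproduce an 18% result; they are not eBay's numbers.

```python
def parallel_efficiency(serial_seconds: float, parallel_seconds: float, processors: int) -> float:
    """Textbook parallel efficiency: (serial time / parallel time) / processor count.

    This is an assumed definition; the post does not say how eBay measured it.
    """
    if serial_seconds <= 0 or parallel_seconds <= 0 or processors <= 0:
        raise ValueError("times and processor count must be positive")
    speedup = serial_seconds / parallel_seconds
    return speedup / processors


# Hypothetical job: one hour on a single processor, 200 seconds on 100 processors.
efficiency = parallel_efficiency(serial_seconds=3600.0, parallel_seconds=200.0, processors=100)
print(f"parallel efficiency: {efficiency:.0%}")
```

Under that definition, 18% efficiency on 100 processors means the job runs only about 18 times faster than it would on one processor, which is another way of saying the processors spend most of their time waiting.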
Comments
5 Responses to “eBay doesn’t love MapReduce”

I can imagine that this is a difficult thing to measure with Map/Reduce. I know Google’s implementation runs the same part of the query on several nodes, so that a performance issue on any one node doesn’t slow down the return of the result. So while this part is parallel, it is parallelism for redundancy, and whether to include it in parallel efficiency determinations is debatable.
Hmm. I would imagine eBay wasn’t including 2-4X the redundancy they think they really need to get the work done.
CAM
Tony: At least in the public Map/Reduce paper, redundant tasks are only started toward the very end of the entire Map/Reduce job, so they shouldn’t represent a very significant percentage of the total work required by the job (so the paper argues, anyway).
I think this is going to be heavily dependent on exactly what you use MapReduce for. Some things are in its sweet spot more than others, depending on how much work the map part is and how much work the reduce part is. I think it’s a bit premature for anyone to draw general conclusions about MapReduce from this one anecdote. (Note: I have never used MapReduce and am not in any way an expert; but I have read about it and believe I get the idea.)
Hi Curt,
This is an interesting topic and I would like to share my thoughts here. MapReduce is a parallelization paradigm, and hardware utilization is heavily implementation-dependent. When MapReduce is used just to process text files, as is the case with some popular implementations, the text files are randomly split across a large number of nodes and there is no notion of a “schema” as it exists in a relational database. Lack of a schema requires brute-force reading of all the data and a lot of shuffling over the network during MapReduce execution. This, in turn, puts a heavy load on disk I/O and the network, leaving processors waiting for the majority of the time. In contrast, Aster’s implementation of MapReduce inside a database results in much higher utilization levels. Aster has provided tight integration between MapReduce and SQL, and SQL/MR functions (Aster’s In-Database MapReduce) are seamlessly invoked as part of SQL. The query planner plans for SQL/MR functions just as it does for other SQL operations. This means that Aster’s MapReduce can take advantage of database functionality: schema-awareness, globally optimal query planning, partition pruning, and indexes. This results in a much smaller amount of data being read from the disks, reducing disk I/O. When data has to be shuffled, Aster nCluster’s Optimized Transport capabilities kick in, making network transport more efficient. Hence, Aster nCluster is able to take data to the processors more efficiently and makes them work at much higher levels of utilization by reducing I/O and making the network more efficient. More details of our In-Database MapReduce can be found at http://www.asterdata.com/product/mapreduce.php
Thanks,
Ajeet