February 7, 2008

Why the Great MapReduce Debate broke out

While chatting with Mike Stonebraker today, I finally understood why he and Dave DeWitt launched the Great MapReduce Debate:

It was all about academia.

DeWitt noticed cases where study of MapReduce replaced study of real database management in the computer science curriculum. And he thought some MapReduce-related research papers were at best misleading. So DeWitt and Stonebraker decided to set the record straight.

Fireworks ensued.

Comments

5 Responses to “Why the Great MapReduce Debate broke out”

  1. Daniel Weinreb on February 8th, 2008 10:45 am

    I think another reason is that the people publishing papers about MapReduce have not been citing earlier research, particularly DeWitt's. In the academic world, such citations are very important; leaving them out, especially when you have been told about the prior work and are thus clearly aware of it (DeWitt has pointed this out to them), would be considered unethical and improper by the rules of academia. That makes sense, given how academic careers are advanced.

    Possibly the writers of the MapReduce papers don’t see it that way, but that’s just guessing.

  2. Oliver Ratzesberger on February 11th, 2008 2:41 am

    It's about time. 😉 A few months back we had an interesting discussion about XLDB at Stanford and the different technologies and approaches used to manage extra-large data volumes. MapReduce is a very flexible algorithm for certain types of processing; it has been around for many years (or should I say decades) and is already embedded in a couple of database products. The interesting part many forget to mention: it is rather expensive, and in many cases there are alternatives, such as hash joins, that can be executed much more quickly and cheaply.
    MapReduce is pretty much the fourth or fifth choice for a modern cost-based optimizer. It has become 'famous' again because a few companies have built custom solutions based on it. There are public white papers on the internet about solutions that throw together 1,800+ pizza boxes to perform MapReduce functions on large data volumes. While the processing rates at peak are pretty impressive, the overhead of redistribution and of the non-scalable steps is rather large, leaving a lot to be desired in the parallel efficiency of such solutions.
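
    For readers who have not seen one written out, here is a minimal single-node hash join sketch in plain C (purely illustrative; the toy tables, structures, and sizes are my own assumptions, not taken from any product mentioned above). The point is that an equi-join can be answered by building a hash table on the smaller input and probing it with the larger one, with no global sort or redistribution pass of the kind a MapReduce-style job would incur:

    /* Minimal in-memory hash join sketch: build a hash table on the
     * smaller ("build") input, then probe it with the larger one. */
    #include <stdio.h>
    #include <stdlib.h>

    #define BUCKETS 1024

    typedef struct Row { int key; int value; struct Row *next; } Row;

    static Row *table[BUCKETS];                 /* hash table over the build side */

    static unsigned hash_key(int key) { return (unsigned)key % BUCKETS; }

    static void build(const int *keys, const int *vals, int n) {
        for (int i = 0; i < n; i++) {
            Row *r = malloc(sizeof *r);
            r->key = keys[i];
            r->value = vals[i];
            r->next = table[hash_key(keys[i])]; /* chain rows in the same bucket */
            table[hash_key(keys[i])] = r;
        }
    }

    static void probe(const int *keys, const int *vals, int n) {
        for (int i = 0; i < n; i++)
            for (Row *r = table[hash_key(keys[i])]; r; r = r->next)
                if (r->key == keys[i])          /* matching keys join */
                    printf("key=%d build=%d probe=%d\n", keys[i], r->value, vals[i]);
    }

    int main(void) {
        int bk[] = {1, 2, 3}, bv[] = {10, 20, 30};            /* small build side  */
        int pk[] = {2, 3, 3, 4}, pv[] = {200, 300, 301, 400}; /* larger probe side */
        build(bk, bv, 3);
        probe(pk, pv, 4);
        return 0;
    }

    In a real MPP engine the two inputs would of course be partitioned across nodes first, but the per-node pattern is the same, and for equi-joins it is usually cheaper than a full map/shuffle/reduce pass.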

  3. Mark Callaghan on February 12th, 2008 12:24 pm

    If efficiency is the concern, shouldn't it be measured in terms of work done per dollar, not work done per box? How much does a 1,000-node proprietary RDBMS cluster cost, and what is its uptime?

  4. Curt Monash on February 13th, 2008 1:19 am

    Major factors in price/performance include:

    Performance
    Cost of boxes
    Cost of software
    Maintenance of boxes
    Maintenance of software
    Administrative cost
    Energy cost (sometimes the biggest cost factor of all)

    That said, if you have a task for which RDBMS work well, you probably should use RDBMS. Writing your own RDBMS-substitute is probably not the way to go, although opinions about that split among the web biggies: Google loves its MapReduce; Yahoo and eBay seem to favor commercial products; and Amazon seems to like lots of different database products, including its own SimpleDB.

    Actually, I imagine the story is more mixed at all of them. E.g., I’m sure one could justify just about any label one wants for Yahoo’s use of MySQL.

    CAM

  5. Oliver Ratzesberger on February 14th, 2008 12:36 am

    With all the experience in massively parallel systems out there, I don't understand how people still overlook the most important factor: parallel efficiency. From the moment you throw a workload at an MPP system, how many of the systems/CPUs fully participated in the work? A system with 1,000+ CPUs is very vulnerable to huge variations in throughput whenever it runs below 100% PE. You can actually see that in the charts for the grep test: the MPP system reaches its processing peak for only a brief few seconds, which leads to massive throughput problems.
    For 1,800 servers (granted, they are an older generation) to spend 200 seconds on a simple pattern match over 10^10 records, or about 1 TB of data, should be really disappointing. Any of the common MPP systems can match that throughput with 4-12 servers or nodes, so those nodes would have to be about 250x more expensive apiece to match the MapReduce configuration in total cost. Similar for the sort.
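
    To make those numbers concrete, here is the back-of-the-envelope arithmetic as a small C sketch (the data volume, elapsed time, and server counts are the ones quoted above; the eight-node MPP configuration is just an assumed mid-point of the 4-12 range, which is why the ratio comes out near, rather than exactly at, 250x):

    /* Back-of-the-envelope arithmetic for the grep example above. */
    #include <stdio.h>

    int main(void) {
        double data_bytes  = 1e12;    /* ~1 TB scanned                       */
        double seconds     = 200.0;   /* quoted elapsed time                 */
        double mr_servers  = 1800.0;  /* quoted MapReduce cluster size       */
        double mpp_servers = 8.0;     /* assumed mid-point of the 4-12 range */

        double aggregate  = data_bytes / seconds;     /* overall bytes/s      */
        double per_node   = aggregate / mr_servers;   /* bytes/s per box      */
        double node_ratio = mr_servers / mpp_servers; /* boxes-needed ratio   */

        printf("aggregate scan rate : %.1f GB/s\n", aggregate / 1e9);  /* 5.0 */
        printf("per MapReduce server: %.1f MB/s\n", per_node / 1e6);   /* 2.8 */
        printf("node count ratio    : %.0fx\n", node_ratio);           /* 225 */
        return 0;
    }
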
    It's all about TCO. A good balance between CPU, I/O, and interconnect is required to make algorithms perform well.
    Let's not forget the usage of such systems. Today's advanced SQL implementations put all of that processing power at the fingertips of data analysts, many of whom have no programming background. And with user-defined functions you can add any map/reduce logic you desire in plain C/C++ code and make it available to all users of the RDBMS.
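
    As a rough illustration of that last point, here is what such map/reduce-style logic looks like as plain C callbacks (a sketch only: no particular vendor's UDF API is shown, since the registration mechanics differ by product, so this just shows the shape of the two functions an analyst-facing UDF would wrap):

    /* Map/reduce-style logic as plain C callbacks of the kind a
     * user-defined-function framework could wrap; the toy rows and
     * the "error" pattern below are purely illustrative. */
    #include <stdio.h>
    #include <string.h>

    /* "Map": emit 1 for every row whose text column matches a pattern. */
    static int map_match(const char *row_text, const char *pattern) {
        return strstr(row_text, pattern) != NULL;
    }

    /* "Reduce": fold the emitted values into a running aggregate. */
    static long reduce_sum(long acc, int emitted) {
        return acc + emitted;
    }

    int main(void) {
        /* Toy "table" standing in for rows the RDBMS would feed the UDF. */
        const char *rows[] = { "error: disk full", "ok", "error: timeout", "ok" };
        long matches = 0;
        for (int i = 0; i < 4; i++)
            matches = reduce_sum(matches, map_match(rows[i], "error"));
        printf("rows matching pattern: %ld\n", matches);
        return 0;
    }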
