April 14, 2009

Stonebraker, DeWitt, et al. compare MapReduce to DBMS

Along with five other coauthors — the lead author seems to be Andy Pavlo — famous MapReduce non-fans Mike Stonebraker and David DeWitt have posted a SIGMOD 2009 paper called “A Comparison of Approaches to Large-Scale Data Analysis.” The heart of the paper is benchmarks of Hadoop, Vertica, and “DBMS-X” on identical clusters of 100 low-end nodes, across a series of tests that (if I understood correctly) ranged from a simple Grep to more complex, DBMS-like tasks.

Both DBMSs outshone Hadoop, and Vertica outperformed DBMS-X. This was true on the Grep task and on all the other, more DBMS-like tasks the authors specified. Reasons for the DBMSs outdoing Hadoop included compression and query optimization. Reasons for Vertica outdoing DBMS-X included the usual benefits of column stores.

More precisely, both DBMSs clobbered Hadoop on query throughput. Hadoop, however, had some advantages in data load speed and the like.
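
If you haven’t looked at MapReduce code before, here is a rough idea of what the Grep task boils down to as a Hadoop job. This is only an illustrative sketch against the newer org.apache.hadoop.mapreduce API; the class name and search pattern are placeholders, not what the authors actually benchmarked:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    /** Map-only "grep": emit every input line that contains a fixed substring. */
    public class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        private static final String PATTERN = "XYZ"; // illustrative search string

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // With TextInputFormat, each call gets a (byte offset, line of text) pair.
            if (line.toString().contains(PATTERN)) {
                context.write(line, NullWritable.get());
            }
        }
    }

A map-only job like this, run with zero reducers, is about as simple as MapReduce programming gets; the more DBMS-like tasks in the benchmark (joins, aggregates) take considerably more hand-written code.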

The paper also argues strenuously that for complex and/or team-oriented database programming, one is much better off using a DBMS rather than reinventing the software wheel. However, it concedes that for simple programming tasks, Hadoop may be easier and lighter-weight. For example, some of the benchmark tasks required user-defined functions (UDFs) or the equivalent, and those weren’t as easy to write in the DBMS as one might think.

Frankly, the paper is less extremely anti-MapReduce than I expected based on the authorship, or on how Mike Stonebraker framed it to me when he told me about it Monday afternoon. That said, it is absolutely in line with the DeWitt/Stonebraker meme “MapReduce isn’t nearly as good for DBMS-style processing as a DBMS is.”

Comments

6 Responses to “Stonebraker, DeWitt, et al. compare MapReduce to DBMS”

  1. There always seems to be a fire drill around MapReduce news | DBMS2 -- DataBase Management System Services on April 14th, 2009 5:10 am

    […] the benchmark particulars, and eventually posted a link to the paper too. And I rushed out several related blog […]

  2. Steven on April 14th, 2009 9:42 am

    I’m neither anti nor pro MapReduce, probably because I have only read about it. Is it purely that it indirectly casts aside a DBMS? MapReduce strikes me as something to be used to categorize (key) large blobs of data where the only three things you know at the time of categorization are that you have a blob of data, you’ll get some random key, and you’ll get more blobs of data at a later time. Are the anti-MapReduce (pro-DBMS?) people saying that you should go analyze all the blobs and key them every possible way into a schema? Or are they saying that there is a solution in between the two solutions?

  3. Steve Wooledge on April 14th, 2009 10:03 am

    While we agree with many of the points in the study, it misses the big picture…why wouldn’t you use both SQL AND MapReduce? Asking if you should use SQL OR MapReduce is like asking if you should tie your left or right hand behind your back. SQL is very good at some things, and MapReduce is very good at others. Why not leverage the best of both worlds – use SQL for traditional database operations and MapReduce for richer analysis that cannot be expressed by SQL, in a single system.

    While the study notes that MapReduce also requires developers to write features or perform tasks manually that can be done automatically by most SQL databases, we have eliminated that hassle by providing both SQL and MapReduce capabilities. So essentially, our customers can maximize developer productivity, using SQL for regular data processing and MapReduce for richer analysis.

    At the end of the day, MapReduce is a technology that some vendors are and should be quite afraid of (isn’t that usually why they sponsor studies? ;-), since it provides some amazing capabilities. As a developer or DBA, why on earth wouldn’t you leverage the power of both?

    Thanks,
    Steve

    P.S. We recently blogged about our Enterprise-class MapReduce capabilities and noted the key advantages that a system like ours provides over a pureplay MapReduce implementation – http://www.asterdata.com/blog/index.php/2009/04/02/enterprise-class-mapreduce/

    Here are even more examples of why you would want to use both SQL and MapReduce: http://www.asterdata.com/blog/index.php/2009/03/13/sqlmapreduce-faster-answers-to-your-toughest-queries/

  4. Ashwin Jayaprakash on April 14th, 2009 3:56 pm

    Something very odd about the Hadoop/Java tests. The JVM arguments say “-client”. Now, anybody who has worked on Java can tell you that you are supposed to use the “-server” option. The server and client JVM optimizations are worlds apart.
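
    (For concreteness: in Hadoop of that vintage, the JVM flags for task processes come from the mapred.child.java.opts property, so forcing the server VM from a job driver would look roughly like the sketch below. The class names and heap size are made up for illustration, and this is not a claim about what the benchmark actually configured.)

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.io.NullWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;

        public class GrepDriver {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                // Hand -server (and a heap size) to every map/reduce task JVM.
                conf.set("mapred.child.java.opts", "-server -Xmx512m");
                Job job = new Job(conf, "grep");          // Hadoop 0.20-era constructor
                job.setJarByClass(GrepDriver.class);
                job.setMapperClass(GrepMapper.class);     // e.g. the mapper sketched earlier
                job.setNumReduceTasks(0);                 // map-only job
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(NullWritable.class);
                // FileInputFormat/FileOutputFormat paths omitted for brevity.
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }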

  5. Vertica Gathers Momentum with New Release « Market Strategies for IT Suppliers on May 6th, 2009 9:30 pm

    […] up for his usual provocative comments – see the discussion of Mapreduce on Curt Monash’s blog here for a great example, if you like to dig deep. (Curt also has several useful posts about Vertica well […]

  6. Daniel Abadi on Kickfire and related subjects | DBMS2 -- DataBase Management System Services on June 7th, 2009 6:26 pm

    […] In general, seeing Abadi be so favorable toward Vertica competitors adds credibility to the recent Hadoop vs. DBMS paper. […]
