Along with five other coauthors — the lead author seems to be Andy Pavlo — famous MapReduce non-fans Mike Stonebraker and David DeWitt have posted a SIGMOD 2009 paper called “A Comparison of Approaches to Large-Scale Data Analysis.” The heart of the paper is a set of benchmarks of Hadoop, Vertica, and “DBMS-X” on identical clusters of 100 low-end nodes, across a series of tests including (if I understood correctly):
- A couple of different flavors of a Grep task originally proposed in a Google MapReduce paper.
- A database query on simulated clickstream data.
- A join on the same clickstream data.
- Two aggregations on the clickstream data.
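For readers who haven't seen the Grep task: the original MapReduce paper's version scans a large set of short records for ones containing a rare character pattern. A toy Python sketch of the MapReduce-style formulation might look like this (the records and pattern below are made up for illustration; the real benchmark ran over far larger data spread across nodes):

```python
# Toy sketch of the Grep task in MapReduce style.
# Real systems partition data across nodes and shuffle results;
# here we just iterate over in-memory "partitions".
from itertools import chain

def map_phase(records, pattern):
    """The 'map' step: emit each (key, record) whose record contains the pattern."""
    for key, value in records:
        if pattern in value:
            yield key, value

def run_grep(partitions, pattern):
    """Apply the map step to every partition and collect the matches."""
    return list(chain.from_iterable(map_phase(p, pattern) for p in partitions))

# Hypothetical data: two "nodes", each holding (key, record) pairs.
partitions = [
    [(1, "aaaXYZbbb"), (2, "no match here")],
    [(3, "cccXYZddd")],
]
matches = run_grep(partitions, "XYZ")
print([k for k, _ in matches])  # → [1, 3]
```

In SQL the same task is roughly a full-table scan with a `LIKE '%XYZ%'` predicate, which is the kind of scan where the DBMSs' compression helps.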
Both DBMSs outshone Hadoop, and Vertica outperformed DBMS-X. This was true both on the Grep task and on all the other DBMS-like tasks the authors specified. Reasons for the DBMSs outdoing Hadoop included compression and query optimization. Reasons for Vertica outdoing DBMS-X included the usual benefits of column stores.
More precisely, both DBMSs clobbered Hadoop on query throughput. Hadoop, however, had some advantages in data load speed and the like.
The paper also argues strenuously that for complex and/or team-oriented database programming, one is much better off using a DBMS rather than reinventing the software wheel. However, it concedes that for simple programming tasks, Hadoop may be easier and lighter-weight. For example, some of the benchmark tasks required user-defined functions (UDFs) or the equivalent, and those weren’t as easy to write in the DBMS as one might think.
Frankly, the paper is less extremely anti-MapReduce than I expected based on the authorship, or on how Mike Stonebraker framed it to me when he told me about it Monday afternoon. That said, it is absolutely in line with the DeWitt/Stonebraker meme “MapReduce isn’t nearly as good for DBMS-style processing as a DBMS is.”