In response to some recent posts I’ve written about MapReduce, Mike Stonebraker just got on the phone to give me his views. His core claim, more or less, is that anything you can do in MapReduce you could already do in a parallel database that complies with SQL-92 and/or has PostgreSQL underpinnings. In particular, Mike says:
- Map functions can’t do anything you can’t also do in PostgreSQL user-defined functions (assuming, of course, PostgreSQL UDFs can be written in the language you want to use).
- Reduce functions can’t do anything you can’t also do in PostgreSQL user-defined aggregates (with the same caveat).
- Map and Reduce functions always write their result sets to disk. This can create a large performance loss.
- Map and Reduce functions require new instances to be fired up to run them. This can also create a large performance loss. (Without checking, I’m guessing that one is very implementation-specific. I.e., even if it’s true of Hadoop, it may not be true of Greenplum’s or Aster Data’s MapReduce implementations.)
- Mike and his associates are working on benchmarks that he believes will show MapReduce performance to be 10X worse than that of a parallel row-based SQL DBMS, and 100X worse than that of a columnar SQL DBMS.
- MapReduce doesn’t play nicely with the SQL analytics (e.g., windowing) parts of the SQL standard.
- The one advantage Mike concedes to MapReduce — more graceful degradation when nodes fail — isn’t that important in the hardware configurations on which parallel analytic DBMS actually run today. I.e., a Greenplum or Vertica installation is going to have nodes fail much more rarely than a Google data center will.
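Mike’s first two points are easiest to see with the canonical word-count example. The sketch below is mine, not his, and the function names are hypothetical: it computes word counts once via explicit Map/shuffle/Reduce phases, and once via a single grouped aggregation, which is roughly what a SQL engine does with GROUP BY plus a (possibly user-defined) aggregate. Both paths produce the same result.

```python
from collections import defaultdict

def map_phase(docs):
    """Map: emit (word, 1) pairs, as in the canonical word-count example."""
    for doc in docs:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Group values by key, as the MapReduce framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum counts per word -- behaviorally a user-defined aggregate."""
    return {word: sum(counts) for word, counts in groups.items()}

def sql_style_aggregate(docs):
    """Same result in one pass, like SELECT word, count(*) ... GROUP BY word."""
    counts = defaultdict(int)
    for doc in docs:
        for word in doc.split():
            counts[word] += 1
    return dict(counts)

docs = ["the quick fox", "the lazy dog", "the fox"]
assert reduce_phase(shuffle(map_phase(docs))) == sql_style_aggregate(docs)
```

The point of the comparison is Stonebraker’s: the Reduce step is just an aggregate function applied per group, so a DBMS that lets you register your own aggregates (as PostgreSQL does) can express the same computation without the intermediate materialization.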
Bottom line: Mike Stonebraker more than disagrees with the claim that MapReduce is a valuable addition to SQL data warehousing, on somewhat different grounds than he emphasized in the Great MapReduce Debate last January.