Google’s highly parallel file manipulator MapReduce has gotten great attention recently, after a research paper revealed:
- MapReduce is running the core Google search engine, plus much of Google Analytics and other applications.
- MapReduce is processing 400+ petabytes of data per month.
(Niall Kennedy popularized the paper and surveyed its results.)
David DeWitt and Mike Stonebraker then launched a blistering attack on MapReduce, accusing it of disregarding almost all the lessons of database management system theory and practice. A vigorous comment thread has ensued, pointing out that MapReduce is not a DBMS and asserting it therefore shouldn’t be judged as one.
While correct, that defense begs the question – what is MapReduce good for? Proponents of MapReduce highlight two advantages:
- MapReduce makes it very easy to program data transformations, including ones to which relational structures are of little relevance.
- MapReduce runs in massively parallel mode “for free,” without extra programming.
Based on those advantages, MapReduce would indeed seem to have significant uses, including:
- Specialized indexing of large quantities of data. Obviously, MapReduce was built for text indexing of the Web. But it would likely also be useful for, say, preprocessing satellite telemetry or intelligence intercepts, or for doing early steps in large-scale network traffic analysis. MapReduce may not be good for data management, but it looks good for banging stuff into specialized data management systems.
- Computer-scientific research. If you’re trying to figure out better ways to, say, digest and analyze huge amounts of astronomical data, MapReduce seems like a great platform. Today’s researchers – even the students – aren’t nearly as adept at parallel algorithms as one would hope. Perhaps we should take those complications away to let them focus on the unique parts of their work. Breakthrough programming is hard enough anyway, especially if you’re trying to do all the work yourself.
I agree that MapReduce will have limited applicability to problems that relational database management systems handle well. But there are plenty of things that relational database management systems don’t handle well, and MapReduce could be very useful for some of them.
Some August, 2008 links about MapReduce
- Three major applications of MapReduce
- Another application of MapReduce
- Sound bites about MapReduce
- Other links about MapReduce