December 30, 2009
I’m frustrated by a constant need — or at least urge — to correct myths and errors about MapReduce. Let’s try one more time:
- MapReduce was named and popularized — but not invented — by Google.
- “MapReduce” variously refers to:
  - A programming paradigm
  - Execution engines that implement the programming paradigm
  - Distributed file systems that work with the execution engines
- In particular, Hadoop is a MapReduce execution engine that includes or is closely associated with HDFS (Hadoop Distributed File System).
- MapReduce and analytic DBMS can interact in a number of different ways, including:
  - Tight integration between a DBMS and exposed MapReduce functionality, e.g. Aster Data’s SQL/MapReduce or Greenplum.
  - Integrated MapReduce “under the covers”, e.g. SenSage or Oracle. This may or may not follow all the rules Google laid out for MapReduce, but it’s at least similar in spirit.
  - Looser coupling between DBMS and a MapReduce system, e.g. Vertica/Hadoop, in which MapReduce may or may not run on a different cluster than the DBMS.
  - Not at all, except perhaps insofar as a quasi-DBMS such as Hive is implemented over a MapReduce system such as Hadoop/HDFS.
- As predicted by Monash’s First Law of Commercial Semantics, different vendors have individual variants on those themes. For example, as per a registration-required white paper, Splunk is moving to publicly expose a not-quite-complete form of MapReduce.
- MapReduce implementations such as Hadoop are sometimes regarded as part of the NoSQL “movement”. When they are, many generalities about NoSQL (such as the claim that it doesn’t deal with analytics) are falsified.
- So far as I can tell, mainstream enterprise (as opposed to web, scientific, investment, etc.) data mining folks may be looking at MapReduce, but they haven’t done much to adopt it yet. That’s probably because the outfits with the greatest need are the same ones with the largest sunk investments in more traditional ways of doing data mining.
- Cloudera != Hadoop. On the other hand, if you want to use Hadoop, it makes a lot of sense to do business with Cloudera.
- Non-DBMS MapReduce != Hadoop. On the other hand, Hadoop is the default choice for non-DBMS MapReduce.
- MapReduce != Hadoop, period. DBMS-based MapReduce is also a legitimate technical strategy.
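
To make the first of those meanings concrete, here is a minimal sketch of MapReduce as a programming paradigm only, with no execution engine or distributed file system involved. It is a single-process Python word count. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_mapreduce`) are made up for illustration and are not Hadoop’s or any other product’s API.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key. An execution engine would do this
    between the map and reduce steps; here it is an in-memory dict."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single result."""
    return key, sum(values)


def run_mapreduce(documents):
    """Hypothetical driver: run map, shuffle, and reduce in one process."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["Hadoop is not MapReduce", "MapReduce is a paradigm"]
    print(run_mapreduce(docs))
```

An execution engine such as Hadoop would spread the map and reduce calls across many machines and handle the grouping between them; a DBMS-based implementation would run the same two user-supplied functions inside the database instead. The paradigm itself is just the pair of functions.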