Cloudera presents the MapReduce bull case
Monday was fire-drill day regarding MapReduce vs. MPP relational DBMS. The upshot was that I was quoted in Computerworld and paraphrased in GigaOm as being a little more negative on MapReduce than I really am, in line with my comment:
Frankly, my views on MapReduce are more balanced than [my] weary negativity would seem to imply.
Tuesday afternoon the dial turned a couple notches more positive yet, when I talked with Michael Olson and Jeff Hammerbacher of Cloudera. Cloudera is a new company, built around the open source MapReduce implementation Hadoop. So far Cloudera gives away its Hadoop distribution, without charging for any sort of maintenance or subscription, and just gets revenue from professional services. Presumably, Cloudera plans for this business model to change down the road.
Much of our discussion revolved around Facebook, where Jeff directed a huge and diverse Hadoop effort. Apparently, Hadoop played much of the role of an enterprise data warehouse at Facebook — at least for clickstream/network data — including:
- 2 1/2 petabytes of data managed via Hadoop
- 10 terabytes/day of data ingested via Hadoop (Edit: Some of these metrics have been updated in a subsequent post about Facebook.)
- Ad targeting queries run every 15 minutes in Hadoop
- Dashboard roll-up queries run every hour in Hadoop
- Ad-hoc research/analytic Hadoop queries run whenever
- Anti-fraud analysis done in Hadoop
- Text mining (e.g., of things written on people’s “walls”) done in Hadoop
- Hundreds or thousands of simultaneous Hadoop queries
- JSON-based social network analysis in Hadoop
Some Facebook data, however, was put into an Oracle RAC cluster for business intelligence. And Jeff does concede that query execution is slower in Hadoop than in a relational DBMS. Hadoop was also used to build the index for Facebook’s custom text search engine.
Jeff’s reasons for liking Hadoop over relational DBMS at Facebook included:
