Monday was fire-drill day regarding MapReduce vs. MPP relational DBMS. The upshot was that I was quoted in Computerworld and paraphrased in GigaOm as being a little more negative on MapReduce than I really am, in line with my comment:
> Frankly, my views on MapReduce are more balanced than [my] weary negativity would seem to imply.
Tuesday afternoon the dial turned a couple of notches more positive yet, when I talked with Michael Olson and Jeff Hammerbacher of Cloudera. Cloudera is a new company, built around the open source MapReduce implementation Hadoop. So far Cloudera gives away its Hadoop distribution, without charging for any sort of maintenance or subscription, and just gets revenue from professional services. Presumably, Cloudera plans for this business model to change down the road.
Much of our discussion revolved around Facebook, where Jeff directed a huge and diverse Hadoop effort. Apparently, Hadoop played much of the role of an enterprise data warehouse at Facebook — at least for clickstream/network data — including:
- 2 1/2 petabytes of data managed via Hadoop
- 10 terabytes/day of data ingested via Hadoop (Edit: Some of these metrics have been updated in a subsequent post about Facebook.)
- Ad targeting queries run every 15 minutes in Hadoop
- Dashboard roll-up queries run every hour in Hadoop
- Ad-hoc research/analytic Hadoop queries run whenever
- Anti-fraud analysis done in Hadoop
- Text mining (e.g., of things written on people’s “walls”) done in Hadoop
- 100s or 1000s of simultaneous Hadoop queries
- JSON-based social network analysis in Hadoop
Some Facebook data, however, was put into an Oracle RAC cluster for business intelligence. And Jeff does concede that query execution is slower in Hadoop than in a relational DBMS. Hadoop was also used to build the index for Facebook’s custom text search engine.
Jeff’s reasons for liking Hadoop over relational DBMS at Facebook included:
- Price. Hadoop is free. MPP relational DBMS generally aren’t.
- Re-purposed data transformation logic. Facebook has lots of code sitting around in, e.g., Python to massage various specific kinds of data on its site. This code is re-used in Hadoop for ETL/ELT/ELTL/whatever.
- Resource management. Amazingly, Jeff found it easier to build a custom Hadoop resource manager to deal with the 100s or 1000s of concurrent queries than to rely on the native capabilities of a DBMS.
- Schema flexibility. This is a subject I’ve been preaching about for years. When people interact with websites, the best schema to store data from their interactions changes just as quickly as the nature of their possible interactions does. Of course, when you add new features to a website, you can capture anything you like on a glorified entity-attribute-value basis. (Actually, I guess it would be more like EventDescriptor-SessionIdentifierClue-Timestamp.) But evolving a relational schema rapidly enough to keep up is hard. Facebook found it easier to evolve its Hadoop-based data massagers instead. (I’ve usually suggested running with XML or an XML-like approach, but notwithstanding the case of MarkLogic/OpenConnect, that’s not usually the way network analytics implementers choose to go.)
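To make the schema-flexibility point concrete, here is a minimal Python sketch of what such a map-side “data massager” might look like. This is my illustration, not Facebook’s actual code: each logged event is treated as a loose bag of attributes, known fields are normalized, and everything else passes through untouched — so a new site feature needs no schema migration, only (eventually) a tweak to this function. The field names (`type`, `session_id`, `timestamp`) are hypothetical.

```python
import json

def massage_event(raw_line):
    """Map-side transform: normalize one raw event record.

    Unknown attributes pass through untouched; missing ones get defaults.
    """
    event = json.loads(raw_line)
    return {
        "event_type": event.get("type", "unknown"),
        "session": event.get("session_id"),   # the "SessionIdentifierClue"
        "ts": event.get("timestamp"),
        # Everything else is kept as-is for downstream jobs that care.
        "extra": {k: v for k, v in event.items()
                  if k not in ("type", "session_id", "timestamp")},
    }

# An event from a brand-new, hypothetical feature flows through unchanged:
line = '{"type": "poke_v2", "session_id": "s42", "timestamp": 1234567890, "target": "u99"}'
record = massage_event(line)
```

In a relational warehouse, the new `target` attribute would force a schema change before it could even be stored; here it just rides along in `extra` until somebody writes a job that wants it.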
More generally, Jeff argues there are tasks better programmed in Hadoop than SQL. He generally leans that way when data is complex, or when the programmers are high-performance computing types who aren’t experienced DBMS users anyway. One specific example is graph construction and traversal; there seems to be considerable adoption of Hadoop for graph analysis in the national intelligence sector.