MapReduce is definitely gaining traction, especially but by no means only in the form of Hadoop. In the aftermath of Hadoop World, Jeff Hammerbacher of Cloudera walked me quickly through 25 customers he pulled from Cloudera’s files. Facts and metrics ranged widely, of course:
- Some are in heavy production with Hadoop, and closely engaged with Cloudera. Others are active Hadoop users but are very secretive. Yet others signed up for initial Hadoop training last week.
- Some have Hadoop clusters in the thousands of nodes. Many have Hadoop clusters in the 50-100 node range. Others are just prototyping Hadoop use. And one seems to be “OEMing” a small Hadoop cluster in each piece of equipment sold.
- Many export data from Hadoop to a relational DBMS; many others just leave it in HDFS (Hadoop Distributed File System), e.g. with Hive as the query language, or, in exactly one case, Jaql.
- Some are household names, in web businesses or otherwise. Others seem to be pretty obscure.
- Industries include financial services, telecom (Asia only, and quite new), bioinformatics (and other research), intelligence, and lots of web and/or advertising/media.
- Application areas mentioned — and these overlap in some cases — include:
  - Log and/or clickstream analysis of various kinds
  - Marketing analytics
  - Machine learning and/or sophisticated data mining
  - Image processing
  - Processing of XML messages
  - Web crawling and/or text processing
  - General archiving, including of relational/tabular data, e.g. for compliance
We went over this list so quickly that we didn’t go into much detail on any one user. But one example that stood out was of an ad serving firm that had an “aggregation pipeline” consisting of 70-80 MapReduce jobs.
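To make the "aggregation pipeline" idea concrete, here is a minimal sketch of chained MapReduce-style jobs, where the output of one job is the input to the next. Everything here is hypothetical and runs in-process in plain Python; a real pipeline like the ad-serving firm's would be 70-80 actual Hadoop jobs reading and writing HDFS, not two toy functions.

```python
from itertools import groupby
from operator import itemgetter

def run_mr(records, mapper, reducer):
    """Minimal in-process stand-in for one MapReduce job:
    map each record, shuffle/sort by key, then reduce each group."""
    mapped = [kv for rec in records for kv in mapper(rec)]
    mapped.sort(key=itemgetter(0))  # the "shuffle" phase
    return [reducer(k, [v for _, v in grp])
            for k, grp in groupby(mapped, key=itemgetter(0))]

# Job 1: count ad impressions per (campaign, hour) from raw log lines.
def impression_mapper(line):
    campaign, hour, _user = line.split("\t")
    yield ((campaign, hour), 1)

def count_reducer(key, values):
    return (key, sum(values))

# Job 2: roll hourly counts up to per-campaign totals, consuming the
# output of job 1 -- the essence of a chained pipeline.
def rollup_mapper(kv):
    (campaign, _hour), count = kv
    yield (campaign, count)

log = ["c1\t00\tu1", "c1\t00\tu2", "c1\t01\tu3", "c2\t00\tu4"]
hourly = run_mr(log, impression_mapper, count_reducer)
totals = run_mr(hourly, rollup_mapper, count_reducer)
print(dict(totals))  # {'c1': 3, 'c2': 1}
```

The point of the sketch is the chaining: each stage is an independent map/shuffle/reduce pass, which is why such pipelines accumulate dozens of jobs.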
I also talked yesterday again with Omer Trajman of Vertica, who surprised me by indicating that a high single-digit number of Vertica’s customers were in production with Hadoop — i.e., over 10% of Vertica’s production customers. (Vertica recently made its 100th sale, and of course not all those buyers are in production yet.) Vertica/Hadoop usage seems to have started in Vertica’s financial services stronghold — specifically in financial trading — with web analytics and the like coming on afterwards. Based on current prototyping efforts, Omer expects bioinformatics to be the third production market for Vertica/Hadoop, with telecommunications coming in fourth.
Unsurprisingly, the general Vertica/Hadoop usage model seems to be:
- Do something to the data in Hadoop
- Dump it into Vertica to be queried
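That two-step pattern can be sketched as follows. This is an illustration, not Vertica's actual interface: sqlite3 stands in for Vertica, the table and column names are made up, and the "Hadoop output" is faked as an in-memory list of rows a MapReduce job might have written to HDFS as tab-separated files.

```python
import sqlite3

# Step 1: "do something to the data in Hadoop" -- faked here as the
# reduced result set a MapReduce job might have left in HDFS.
hadoop_output = [
    ("c1", "2009-10-12", 31415),   # (campaign, day, impressions)
    ("c2", "2009-10-12", 2718),
]

# Step 2: bulk-load the processed rows into the warehouse to be queried.
# sqlite3 is a stand-in; against Vertica this would be a bulk COPY or a
# connector-driven load rather than row-by-row INSERTs.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE impressions (campaign TEXT, day TEXT, hits INTEGER)")
db.executemany("INSERT INTO impressions VALUES (?, ?, ?)", hadoop_output)

# Analysts then query the loaded table with plain SQL.
total = db.execute("SELECT SUM(hits) FROM impressions").fetchone()[0]
print(total)  # 34133
```

The division of labor is the interesting part: Hadoop handles the heavy batch transformation, while the columnar DBMS handles fast interactive SQL over the results.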
What I did find surprising is that the data often isn’t reduced by this analysis, but rather exploded in size. E.g., a complete store of mortgage trading data might be a few terabytes in size, but Hadoop-based post-processing can increase that by one or two orders of magnitude. (Analogies to the importance and magnitude of “cooked” data in scientific data processing come to mind.)
And finally, I talked to Aster a few days ago about the usage of its nCluster/Hadoop connector. Aster characterized Aster/Hadoop users’ Hadoop usage as being of the batch/ETL variety, which is the classic use case one concedes to Hadoop even if one believes that MapReduce should commonly be done right in the DBMS.