When we scheduled a call to talk about Sentry, Cloudera’s Charles Zedlewski and I found time to discuss other stuff as well. One interesting part of our discussion was around the processing “frameworks” Cloudera sees as most important.
- The four biggies are:
- MapReduce. Duh.
- SQL, specifically Impala. This is as opposed to the uneasy Hive/MapReduce layering.
- Search, in the form of the recently launched, Solr-based Cloudera Search.
- “Math”, which seems mainly to come through partnerships with SAS and Revolution Analytics. I don’t know a lot about how these integrations work, but I presume they bypass MapReduce, in which case I could imagine them greatly outperforming Mahout.
- Stream processing (Storm) is next in line.
- Graph processing, e.g. Giraph, rises to at least the proof-of-concept level. Again, the hope is that it handily outperforms graph-on-MapReduce (see the sketch just after this list).
- Charles is also seeing at least POC interest in Spark.
- But MPI (Message Passing Interface) on Hadoop isn’t going anywhere fast, except to the extent it’s baked into SAS or other “math” frameworks. Generic MPI use cases evidently turn out to be a bad fit for Hadoop, due to factors such as:
- Low data volumes.
- Latencies in various parts of the system.
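For concreteness, here’s a minimal sketch of what the vertex-centric alternative to graph-on-MapReduce looks like: single-source shortest paths in Giraph’s “think like a vertex” style, closely following Giraph’s own bundled example. The class name ShortestPathsSketch and the hard-coded source vertex are mine, and the API shown follows later Giraph releases than the 0.1 linked below, so treat the details as approximate. The performance argument is that vertex state stays resident in memory across supersteps, whereas the MapReduce equivalent re-reads and re-writes the whole graph on every iteration.

```java
import java.io.IOException;

import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

// Single-source shortest paths, in the style of Giraph's bundled example.
// Each superstep, a vertex folds incoming distance estimates into its own
// value and, only if the value improved, pushes updated estimates out to
// its neighbors.
public class ShortestPathsSketch extends BasicComputation<
    LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

  private static final long SOURCE_ID = 1; // hypothetical source vertex

  @Override
  public void compute(
      Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
      Iterable<DoubleWritable> messages) throws IOException {
    if (getSuperstep() == 0) {
      vertex.setValue(new DoubleWritable(Double.MAX_VALUE));
    }
    double minDist =
        vertex.getId().get() == SOURCE_ID ? 0d : Double.MAX_VALUE;
    for (DoubleWritable message : messages) {
      minDist = Math.min(minDist, message.get());
    }
    if (minDist < vertex.getValue().get()) {
      vertex.setValue(new DoubleWritable(minDist));
      for (Edge<LongWritable, FloatWritable> edge : vertex.getEdges()) {
        sendMessage(edge.getTargetVertexId(),
            new DoubleWritable(minDist + edge.getValue().get()));
      }
    }
    // An idle vertex halts; the job ends once every vertex has voted to
    // halt and no messages are in flight.
    vertex.voteToHalt();
  }
}
```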
HBase was deliberately left out of this “frameworks” discussion, because Cloudera sees it as more of a “storage” system than a processing one.
Another good subject was offloading work to Hadoop, in a couple of different senses of “offload”:
- From general-purpose data stores, mainly RDBMS, analytic or otherwise. This sounds similar to Hortonworks’ views about efficiency-oriented offloading; batch work can be moved to Hadoop, cutting costs and/or getting more mileage from money already sunk into expensive legacy installations. The top targets here are large, centralized systems, with Teradata a clear #1 and IBM mainframes a probable #2, but anything from Oracle to the newer parallel analytic RDBMS is fair game.
- From the specialized data stores associated with fuller technology stacks. The example I had in mind was Splunk; Charles added Palantir, HP ArcSight and, in the past, Endeca. The idea here is that Hadoop is used to organize and/or index data the way those products’ native data stores would, but at higher volumes than those stores are (cost-)effective for.
On a pickier note, I encouraged Charles to push back against Hortonworks’ arguments for ORC vs. Parquet. His first claim was that ORC currently works only under Hive, while Parquet can be used from Hive, MapReduce, and so on. (Edit: But see Arun Murthy’s comment below.) I suspect this is a case where Hortonworks and Cloudera should just get over themselves, and either agree on a single file format or wind up each supporting both. There’s a lot of DBMS-like tooling in Hadoop’s future, and I have to think it will work better, or at least run faster, if it can make reliable assumptions about how data is actually stored.
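To make the interoperability claim concrete, here’s a minimal sketch of writing Parquet from plain Java, with no Hive anywhere in sight, via the parquet-avro bindings. The schema, file path, and record contents are invented for illustration, and the simple constructor shown is from the early parquet-avro releases (later versions moved to a builder), so treat the exact signature as an assumption.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;

import parquet.avro.AvroParquetWriter;

// Writes a couple of Avro records to a Parquet file, outside of Hive.
// The resulting file is just Parquet; Impala, Hive (through its Parquet
// SerDe), or a MapReduce job can all read it afterward.
public class ParquetWriteSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical two-field schema, defined inline for self-containment.
    Schema schema = new Schema.Parser().parse(
        "{\"type\": \"record\", \"name\": \"Event\", \"fields\": ["
        + "{\"name\": \"id\", \"type\": \"long\"},"
        + "{\"name\": \"payload\", \"type\": \"string\"}]}");

    AvroParquetWriter<GenericRecord> writer =
        new AvroParquetWriter<GenericRecord>(
            new Path("/tmp/events.parquet"), schema);
    try {
      for (long i = 0; i < 2; i++) {
        GenericRecord record = new GenericData.Record(schema);
        record.put("id", i);
        record.put("payload", "event-" + i);
        writer.write(record);
      }
    } finally {
      writer.close();
    }
  }
}
```

The columnar layout is decided at write time, which is exactly why agreement on a format matters: every downstream engine gets to make the same assumptions about how the data is laid out on disk.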
Related links:
- In connection with Giraph’s 0.1 release, Jakob Homan of LinkedIn contrasted Giraph with MapReduce-based graph processing.
- I wrote a series about graph processing in May 2012.
- MPI used to be a higher Hadoop priority (August 2011). That’s why I’ve kept bringing it up.