Hadoop is immature technology. As such, it naturally offers much room for improvement in both industrial-strengthness and performance. And since Hadoop is booming, multiple efforts are underway to fill those gaps. For example:
- Cloudera’s proprietary code is focused on management, set-up, etc.
- The “Phase 1” plans Hortonworks shared with me for Apache Hadoop are focused on industrial-strengthness, as are significant parts of “Phase 2”.*
- MapR tells a performance story versus generic Apache Hadoop HDFS and MapReduce. (One aspect of same is just C++ vs. Java.)
- So does Hadapt, but mainly vs. Hive.
- Cloudera also tells me there’s a potential 4-5X performance improvement in Hive coming down the pike from what amounts to an optimizer rewrite.
(Zettaset belongs in the discussion too, but made an unfortunate choice of embargo date.)
*Hortonworks, a new Hadoop company spun out of Yahoo, graciously permitted me to post a slide deck outlining an Apache Hadoop roadmap. Phase 1 refers to stuff that is underway more or less now. Phase 2 is scheduled for alpha in October, 2011, with production availability not too late in 2012.
You’ve probably heard some single point of failure fuss. The Hadoop NameNode can crash; that wouldn’t cause data loss, but it would shut down the cluster for a little while. It’s hard to come up with real-life stories in which this has been a problem; still, it’s something that should be fixed, and everybody (including the Apache Hadoop folks, as part of Phase 2) has a favored solution. A more serious problem is that Hadoop is currently bad for small updates, because:
- Hadoop’s fundamental paradigm assumes batch processing.
- Both major workarounds to allow small updates are broken:
- HBase is seriously buggy, to the point that it sometimes loses data.
- Storing each update in a separate file runs afoul of a practical limit of 70-100 million files, since the NameNode keeps all file metadata in memory. (See the sketch below.)
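To make the small-update contrast concrete, here is a minimal Java sketch of the two workarounds, assuming 2011-era Hadoop and HBase client APIs; the table, column family, and path names are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SmallUpdateWorkarounds {

    // Workaround 1: route the update through HBase, which buffers small
    // writes in memory and flushes them to HDFS in large files.
    // ("updates", "d", and "v" are hypothetical table/family/column names.)
    static void putViaHBase(Configuration conf, String rowKey, byte[] value)
            throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(conf), "updates");
        Put put = new Put(Bytes.toBytes(rowKey));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("v"), value);
        table.put(put);
        table.close();
    }

    // Workaround 2: write each update as its own tiny HDFS file. Every file
    // adds an entry to the NameNode's in-memory metadata, which is what
    // drives the practical ceiling of tens of millions of files.
    static void putAsTinyFile(Configuration conf, String updateId, byte[] value)
            throws Exception {
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/updates/" + updateId));
        out.write(value);
        out.close();
    }
}
```

The HBase route keeps file counts manageable but depends on HBase (and the HDFS underpinnings beneath it) working reliably; the one-file-per-update route hits the metadata ceiling long before the data itself gets large.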
File-count limits also get blamed for a second problem: there may not be enough intermediate files available for your Reduce steps, necessitating awkward and perhaps poorly-performing MapReduce workarounds. Anyhow, the Phase 2 Apache Hadoop roadmap features a serious HBase rewrite. I’m less clear as to where things stand with respect to file-count limits.
Edits: As per the comments below, I should perhaps have referred to HBase’s HDFS underpinnings rather than HBase itself. Anyhow, some details are in the slides. Please also see my follow-up post on how well HBase is indeed doing.
The other big area for Hadoop improvement is modularity, pluggability, and coexistence, on both the storage and application execution tiers. For example:
- Greenplum/MapR and Hadapt both think you should have HDFS file management and a relational DBMS coexisting on the same storage nodes. (I agree.)
- Part of what Hortonworks calls “Phase 2” sets out to ensure that Hadoop can properly manage temp space and so on next to HDFS.
- Perhaps HBase won’t always assume HDFS.
- DataStax thinks you should blend HDFS and Cassandra.
Meanwhile, Pig and Hive need to come closer together. Often you want to stream data into Hadoop. The argument that MPI trumps MapReduce does, in certain use cases, make sense. Apache Hadoop “Phase 2” and beyond are charted to accommodate some of those possibilities too.