This is part of a four-post series, covering:
- Annoying Hadoop marketing themes that should be ignored.
- Hadoop versions and distributions, and their readiness or lack thereof for production (this post).
- In general, how “enterprise-ready” is Hadoop?
- HBase 0.92.
The posts depend on each other in various ways.
My clients at Cloudera and Hortonworks have somewhat different views as to the maturity of various pieces of Hadoop technology. In particular:
- Cloudera introduced CDH 4* and Hortonworks introduced HDP 1*, both timed for the recent Hadoop Summit.
- CDH 4 is based mainly on Hadoop 2.0, which Cloudera says it has tested extensively.
- HDP 1 is based on Hadoop 1.0, on the theory that nobody has properly tested Hadoop 2.0, which is still characterized as “alpha”.
- CDH 4 boasts sub-second NameNode failover.
- Hortonworks is partnering with third parties such as VMware to address the high-availability problems caused by failover potentially taking several minutes.
- Hadoop 2.0 and CDH 4 also incorporate improvements to NameNode scalability, HDFS (Hadoop Distributed File System) performance, HBase performance, and HBase functionality.
- Like CDH 4, HDP 1 includes HCatalog, an extension of Hive technology that serves as a more general metadata store. (Edit: Actually, see the comment thread below.)
- Hortonworks thinks HCatalog is a big deal in improving Hadoop data management and connectivity, and already has a Talend partnership based on HCatalog. Cloudera is less sure, especially in HCatalog’s current form.
- HDP 1 includes Ambari, an Apache open source competitor to Cloudera Manager (the closed-source part of Cloudera Enterprise). Hortonworks concedes a functionality gap between Ambari and Cloudera Manager, but perhaps a smaller one than Cloudera sees.
- Hortonworks thinks Ambari being open source means better integration with other management platforms. Cloudera touts the integration features and integrations of Cloudera Manager 4.
- Nobody seems confident that MapReduce 2 is ready for prime time. While it’s in CDH 4, so is MapReduce 1.
*“CDH” stands, due to some trademarking weirdness, for “Cloudera’s Distribution including Apache Hadoop”. “HDP” stands for “Hortonworks Data Platform”.
The whole thing seems like a big example of Miles’ Law: Where you stand depends upon where you sit. Cloudera’s embrace of more advanced Apache Hadoop technology is accompanied by claims such as “We built a lot of this ourselves” and “We’ve already tested this stuff at length.” I find Cloudera’s claims credible, and look forward to Hortonworks’ near-future declarations that those Hadoop 2.0 features are “now” enterprise-ready.
For HCatalog, however, the positions are reversed.
For now, my views on selecting Hadoop distributions start:
- For most enterprises, the Hadoop distribution you should go with is still CDH.
- I think Cloudera and Hortonworks are headed for a duopoly in general-purpose Hadoop distributions, and Hortonworks may achieve rough parity sooner than Cloudera likes. But at the moment Cloudera still seems well ahead.
- The same partners who root for Hortonworks to beat Cloudera also point out that they have worked with Cloudera for longer than Hortonworks has even existed. So while those partners make it plausible that Hortonworks will catch up with Cloudera in the future, they don’t show a Hortonworks advantage at this time.
- I think it’s already too late in the history of Hadoop to commit to other variants, such as MapR. But there can be credible and useful claims of Hadoop functionality in products like, for example, the DataStax/Cassandra stack.
- The wild card here is Amazon, which in some ways can be said to have majority Hadoop market share all by itself. One of the week’s announcements was some kind of optional integration between MapR and Elastic MapReduce.