There’s been a flurry of announcements recently in the Hadoop world. Much of it has been concentrated on Hadoop data storage and management. This is understandable, since HDFS (Hadoop Distributed File System) is quite a young (i.e. immature) system, with much strengthening and Bottleneck Whack-A-Mole remaining in its future.
Known HDFS and Hadoop data storage and management issues include but are not limited to:
- Hadoop is coordinated by a master node, specifically the HDFS namenode, which is a single point of failure.
- HDFS compression could be better.
- HDFS likes to store three copies of everything, whereas many DBMS and file systems are satisfied with two.
- Hive (the canonical way to do SQL joins and so on in Hadoop) is slow.
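To put the three-copies point in concrete terms, here is a minimal back-of-the-envelope sketch. The 100 TB figure and the function name are hypothetical, chosen only for illustration, and the sketch ignores compression, temporary files, and any other overhead.

```python
def raw_storage_tb(logical_tb: float, copies: int) -> float:
    """Raw disk needed when every block of data is stored `copies` times."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return logical_tb * copies


logical_tb = 100.0  # hypothetical data set size, not a figure from any vendor

three_copy = raw_storage_tb(logical_tb, 3)  # HDFS-style triple storage
two_copy = raw_storage_tb(logical_tb, 2)    # the two copies many systems settle for
extra = three_copy - two_copy               # additional raw disk the third copy costs

print(f"3 copies: {three_copy} TB, 2 copies: {two_copy} TB, difference: {extra} TB")
```

Under those assumptions, the third copy costs as much raw disk again as the data set itself.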
Different entities have different ideas about how such deficiencies should be addressed.
For most practical purposes, Yahoo’s and IBM’s views about Hadoop have converged. Yahoo and IBM both believe that Hadoop data storage should be advanced solely through the Apache Hadoop open source process. In particular:
- IBM and Yahoo both talk of the great undesirability of Hadoop “forking” like Unix did.
- Yahoo appeared on stage at IBM’s analyst event this week to reinforce the meeting-of-the-minds, even though there’s no IBM/Yahoo customer relationship involved.
- IBM has disclaimed any intention of providing its own Hadoop distribution, but even so is committed to selling lots of IBM InfoSphere BigInsights, which incorporates Apache Hadoop.*
- Yahoo has stopped offering its own Hadoop distribution, period.
*IBM is emphatic about ruling out marketing terms whose connotation it doesn’t like. IBM’s Hadoop distribution isn’t a “distribution,” because that might make it sound too proprietary; IBM’s Oracle emulation offering isn’t an “emulation” offering, because that might make it sound too slow; and IBM’s CEP product InfoSphere Streams isn’t a “CEP” product, because that might make it sound too non-functional.
Cloudera can probably be regarded as part of the Yahoo/IBM camp, some stern looks from IBM in Cloudera’s direction notwithstanding. Cloudera Enterprise, also an embrace-and-extend offering, remains the obvious choice for enterprise Hadoop users; meanwhile, nobody has convinced me of any bogosity in the “no forking” claim Cloudera makes for its free/open source Hadoop distribution. Indeed, when I visited Cloudera a couple of weeks ago, Mike Olson showed me a slide demonstrating that Cloudera might be supplanting Yahoo as the biggest ongoing contributor to Apache Hadoop.
EMC’s Data Computing Division, née Greenplum, made a lot of Hadoop noise this week. Unlike Yahoo, IBM, and Cloudera, EMC really is forking Hadoop. I’m not talking with the EMC/Greenplum folks these days, but the whole thing was covered from various angles by Lucas Mearian, Doug Henschen, Derrick Harris, and Dave Menninger.
Another option is to entirely replace HDFS with a DBMS, whether distributed or just instanced at each node. DataStax is doing that with Cassandra-based Brisk; Hadapt plans to do that with PostgreSQL and VectorWise (edit: As per the comment below, Hadapt only plans a partial replacement of HDFS); and Netezza’s analytic platform has a Hadoop-over-Netezza option as well. Mike Olson objects to such implementations being called “Hadoop”; but trademark issues aside, those vendors plan to support a broad variety of Hadoop-compatible tools. Aster Data has long taken that approach one step further, by offering an enhanced version of MapReduce — aka SQL/MapReduce — over its nCluster DBMS. And 10gen offers a more primitive form of MapReduce with MongoDB, but probably wouldn’t position it as addressing a “MapReduce market” at all.