I wasn’t the only one to be dubious about Forrester Research’s Hadoop taxonomy (or lack thereof). GigaOm’s Derrick Harris was as well, and offered a far superior approach of his own. In Derrick’s view, there’s Hadoop, Hadoop distributions, Hadoop management, and Hadoop applications. Taking those out of order, and recalling that no market categorization is ever precise:
- “Hadoop applications” is a catch-all category. Since Derrick offered suitable caveats around the label, I’m fine with what he said.
- Hadoop management software commonly comes in the form of suites. Derrick’s discussion was solid.
- Derrick seems to want to define “Hadoop” as being whatever is in the relevant Apache projects. Cool. He does seem to wind up on both sides of the “MapR and DataStax put Hadoop MapReduce on top of something that isn’t HDFS — so is that Hadoop or isn’t it?” question, but that’s a tough ambiguity to avoid.
- Derrick could have been a little clearer on the subject of Hadoop distributions.
Let’s drill down into that last one. Derrick refers to Hadoop distributions as “products” that:
“package a set of Hadoop projects (MapReduce, Hive, Sqoop, Pig, etc.) in a way that in theory makes them integrate more naturally, and to run both smoothly and securely.”
While that’s a reasonable recitation of the idea’s benefits, I’d rather say that a “distribution” of open source software comprises:
- Open source software, in selected versions.
- (Possibly) additional code.
- (Likely) documentation.
- (Possibly) legal assurances such as intellectual property indemnification.
In the case of Hadoop:
- The version selection is a relatively big deal. There are a lot of Hadoop sub-projects. There’s been some splitting and forking and recombination. Testing a specific set of point releases for integration and bugs is a non-trivial user benefit.
- The additional code is generally focused on installation and similar setup chores, because anything more ambitious is bundled into separately identified management software. Even so, because of the large number of moving parts, this is a good thing to have.
- What’s more, in the case of Cloudera, using a particular distribution (theirs) is a prerequisite to getting the most widely adopted Hadoop management software (also theirs), which in turn is required if you want the industry’s most widely adopted Hadoop support (ditto). Similar things are apt to be true of rival distributions.
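For readers who think better in code, the four-part definition of a “distribution” above can be sketched as a small data structure. This is only an illustration, not how any vendor actually packages things: the class name `Distribution`, its field names, the `mismatches` helper, and every component version shown are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Distribution:
    """Hypothetical model of an open source software distribution."""

    name: str
    # Open source software, in selected versions (component -> version).
    pinned_versions: Dict[str, str]
    # (Possibly) additional code, e.g. installers.
    additional_code: List[str] = field(default_factory=list)
    # (Likely) documentation.
    documentation: Optional[str] = None
    # (Possibly) legal assurances such as IP indemnification.
    legal_assurances: List[str] = field(default_factory=list)

    def mismatches(
        self, installed: Dict[str, str]
    ) -> Dict[str, Tuple[str, Optional[str]]]:
        """Return components whose installed version differs from the
        tested one, as component -> (expected, found_or_None)."""
        return {
            component: (expected, installed.get(component))
            for component, expected in self.pinned_versions.items()
            if installed.get(component) != expected
        }


# Made-up version numbers, purely for illustration.
example = Distribution(
    name="example-distro",
    pinned_versions={"mapreduce": "1.0.0", "hive": "0.9.0", "pig": "0.10.0"},
    additional_code=["installer"],
    documentation="install-guide",
    legal_assurances=["ip-indemnification"],
)

# A cluster running a different Hive point release than the tested one.
drift = example.mismatches(
    {"mapreduce": "1.0.0", "hive": "0.8.1", "pig": "0.10.0"}
)
```

The point of the `mismatches` helper is the version-selection benefit: what a distribution really hands you is a set of point releases known to have been tested together, so anything outside that set is on you.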