This is part of a four-post series, covering:
- Annoying Hadoop marketing themes that should be ignored.
- Hadoop versions and distributions, and their readiness or lack thereof for production.
- In general, how “enterprise-ready” is Hadoop (this post)?
- HBase 0.92.
The posts depend on each other in various ways.
Cloudera, Hortonworks, and MapR all claim, in effect, “Our version of Hadoop is enterprise-ready, unlike those other guys’.” I’m dubious.
- Hortonworks has considerably fewer features than Cloudera, along with less of a production or support track record. (Edit: HCatalog may be a significant exception.)
- I doubt Cloudera really believes or can support the apparent claim in its CDH 4 press release that Hadoop is now suitable for every enterprise, whereas last month it wasn’t.
- While MapR was early with some nice enterprise features, such as high availability or certain management UI elements — quickly imitated in Cloudera Enterprise — I don’t think it has any special status as “enterprise-ready” either.
That said, “enterprise-ready Hadoop” really is an important topic.
So what does it mean for something to be “enterprise-ready”, in whole or in part? Common themes in distinguishing between “enterprise-class” and other software include:
- Usable by our existing staff.
- Sufficiently feature-rich.
- Integrates well with the rest of our environment.
- Fits well into our purchasing and vendor relations model.
- Sufficiently reliable, proven, and secure — which is to say, “safe”.
For Hadoop, as for most things, these concepts overlap in many ways.
There are two major kinds of usability issues in Hadoop:
- Programming. Since the whole point of MapReduce is to make parallel programming only slightly harder than the ordinary stuff, I’d say Hadoop has been enterprise-ready in this respect since Day 1. Hadoop demands good programmers; but it doesn’t demand great ones.
- Administration. It would be nice if Hadoop administration tools combined the best features of tools used to manage scientific clusters, clustered relational databases, clustered storage systems, and networks. They have a ways to go. But I think we’re already at the point that general cluster management challenges shouldn’t be a barrier to adopting Hadoop.
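To make the “only slightly harder” point concrete, here’s a minimal single-process sketch of the classic MapReduce word-count idiom — plain Python rather than Hadoop’s actual Java API, and the `run_local` driver is just a stand-in for the distribution and shuffling the framework does for you:

```python
from collections import defaultdict
from itertools import chain

def map_fn(line):
    # Emit a (word, 1) pair for every word in one input line.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Sum the counts that the shuffle phase grouped under one key.
    return (word, sum(counts))

def run_local(lines):
    # Single-process stand-in for what Hadoop does across a cluster:
    # run all the maps, group emitted pairs by key, run the reduces.
    grouped = defaultdict(list)
    for key, value in chain.from_iterable(map_fn(l) for l in lines):
        grouped[key].append(value)
    return dict(reduce_fn(k, v) for k, v in grouped.items())

print(run_local(["to be or not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The programmer writes only the two small functions; everything hard about parallelism — partitioning, scheduling, fault tolerance — lives in the framework, which is exactly why merely good programmers suffice.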
As for data management features — Hadoop isn’t across-the-board competitive with analytic relational DBMS. (And the same goes for HBase vs. short-request alternatives.) But the real question is whether its features are good enough for a variety of important tasks. And to that, the answer at many enterprises is an emphatic Yes.
When it comes to integration:
- Hadoop generally runs on its own cluster, or in the public (generally Amazon) cloud, or in some cases on a cluster shared with another data management system. (E.g. DataStax/Cassandra, Hadapt/PostgreSQL, or IBM Netezza.) Anyhow, requiring a dedicated cluster isn’t a deal-breaker.
- Hadoop’s data integration/ETL story is already decent, and it’s getting better fast.
- Hadoop management tools are in the early days of being integrated into more general management tool environments. But I don’t see why the need for standalone management tools should be an enterprise deal-breaker.
- As for software running on top of Hadoop — pending future posts, I’ll just say that the ability to run anything analytic on Hadoop is being assembled fast, but performance is something that needs to be assessed on a case-by-case basis.
Hadoop is already a good match for most enterprises’ buying practices. A thankfully large fraction of them are already content with open source (or open core) subscription models. For the rest, there are always options like the Oracle appliance. In connection with that, Cloudera has been providing enterprise Hadoop support for a while, and now Hortonworks is getting into the game as well.
And so we circle to the final point, which intersects with most of the others — “Is this new-fangled Hadoop stuff safe?”
The story on unplanned downtime goes something like this:
- Hadoop has never crashed all that much.
- As of this month, pretty much anything that passes for a Hadoop distribution has an answer for Hadoop’s most famous single point of failure, the NameNode.
- HBase has added some capabilities in inter-data-center replication. (I’m not clear on the details.)
- Otherwise, formal disaster recovery for Hadoop seems more theoretical than practical.
For the most part, Hadoop use cases are either HBase or batch. For enterprise batch use, Hadoop’s reliability should already be fine. As for HBase — well, I’m not sure most enterprises would bet all that much on an 0.92 open source project with so little vendor sponsorship.
As for planned Hadoop downtime — theoretically, there should be very little; if you have a lot, it’s because your management tools and processes aren’t ideal. Temporary performance surprises may be harder to avoid, however, since Hadoop concurrency and workload management are still rudimentary, pending the maturity of MapReduce 2.
Hadoop security still seems pretty basic. Kerberos got in about a year ago, but I’ve only heard about role-based security and so on in the context of HBase, and that only in the latest release.
And finally, for the gut-feel question of proven — I think Hadoop is proven indeed, whether in technology, vendor support, or user success. But some particularly conservative enterprises may for a while disagree.