In my recent series of Hadoop posts, there were several cases where I had to choose between recommending that enterprises:
- Go with the most advanced features any vendor was credibly advocating.
- Be more cautious, and only adopt features that have been solidly proven in the field.
I favored the more advanced features each time. Here’s why.
To a first approximation, I divide Hadoop use cases into two major buckets, only one of which I was addressing with my comments:
1. Analytic data management.* Here I favored features over reliability because they are more important, for Hadoop as for analytic RDBMS before it. When somebody complains about an analytic data store not being ready for prime time, never really working, or causing them to tear their hair out, what they usually mean is that:
- It couldn’t do the work that needed doing …
- … with reasonable performance and turnaround time …
- … without undue effort in administration and/or programming.
Those complaints are much, much more frequent than “It crashed”. So it was for Netezza, DATAllegro, Greenplum, Aster Data, Vertica, Infobright, et al. So it also is for Hadoop. And how does one address those complaints? With performance and feature enhancements, of the kind the Hadoop community is introducing at high speed.
*When I refer to Hadoop being used for analytic data management, I mean that a bunch of data gets dumped into it, which may be either analyzed in situ or else massaged and summarized to be forwarded to an analytic RDBMS.
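That massage-and-summarize flow can be sketched in Hadoop Streaming style: a mapper extracts a grouping key from each raw event line, and a reducer counts per key, yielding rows ready to load into an analytic RDBMS. This is an illustrative sketch in plain Python with made-up field names, not code from any particular Hadoop distribution.

```python
# Hypothetical massage-and-summarize step, Hadoop Streaming style.
# Raw event lines ("timestamp<TAB>user<TAB>page") are reduced to
# (date, page) -> count rows suitable for loading into an analytic RDBMS.
from collections import Counter

def map_events(lines):
    """Mapper: extract a (date, page) key from each raw event line."""
    for line in lines:
        timestamp, _user, page = line.rstrip("\n").split("\t")
        date = timestamp[:10]  # keep just YYYY-MM-DD
        yield (date, page)

def reduce_counts(keys):
    """Reducer: count occurrences of each (date, page) key."""
    return Counter(keys)

def summarize(lines):
    """End-to-end: summary rows ready to INSERT into an RDBMS table."""
    counts = reduce_counts(map_events(lines))
    return sorted((date, page, n) for (date, page), n in counts.items())

raw = [
    "2012-04-01T09:00:00\talice\t/home",
    "2012-04-01T09:05:00\tbob\t/home",
    "2012-04-02T10:00:00\talice\t/pricing",
]
print(summarize(raw))
# [('2012-04-01', '/home', 2), ('2012-04-02', '/pricing', 1)]
```

In a real deployment the mapper and reducer would run as separate Hadoop Streaming stages over HDFS files; the point is only that the output is a small, structured summary, which is what gets forwarded downstream.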
2. HBase-led. For a short-request DBMS, I indeed take the stance “First, let’s not lose the data.” But I doubt many enterprises are using HBase in production right now unless they’re watching the community development process very closely. I.e., they’re making their own decisions, and they aren’t really who I had in mind when I was offering advice.
If I’m wrong in all this, it would be because I’m lumping too many things together in “Hadoop-based analytic data management”, and some of them do indeed require a high degree of reliability. That’s exactly the argument Hortonworks made in some of its pushback. Namely, they think enterprises are already adopting Hadoop as part of repeatable, production ETL (Extract/Transform/Load) processes, and those processes require rather stable software. They may not be claiming that their version of Hadoop is as stable as Informatica or Teradata, but that’s the kind of environment they want to be playing in.
But you know what? In support of that kind of capability, Hortonworks wants enterprises to adopt the new and unproven HCatalog. I suspect they’re right to do so. And so we have another illustration of my thesis:
We’re still at the point in Hadoop use where “unquestionably stable” is a nice-to-have, not a must-have. The features themselves are still more crucial.