June 19, 2012

“Enterprise-ready Hadoop”

This is part of a four-post series, covering:

The posts depend on each other in various ways.

Cloudera, Hortonworks, and MapR all claim, in effect, “Our version of Hadoop is enterprise-ready, unlike those other guys’.” I’m dubious.

That said, “enterprise-ready Hadoop” really is an important topic.

So what does it mean for something to be “enterprise-ready”, in whole or in part? Common themes in distinguishing between “enterprise-class” and other software include:

For Hadoop, as for most things, these concepts overlap in many ways.

There are two major kinds of usability issues in Hadoop:

As for data management features — Hadoop isn’t across-the-board competitive with analytic relational DBMS. (And the same goes for HBase vs. short-request alternatives.) But the real question is whether its features are good enough for a variety of important tasks. And to that, the answer at many enterprises is an emphatic Yes.

When it comes to integration:

Hadoop is already a good match for most enterprises’ buying practices. A thankfully large fraction of them are already content with open source (or open core) subscription models. For the rest, there are always options like the Oracle appliance. In connection with that, Cloudera has been providing enterprise Hadoop support for a while, and now Hortonworks is getting into the game as well.

And so we circle to the final point, which intersects with most of the others — “Is this new-fangled Hadoop stuff safe?”

The story on unplanned downtime goes something like this:

For the most part, Hadoop use cases are either HBase or batch. For enterprise batch use, Hadoop’s reliability should already be fine. As for HBase — well, I’m not sure most enterprises would bet all that much on a 0.92 open source project with so little vendor sponsorship.

As for planned Hadoop downtime — theoretically, there should be very little; if you have a lot, it’s because your management tools and processes aren’t ideal. Temporary performance surprises may be harder to avoid, however, since Hadoop concurrency and workload management are still rudimentary, pending the maturity of MapReduce 2.

Hadoop security still seems pretty basic. Kerberos got in about a year ago, but I’ve only heard about role-based security and so on in the context of HBase, and that only in the latest release.

And finally, for the gut-feel question of proven — I think Hadoop is proven indeed, whether in technology, vendor support, or user success. But some particularly conservative enterprises may for a while disagree.

Comments

7 Responses to ““Enterprise-ready Hadoop””

  1. Joe on June 19th, 2012 6:53 pm

    Curt, fantastic post. Some comments:

    “Anyhow, requiring a dedicated cluster isn’t a deal-breaker.” Can you elaborate here? From how I’m interpreting it, I’m not so sure I agree. You call out the DataStax and Hadapt models (which are distinctly different from one another, but, is anyone actually using either?) – and I’d lump HBase region servers in there as well – but even so, I haven’t seen anyone running a TaskTracker on their Tomcat or WebSphere servers. Have you? Would it not follow that Hadoop-ish clusters are thus ‘dedicated’? Even if they are, why is that a barrier to Enterprise adoption? Enterprises provision all sorts of stuff all the time (database appliances, for a contemporary example…).

    I agree that Hadoop doesn’t crash ‘all that much’, but on each “distribution has an answer for Hadoop’s most famous single point of failure, the one at NameNode.” – at Hadoop Summit last week, Facebook attributed roughly 10% of their HDFS failures to NameNode HA issues (they have a solution too – if their solution didn’t exist, they’d go down 10% more of the time.) Go figure.

    On HBase intra-DC “replication”, I find this post particularly useful in explaining how it works today and what the design assumptions are: http://www.cloudera.com/blog/2012/06/online-hbase-backups-with-copytable-2/

    “Hadoop use cases are either HBase or batch” – I assume you mean what’s in the Apache Hadoop project, strictly speaking. Hadoop (HDFS+MR/YARN) is used in conjunction with ‘real-time’ data ingestion and analysis techniques left and right. If you consider Hadoop itself aside from these complementary building blocks, then yes, Hadoop is batch unless it’s HBase.

  2. Hadoop distributions: CDH 4, HDP 1, Hadoop 2.0, Hadoop 1.0 and all that | DBMS 2 : DataBase Management System Services on June 20th, 2012 3:59 am

    […] In general, how “enterprise-ready” is Hadoop? […]

  3. Curt Monash on June 20th, 2012 4:04 am

    Joe,

    I’m not disputing that Hadoop (usually) needs a dedicated cluster. I’m just saying that that need isn’t some kind of deficiency in enterprise-readiness.

  4. Jim Walker on June 22nd, 2012 3:52 pm

    I am not sure we (Hortonworks) agree that HDP1 has “considerably” fewer features. First, thank you for adding the note about HCatalog, but it is also important to note that we provide WebHDFS and data integration (via Talend). Yes, you could download TOS4BD directly from Talend and use it with any distribution, but the level of integration with HDP is deeper than with others. The technical relationship allowed us to share development and harden their support of HCatalog and Oozie with our engineering and test teams. The same is true with other partners who have chosen HDP because it allows for deeper integration with their offerings.

    If we compare the two distributions, they have very similar components. However, if you extend this to compare what is available for free open source download, I would say we are ahead. The Cloudera management tool requires a license. The Hortonworks Management Center is part of the core HDP download and is 100% open source.

  5. Curt Monash on the Enterprise-Readiness of Hadoop | Getting Connected on June 27th, 2012 2:05 pm

    […] Monash, writing on The DBMS2 blog, addressed the enterprise readiness of Hadoop […]

  6. Paul Johnson on June 29th, 2012 8:28 am

    The core premise of Hadoop is to enable complex analytics (i.e. not ‘just’ SQL queries) to happen at scale without breaking the bank, through the use of open source software and clustering potentially lots of commodity tin.

    There is no doubt that this is a paradigm shift in the world of analytics, possibly the biggest since MPP databases came on the scene.

    For those enterprises that don’t have a requirement of sufficient size or complexity for which Hadoop is the answer, ‘enterprise readiness’ is a moot point.

    Hadoop is likely to be a solution looking for a problem that doesn’t exist in many enterprises.

    Inappropriate Hadoop adoption is a bigger issue than concerns over enterprise-readiness, mainly due to folks wanting to jump on the ‘big data’ bandwagon and the relatively low barriers to entry.

  7. Curt Monash on June 29th, 2012 1:57 pm

    Paul,

    Besides being (arguably) cheap, Hadoop is a highly flexible ETL tool. Dynamic schemas can make sense even in relatively low-volume use cases.
