Hadoop is immature technology. As such, it naturally offers much room for improvement in both industrial-strengthness and performance. And since Hadoop is booming, multiple efforts are underway to fill those gaps. For example:
- Cloudera’s proprietary code is focused on management, set-up, etc.
- The “Phase 1” plans Hortonworks shared with me for Apache Hadoop are focused on industrial-strengthness, as are significant parts of “Phase 2”.*
- MapR tells a performance story versus generic Apache Hadoop HDFS and MapReduce. (One aspect of same is just C++ vs. Java.)
- So does Hadapt, but mainly vs. Hive.
- Cloudera also tells me there’s a potential 4-5X performance improvement in Hive coming down the pike from what amounts to an optimizer rewrite.
(Zettaset belongs in the discussion too, but made an unfortunate choice of embargo date.)
*Hortonworks, a new Hadoop company spun out of Yahoo, graciously permitted me to post a slide deck outlining an Apache Hadoop roadmap. Phase 1 refers to stuff that is underway more or less now. Phase 2 is scheduled for alpha in October, 2011, with production availability not too late in 2012.
You’ve probably heard some single point of failure fuss. The Hadoop NameNode can crash; that wouldn’t cause data loss, but it would shut down the cluster for a little while. It’s hard to come up with real-life stories in which this has been a problem; still, it’s something that should be fixed, and everybody (including the Apache Hadoop folks, as part of Phase 2) has a favored solution. A more serious problem is that Hadoop is currently bad for small updates, because:
- Hadoop’s fundamental paradigm assumes batch processing.
- Both major workarounds to allow small updates are broken:
- HBase is seriously buggy, to the point that it sometimes loses data.
- Storing each update in a separate file runs afoul of a practical limit of 70-100 million files, since the NameNode keeps all file metadata in memory. (See the sketch below.)
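To make the small-update contrast concrete, here is a minimal Java sketch of the two workarounds, assuming 2011-era Hadoop and HBase client APIs; the table, column family, and path names are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SmallUpdateWorkarounds {

    // Workaround 1: route the update through HBase, which buffers small
    // writes in memory and flushes them to HDFS in large files.
    // ("updates", "d", and "v" are hypothetical table/family/column names.)
    static void putViaHBase(Configuration conf, String rowKey, byte[] value)
            throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(conf), "updates");
        Put put = new Put(Bytes.toBytes(rowKey));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("v"), value);
        table.put(put);
        table.close();
    }

    // Workaround 2: write each update as its own tiny HDFS file. Every file
    // adds an entry to the NameNode's in-memory metadata, which is what
    // drives the practical ceiling of tens of millions of files.
    static void putAsTinyFile(Configuration conf, String updateId, byte[] value)
            throws Exception {
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/updates/" + updateId));
        out.write(value);
        out.close();
    }
}
```

The HBase route keeps file counts manageable but depends on HBase (and the HDFS underpinnings beneath it) working reliably; the one-file-per-update route hits the metadata ceiling long before the data itself gets large.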
File-count limits also get blamed for a second problem: there may not be enough intermediate files available for your Reduce steps, necessitating awkward and perhaps poorly-performing MapReduce workarounds. Anyhow, the Phase 2 Apache Hadoop roadmap features a serious HBase rewrite. I’m less clear as to where things stand with respect to file-count limits.
Edits: As per the comments below, I should perhaps have referred to HBase’s HDFS underpinnings rather than HBase itself. Anyhow, some details are in the slides. Please also see my follow-up post on how well HBase is indeed doing.
The other big area for Hadoop improvement is modularity, pluggability, and coexistence, on both the storage and application execution tiers. For example:
- Greenplum/MapR and Hadapt both think you should have HDFS file management and a relational DBMS coexisting on the same storage nodes. (I agree.)
- Part of what Hortonworks calls “Phase 2” sets out to ensure that Hadoop can properly manage temp space and so on next to HDFS.
- Perhaps HBase won’t always assume HDFS.
- DataStax thinks you should blend HDFS and Cassandra.
Meanwhile, Pig and Hive need to come closer together. Often you want to stream data into Hadoop. The argument that MPI trumps MapReduce does, in certain use cases, make sense. Apache Hadoop “Phase 2” and beyond are charted to accommodate some of those possibilities too.