July 18, 2011

HBase is not broken

It turns out that my impression that HBase is broken was unfounded, in at least two ways. The smaller reason is that something wrong with the HBase/Hadoop interface or Hadoop’s HBase support cannot necessarily be said to be wrong with HBase (especially since HBase is no longer a Hadoop subproject). The bigger reason is that, according to consensus, HBase has worked pretty well since the 0.90 release in January of this year.

After Michael Stack of StumbleUpon beat me up for a while,* Omer Trajman of Cloudera was kind enough to walk me through HBase usage. He is informed largely by 18 Cloudera customers, plus a handful of other well-known HBase users such as Facebook, StumbleUpon, and Yahoo. Of the 18 Cloudera customers using HBase that Omer was thinking of, 15 are in HBase production, one is in HBase “early production”, one is still doing R&D in the area of HBase, and one is a classified government customer not providing such details.

*Just kidding — he was actually extremely gentle.

In the use cases that Omer offered, what’s stored in HBase is almost always records of web or network activity. Specific examples included clickstream information (at 5 different ad companies), crash reports (at Mozilla), and messages (at Facebook). Sometimes the data gets into Hadoop twice — once excerpted via HBase and once as part of a full log — and may even live in two different Hadoop clusters.

What’s served out from HBase in Omer’s examples is usually derived data, such as a user profile, an ad selection, a text index, etc. That makes sense, not least because if you’re going to keep enhancing your data, schema-free programming — which HBase offers — looks ever more appealing. Omer further said that there are a growing number of cases in which HBase is being used to serve up reference data for batch MapReduce jobs, but he didn’t have specifics. A counterexample to the derived data emphasis would be, if I understood correctly, a case where HBase manages shopping carts.

I haven’t put much effort into unearthing open source or other third-party HBase-based projects, but two examples are OpenTSDB (Time Series DataBase) and Lily CMS (Content Management Systems). (Edit: But see the comment about Lily below.)

Omer is perhaps my top go-to guy on database and cluster sizes, so of course I asked him for HBase metrics as well. He responded (approximately) that Cloudera HBase customer installations average 20-30 nodes, but that half a dozen are in the 100-200 node range.

Finally, there’s the matter of latency. As a general rule, the HBase users Omer sees are using HBase with at least several minutes of latency. (Again, that shopping cart case would seem to be a counterexample.) So, for example, the data recorded when you click on a page isn’t immediately applied toward tweaking your profile to determine which ad you’ll see next, but it might come into play after you spend a few minutes reading the page you’re on. Naturally, Omer knows of efforts to use HBase with still lower latency, and I won’t be surprised if already-working examples of low-latency HBase show up in the comment thread to this post.

Categories: Cloudera, Derived data, Facebook, Hadoop, HBase, Log analysis, Market share and customer counts, Open source, Specific users, Web analytics

Comments

6 Responses to “HBase is not broken”

Hadoop futures and enhancements | DBMS 2 : DataBase Management System Services on July 18th, 2011 12:29 am

[…] Edits: As per the comments below, I should perhaps have referred to HBase’s HDFS underpinnings rather than HBase itself. Anyhow, some details are in the slides. Please also see my follow-up post on how well HBase is indeed doing. […]

Steven Noels on July 18th, 2011 7:46 am

Hey Curt,

thanks for mentioning Lily in passing. Seems like we haven’t got rid of the ‘CMS’ moniker everywhere – as we have been (re-)positioning Lily as a data platform recently, but indeed our heritage lies in content management. I find your analysis on the ‘in’ vs. ‘out’ interesting. We offer an easy-to-use API and content model on top of HBase, with more user-friendly datatypes, and automatic indexing into Solr (and soon Elastic Search). We make HBase easier to use for mundane enterprise developers, but have been really happy with HBase itself for almost two years now. It’s stable, quite predictable, and the dev community is absolutely great.

HBase is getting a lot of traction now, and I think there are two reasons behind that: (1) the fundamentals (the underlying model and such) are sound and well-regarded, and inspired by the Google Bigtable model, and (2) there’s a decent spread of core devs amongst a relevant set of key users. It is reassuring that people from FB, SU, Cloudera and TrendMicro are working together on HBase.

We’re very happy to piggyback on HBase with Lily. And it’s *not* a CMS. 🙂

Andrew Purtell on July 23rd, 2011 6:24 pm

The largest HBase cluster that I am aware of is Yahoo’s, at approximately 1000 nodes. Another large user, Facebook, uses a cellular architecture for their big HBase application with each cell sized at about 100 nodes, and many cells. YFrog serves a live site out of HBase using about 80 nodes. We (Trend Micro) run clusters of more enterprise-y dimension: 60 servers in production, 15 for staging, and we also launch on demand HBase clusters on EC2 for development and testing purposes. HBase services diverse use cases on clusters from 10 to 1000 nodes.

Currently we store mainly derived data in HBase, which feeds RESTful query interfaces as well as MapReduce jobs. But we have developed the Coprocessor framework, to be part of HBase 0.92, and are now reimagining our HBase installation as a data grid: one can store code with data, or code instead of data, and the HBase layer becomes an application fabric with persistence, fault tolerance, and transparent scalability built in.

Benoit Sigoure on July 24th, 2011 11:32 pm

As you mentioned, at StumbleUpon we use HBase in production. We use it both in a low-latency setting, where we query HBase to actually serve responses to our users, and in a batch-job / analytics kinda setting where throughput matters more than latency.

OpenTSDB is another example of an application that uses HBase with low latency requirements. System monitoring data that enters OpenTSDB is typically only a few seconds old, and can be served by OpenTSDB within less than a second, most of the time.

Curt Monash on July 26th, 2011 3:52 am

Thanks for the data points!!

Notes on the Hadoop and HBase markets : DBMS 2 : DataBase Management System Services on April 24th, 2012 4:40 am

[…] Over half of Cloudera’s customers use HBase, vs. a figure of 18+ last July. […]
