It turns out that my impression that HBase is broken was unfounded, in at least two ways. The smaller reason is that something wrong with the HBase/Hadoop interface or Hadoop’s HBase support cannot necessarily be said to be wrong with HBase (especially since HBase is no longer a Hadoop subproject). The bigger reason is that, according to consensus, HBase has worked pretty well since the 0.90 release in January of this year.
After Michael Stack of StumbleUpon beat me up for a while,* Omer Trajman of Cloudera was kind enough to walk me through HBase usage. He is informed largely by 18 Cloudera customers using HBase, plus a handful of other well-known HBase users such as Facebook, StumbleUpon, and Yahoo. Of the 18 Cloudera customers that Omer was thinking of, 15 are in HBase production, one is in HBase “early production”, one is still doing R&D in the area of HBase, and one is a classified government customer not providing such details.
*Just kidding — he was actually extremely gentle.
In the use cases that Omer offered, what’s stored in HBase is almost always records of web or network activity. Specific examples included clickstream information (at 5 different ad companies), crash reports (at Mozilla), and messages (at Facebook). Sometimes the data gets into Hadoop twice — once excerpted via HBase and once as part of a full log — and may even live in two different Hadoop clusters.
What’s served out from HBase in Omer’s examples is usually derived data, such as a user profile, an ad selection, a text index, etc. That makes sense, not least because if you’re going to keep enhancing your data, schema-free programming — which HBase offers — looks ever more appealing. Omer further said that there are a growing number of cases in which HBase is being used to serve up reference data for batch MapReduce jobs, but he didn’t have specifics. A counterexample to the derived data emphasis would be, if I understood correctly, a case where HBase manages shopping carts.
I haven’t put much effort into unearthing open source or other third-party HBase-based projects, but two examples are OpenTSDB (Open Time Series DataBase) and Lily CMS (Content Management System). (Edit: But see the comment about Lily below.)
Omer is perhaps my top go-to guy on database and cluster sizes, so of course I asked him for HBase metrics as well. He responded (approximately) that Cloudera HBase customer installations average 20-30 nodes, but that half a dozen are in the 100-200 node range.
Finally, there’s the matter of latency. As a general rule, the HBase users Omer sees are using HBase with at least several minutes of latency. (Again, that shopping cart case would seem to be a counterexample.) So, for example, the data recorded when you click on a page isn’t immediately applied toward tweaking your profile to determine which ad you’ll see next, but it might come into play after you spend a few minutes reading the page you’re on. Naturally, Omer knows of efforts to use HBase with lower latency yet, and I won’t be surprised if already-working examples of low-latency HBase show up in the comment thread to this post.