This is part of a four-post series, covering:
- Annoying Hadoop marketing themes that should be ignored.
- Hadoop versions and distributions, and their readiness or lack thereof for production.
- In general, how “enterprise-ready” is Hadoop?
- HBase 0.92 (this post)
As part of my recent round of Hadoop research, I talked with Cloudera’s Todd Lipcon. Naturally, one of the subjects was HBase, and specifically HBase 0.92. I gather that the major themes of HBase 0.92 are:
- Performance, scalability, and so on.
- “Coprocessors”, which are like triggers or stored procedures.
- Security, as the first major application of coprocessors.
HBase coprocessors are Java code that links straight into HBase. As with other DBMS extensions of the “links straight into the DBMS code” kind,* HBase coprocessors seem best suited for very sophisticated users and third parties.** Evidently, coprocessors have already been used to make HBase security more granular — role-based, per-column-family/per-table, etc. Further, Todd thinks coprocessors could serve as a good basis for future HBase enhancements in areas such as aggregation or secondary indexing.
*Examples include unfenced C++ extensions to analytic RDBMS or (which mattered more in the 1990s than now) “blade”/“cartridge”/datatype extensions to extensible RDBMS such as Illustra, Informix, Oracle, or DB2.
**Admittedly, in the current HBase community, a considerable fraction of user organizations fit the “very sophisticated”/co-developer template.
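To make the trigger analogy concrete, here is a toy sketch of the general idea: user code registered with the data store, run in-process before each write, and able to veto it. This is not the HBase coprocessor API. Every name in it (`WriteObserver`, `AccessCheckObserver`, `ToyStore`, `prePut`) is hypothetical and invented for illustration, and the per-column-family grant check is only a rough stand-in for the kind of granular security described above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: these are NOT the real HBase coprocessor interfaces.
public class CoprocessorSketch {

    /** Hypothetical hook, called by the store before a write is applied. */
    interface WriteObserver {
        void prePut(String user, String table, String family, String row, String value);
    }

    /** Hypothetical observer: allows a write only if (user, table, family) was granted. */
    static class AccessCheckObserver implements WriteObserver {
        private final Set<String> grants = new HashSet<>();

        void grant(String user, String table, String family) {
            grants.add(key(user, table, family));
        }

        private static String key(String user, String table, String family) {
            return user + "|" + table + "|" + family;
        }

        @Override
        public void prePut(String user, String table, String family, String row, String value) {
            if (!grants.contains(key(user, table, family))) {
                throw new SecurityException(user + " may not write " + table + ":" + family);
            }
        }
    }

    /** Minimal in-memory stand-in for a server that runs observers in its own process. */
    static class ToyStore {
        private final List<WriteObserver> observers = new ArrayList<>();
        private final Map<String, String> cells = new HashMap<>();

        void register(WriteObserver observer) {
            observers.add(observer);
        }

        void put(String user, String table, String family, String row, String value) {
            for (WriteObserver observer : observers) {
                // Any observer can veto the write by throwing.
                observer.prePut(user, table, family, row, value);
            }
            cells.put(table + "/" + row + "/" + family, value);
        }

        String get(String table, String family, String row) {
            return cells.get(table + "/" + row + "/" + family);
        }
    }

    public static void main(String[] args) {
        AccessCheckObserver acl = new AccessCheckObserver();
        acl.grant("alice", "orders", "details");

        ToyStore store = new ToyStore();
        store.register(acl);

        store.put("alice", "orders", "details", "row1", "widget");
        System.out.println(store.get("orders", "details", "row1"));

        try {
            store.put("bob", "orders", "details", "row2", "gadget");
        } catch (SecurityException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The sketch also shows why this style of extension suits sophisticated users: the observer runs inside the store's own write path, so a slow or buggy hook affects every write.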
As for scalability and performance, it seems the advances there match clichés such as “low-hanging fruit” or Bottleneck Whack-a-Mole.
- HBase b-trees used to be restricted to two levels; now they aren’t.
- Replication among data centers has been strengthened (I eventually hear that about most NoSQL projects that aren’t Cassandra).
- HBase inherits some performance improvements in Hadoop itself.
Overall, Todd says several tests have indicated HBase performance improvements of 60% or better, with some particular cases of course going much higher (up to 2 1/2X).
My whole HBase discussion with Todd was pretty short, actually; just one of several subjects in a one-hour call. But we did squeeze in one topic that wasn’t 0.92-specific — namely, what does HBase storage tend to be like? Notes on that included:
- HBase working sets are commonly in RAM, or else have cache hit ratios in at least the 60-80% range.
- Solid-state memory isn’t generally used for HBase persistence. Small fast disks are beginning to appear.
- When you do short-request and MapReduce processing against the same HBase database, the MapReduce part is usually still done using cheaper disks.