March 17, 2015

More notes on HBase

1. Continuing from last week’s HBase post, the Cloudera folks were fairly proud of HBase’s features for performance and scalability. Indeed, they suggested that use cases which were a good technical match for HBase were those that required fast random reads and writes with high concurrency and strict consistency. Some of the HBase architecture for query performance seems to be:

Notwithstanding that a couple of those features sound like they might help with analytic queries, the base expectation is that you’ll periodically massage your HBase data into a more analytically-oriented form. For example — I was talking with Cloudera after all — you could put it into Parquet.

2. The discussion of which kinds of data are originally put into HBase was a bit confusing.

OpenTSDB, by the way, likes to store detailed data and aggregates side-by-side, which resembles a pattern I discussed in my recent BI for NoSQL post.

3. HBase supports caching, tiered storage, and so on. Cloudera is pretty sure that it is publicly known (I presume from blog posts or conference talks) that: 

Cloudera also told me of a Very Famous Company that has many 100s of HBase nodes managing petabytes of mobile device data. That sounds like multiple terabytes per node even before considering a replication factor, so I presume it’s disk-based as well. The takeaway from those examples, other than general big-use-case impressiveness, is that storage choices for HBase can vary greatly by user and application.
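The per-node estimate above is simple division, and a back-of-envelope sketch makes the reasoning explicit. Every figure below is an assumption chosen for illustration, since the post says only "petabytes" and "many 100s" of nodes; the replication factor of 3 is the HDFS default, not something stated about this deployment.

```python
# Back-of-envelope check of the "multiple terabytes per node" estimate.
# All inputs are hypothetical: the source gives no exact figures.
total_data_tb = 3 * 1000      # assume 3 PB of data, expressed in (decimal) TB
nodes = 600                   # assume "many 100s" means 600 nodes
replication_factor = 3        # HDFS default; assumed, not confirmed for this user

# Data per node before replication is counted
per_node_tb = total_data_tb / nodes

# Physical storage per node once every block is stored 3 times
per_node_replicated_tb = per_node_tb * replication_factor

print(f"{per_node_tb:.1f} TB per node before replication")
print(f"{per_node_replicated_tb:.1f} TB per node with replication")
```

Under these assumed inputs each node holds several terabytes even before replication, which is more than would typically fit in memory and is consistent with the guess that the deployment is disk-based.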

4. HBase has master/master geographically remote replication. I gather that Yahoo replicates between a couple of 1000-node clusters, on behalf of its Flurry operation. HBase also has the technical capability to segment data across geographies — i.e., the geo-partitioning feature essential to data sovereignty compliance — but no actual implementations came to mind.

5. Besides the ones already mentioned, and famed HBase user Facebook, a few other users came up.

In general, Cloudera suggested that HBase was used in a fair number of OEM situations.

6. Finally, I have one number: As of January, 2014 there were 20,000 HBase nodes managed by Cloudera software. Obviously, that number is growing very quickly, and of course there are many HBase nodes that Cloudera has nothing to do with.

Comments

4 Responses to “More notes on HBase”

  1. David Gruzman on March 17th, 2015 6:17 pm

    I think sorting is a very important and relatively rare property among NoSQL systems. The fact that records are stored in key order gives important flexibility in system design. For example, an architect can compose a key from user id and time to get efficient access to all records related to a given user in a given time frame. Other systems, built on consistent hashing, cannot guarantee data proximity, so almost any query will touch all servers in the cluster.
    In other words, the fact that data is sorted makes HBase a much better solution for DWH than many others built on consistent hashing.

  2. Vladimir Rodionov on April 28th, 2015 12:56 am

    >> Cloudera also told me of a Very Famous Company that has many 100s of HBase nodes managing petabytes of mobile device data.

    Apple?

  3. jiri on May 29th, 2015 7:15 pm

    David, I am not sure what you mean by “rare”. Key-sorted files are not only common in the NoSQL world, as in HBase or Cassandra (if you use OrderPreservingPartitioner); they are also common in RDBMS, e.g. Index Organized Tables in Oracle.

  4. David Gruzman on May 30th, 2015 4:30 am

    By rare, I mean that the default choice for most NoSQL systems is consistent hashing, and many of them do not give us any other choice.

    Cassandra, by default, uses hash partitioning to distribute data among nodes, and data is sorted only inside a node.
    A quote from the DataStax documentation says: “Unless absolutely required by your application, DataStax strongly recommends against using the ordered partitioner for the following reasons … “.
    In a nutshell: because of the hot spots.

    Couchbase, Riak, and Voldemort use consistent hashing.

    I do not think their respective designers are unaware of the benefits of sorting, but they are also aware of the toll: the “hot regions” problem, in HBase terms. So they chose to keep their systems simpler and focus on the “access by key” case, where sorting is not required.

    In my view, only NoSQL solutions that are committed to fighting the “hot regions” problem should be considered when sorting is needed. To the best of my knowledge, only HBase is doing so (at least among popular NoSQL systems).
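
The composite-key idea from the first comment above (a row key built from user id and time) can be sketched in a few lines. This is a toy Python model, not the HBase client API: `SortedStore`, `row_key`, and `hash_node` are hypothetical names, and simple hash-modulo placement stands in for consistent hashing.

```python
import bisect
import hashlib


def row_key(user_id: int, ts: int) -> bytes:
    # Fixed-width big-endian fields, so byte order matches (user, time) order.
    return user_id.to_bytes(8, "big") + ts.to_bytes(8, "big")


class SortedStore:
    """Toy stand-in for a table kept in key order (hypothetical, not HBase)."""

    def __init__(self):
        self._keys = []   # row keys, always kept sorted
        self._vals = {}   # row key -> value

    def put(self, key: bytes, value: str) -> None:
        if key not in self._vals:
            bisect.insort(self._keys, key)
        self._vals[key] = value

    def scan(self, start: bytes, stop: bytes):
        # Range scan over [start, stop): two binary searches, one contiguous slice.
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._vals[k]) for k in self._keys[lo:hi]]


def hash_node(key: bytes, n_nodes: int) -> int:
    # Hash-based placement: adjacent keys usually land on unrelated nodes.
    return int.from_bytes(hashlib.md5(key).digest()[:8], "big") % n_nodes


store = SortedStore()
for user in (1, 2, 3):
    for ts in range(100, 110):
        store.put(row_key(user, ts), f"event-{user}-{ts}")

# All events for user 2 with 103 <= ts < 106 sit next to each other.
rows = store.scan(row_key(2, 103), row_key(2, 106))
values = [v for _, v in rows]

# With hash placement, the same user's keys are spread over the nodes.
nodes_touched = {hash_node(row_key(2, ts), 4) for ts in range(100, 110)}
print(values, sorted(nodes_touched))
```

The sorted layout answers the per-user time-range query from one contiguous slice, while the hashed layout generally has to ask several nodes. The flip side, raised in the last comment, is that a sorted layout concentrates writes for adjacent keys in one place, which is the hot-spot problem.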
