December 8, 2013

DataStax/Cassandra update

Cassandra’s reputation in many quarters is:

This has led competitors to use, and get away with, sales claims along the lines of “Well, if you really need geo-distribution and can’t wait for us to catch up — which we soon will! — you should use Cassandra. But otherwise, there are better choices.”

My friends at DataStax, naturally, don’t think that’s quite fair. And so I invited them — specifically Billy Bosworth and Patrick McFadin — to educate me. Here are some highlights of that exercise.

DataStax and Cassandra have some very impressive accounts, which don’t necessarily revolve around geo-distribution. Netflix, probably the flagship Cassandra user — since Cassandra inventor Facebook adopted HBase instead — actually hasn’t been using the geo-distribution feature. Confidential accounts include:

DataStax and Cassandra won’t necessarily win customer-brag wars versus MongoDB, Couchbase, or even HBase, but at least they’re strongly in the competition.

DataStax claims that simplicity is now a strength. There are two main parts to that surprising assertion.

DataStax claims that Cassandra excels at time series use cases, where “time series” seem to equate to collections of short records with timestamps. This seems borne out by, for example, the first three use cases on my bulleted list above. Actually, it’s not just timestamps, but rather any data that is naturally ordered by a sequential field, such as packet IDs from a packet-switching network.

Finally, DataStax claims that Cassandra is good for high-velocity applications in general. A generic example that DataStax supported with some Very Big Names — whether those were of customers or prospects wasn’t entirely clear — was in retailing, to actually serve accurate information as to whether inventory is in stock, something Walmart failed at as recently as last year.

Now let’s talk a bit about Cassandra technology. I’ll start with an example. Imagine a “phone-home” use case in which many devices emit many records each in the form of (DeviceID, TimeStamp, MeterReading) triples.

So in essence, you have schemas that at once are dynamic and tabular. The big downside vs. a relational DBMS is that — duh! — you can’t have the benefits of normalization.

For clarity, I should note that much of Cassandra’s logical architecture is shared by fellow BigTable-architecture data store HBase; it’s not a coincidence that Facebook invented Cassandra to support messaging, nor that when Facebook changed its mind about that, it adopted HBase as the alternative. Accumulo has similar characteristics as well.

Physically, what’s going on in Cassandra is something like this:

Cassandra has few indexes, and no physical concept of datatype.

The benefits I see to this physical architecture are mainly:

For some use cases, that’s not a bad list of advantages. Not bad at all.

Related link

Comments

9 Responses to “DataStax/Cassandra update”

  1. Patrick McFadin on December 8th, 2013 2:09 pm

    Hi Curt,

    I just wanted to provide some clarification on the geo-replication feature usage at Netflix. They have been replicating, just not in an active-active state that is possible with Cassandra. They have just recently announced this project which you can read more about here: http://techblog.netflix.com/2013/12/active-active-for-multi-regional.html

    Active-active architecture isn’t an easy thing to pull off and the engineers there have done an amazing job of creating the tools and services to make it work. The harder part of the infrastructure is data replication, which Cassandra makes much easier, and has allowed them to focus on other key areas.

    Thanks!

    Patrick

  2. David Rydzewski on December 9th, 2013 12:35 am

    Not only does CQL improve usability, it highlights the simplicity of the protocol used in RDBMS. String-based request and tabular response. The server is free to add language features w/out client library compatibility issues with each release.

    A nice feature of Cassandra over HBase is that it is simpler to start small and grow. Cassandra runs in a single JVM (per node of course) and is easy to stand up. HBase depends on Hadoop file system, a bit heavier to get going on, and potentially harder to diagnose issues.

  3. Thomas on December 9th, 2013 7:38 am

    Hi,

    “DataStax claims that Cassandra is good for high-velocity applications”

    What do you (or they) mean by “high-velocity”? The sentence following the quote could imply fast in terms performance, or fast in terms of reaching eventual consistency. The term itself could also mean that Cassandra fits well in agile environments.

    Which is true?

    Thanks :)
    Thomas

  4. Curt Monash on December 9th, 2013 11:29 am

    Thomas,

    I try to use the term “velocity” to refer to updates per unit of time. I’m vague as to the metric (total data volume, number of operations, whatever).

  5. Vlad Rodionov on December 9th, 2013 3:02 pm

    “A generic example that DataStax supported with some Very Big Names — whether those were of customers or prospects wasn’t entirely clear — was in retailing, to actually serve accurate information as to whether inventory is in stock, something Walmart failed at as recently as last year.”

    To actually serve accurate information writes must be consistent across the cluster (in Cassandra lingo – level ALL). It would be interesting to compare Cassandra with consistency level ALL and HBase head to head. I bet HBase will have substantial advantage.

    One more comment:

    I think its HBase Phoenix not WibiData Kiji can be compared to Cassandra’s CQL. Phoenix is SQL front end to HBase.
    https://github.com/forcedotcom/phoenix

    As for traditional Cassandra vs HBase , HBase is more feature rich (Coprocessors, filters, for example). For some applications this can be critical (cell level security, custom aggregates). Application logic can be pushed down to RS to reduce network usage and query latencies.

  6. Curt Monash on December 9th, 2013 4:27 pm

    Is Phoenix a salesforce.com project?

  7. Keshav Murthy on December 9th, 2013 4:56 pm

    Hi Curt,

    regarding “phone-home” devices, you’re right about the pattern — each reading will be (timestamp, value1, value2, …) pattern. This can be stored efficiently if you store it as an array. Number of rows remain static and the table grows in the third dimension.

    If you take the next step, for devices that that are expected to phone home at regular interval (every 30 mins or hour, etc), you can remove the timestamp altogether and simply store the value save additional space. Given a time, you can get its value simply by its logical offset. This is common in smart meters. Databases do have to deal with filling interpolation issue for missing values. When the “phone home” timestamp is non-regular, it efficient to store timestamp with the data.

    Once you store this as an “array”, then the question of efficient traversal, insertion, etc will come into picture. The data does not always get loaded in increasing timestamp/etc.

    For the curious, Informix has a specialized timeseries type, access method, language and fully relational exposure to the data. See some impressive benchmarks at: http://www-01.ibm.com/software/data/informix/smart-meter/

    Interestingly, timeseries isn’t just applicable to high volume/velocity situations… It’s important to handle sensor data at the edge of IoT. Shaspa is using timeseries approach to sensor data management: http://www.tatung.com/en/news2013_09_09.asp

  8. Vlad Rodionov on December 9th, 2013 7:13 pm

    Yep, its salesforce.com.

  9. clive boulton on May 23rd, 2014 12:26 am

    Expedia moved its main travel sites to Cassandra
    http://www.slideshare.net/clibou/seattle-scalability-meetup-10505322

Leave a Reply




Feed: DBMS (database management system), DW (data warehousing), BI (business intelligence), and analytics technology Subscribe to the Monash Research feed via RSS or email:

Login

Search our blogs and white papers

Monash Research blogs

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.