Edit: Multiple errors in the post below have been corrected in a follow-on post about DataStax Enterprise and Cassandra.
My client DataStax is announcing DataStax Enterprise 2.0. The big point of the release is that there’s a bunch of stuff integrated together, including at least:
- Cassandra — the NoSQL DBMS, which DataStax sometimes calls “DataStax Server”. Edit: That’s not really a fair criticism of DataStax’s messaging.
- Hadoop MapReduce, which DataStax sometimes calls “Hadoop”. Edit: That is indeed fair.
- Sqoop — the general way to connect relational DBMS to Hadoop, which DataStax sometimes calls “RDBMS integration”.
- Solr — the search-centric Apache project, or big parts of it, which DataStax generally calls either “Solr” or “Solr compatibility”.
- log4j – an Apache project that has something or other to do with logging, or parts of it, which DataStax sometimes calls “log file integration”.
- DataStax OpsCenter — some management tools and so on around Cassandra and the rest of the product line.
DataStax stresses that all this runs on the same cluster, with the same administrative tools and so on. For example, on a single cluster:
- You can manage the interactive data for a website.
- You can store the logs for that website.
- You can analyze all of the above in Hadoop.
No matter what is going on at a node, I gather that data is stored in the same Cassandra file format, which DataStax calls CFS (Cassandra File System). Edit: Not true. See the follow-on post. DataStax stresses that a node can have a choice of at least two “personalities”, namely:
- Cassandra, which DataStax sometimes calls “real-time”, and which among other things seems to entail talking CQL (Cassandra Query Language).
- Hadoop, which DataStax sometimes calls “batch analytics”.
- (I’m not sure whether Solr is a third such choice. On the one hand, that would seem to be thematic; on the other hand, DataStax hasn’t actually said so to me.) Edit: It is. But the elasticity point below doesn’t include Solr.
New in DataStax Enterprise 2.0, there’s elasticity between these “personalities”; you can fire up a different kind of processing on a node, while leaving the data untouched. DataStax wasn’t able to say what typical replication factors are for the data. For example, is it 3 on Cassandra nodes plus 3 more on Hadoop nodes, or might the total be less than 6? I’m guessing it’s really 3 on Cassandra nodes, so as to get failure-tolerant read-your-writes (RYW) consistency, but Hadoop nodes might not necessarily bring the total up to 6.
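To make the arithmetic behind that guess concrete, here is a minimal sketch. It assumes quorum reads and quorum writes, which is an illustrative assumption rather than anything DataStax stated, and the function names are hypothetical. The underlying rule is the standard one: with N replicas, a read of R replicas is guaranteed to overlap a write of W replicas whenever R + W > N.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def is_read_your_writes(n: int, w: int, r: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    return r + w > n


def tolerated_failures(n: int, w: int, r: int) -> int:
    """Replicas that can be down while both reads and writes still succeed."""
    return n - max(w, r)


# Compare small replication factors under quorum reads and writes.
for n in (1, 2, 3):
    q = quorum(n)
    print(
        f"RF={n} quorum={q} "
        f"RYW={is_read_your_writes(n, q, q)} "
        f"tolerated_failures={tolerated_failures(n, q, q)}"
    )
```

Under these assumptions, 3 is the smallest replication factor that keeps RYW consistency while surviving the loss of one replica, which is why I'd expect it on the Cassandra side.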
Other NoSQL vendors portray Cassandra as likely to win when a cluster needs to be spread around multiple data centers, but not a major contender otherwise. DataStax disputes this, but does cite a need for “continuous availability” as a key driver of adoption.
As you’ve probably gathered by now, I like the core DataStax story — and indeed had some influence on it — but roll my eyes somewhat at the work-in-progress as to how it is phrased and told. The other regrettable fuzziness in DataStax messaging is around customer count. DataStax cites >140 “customers”, but that includes every last outfit that bought a single day of training. On the plus side, DataStax cites a firm figure of 45 employees, and has lots of production use cases it can talk about and extrapolate from.
In particular, DataStax cites customers in areas that include:
- In-game messaging, at a number of gaming companies, which sounds a lot like the application Facebook originally invented Cassandra for, before moving to HBase.
- Various kinds of e-commerce — retail, travel, hospitality. Specific uses include product catalogs (a classic dynamic schema use case), shopping carts (arguably ditto), and user-generated data (reviews, comments, whatever).
- Streaming media — 5-6 “mission-critical production users”, most famously Netflix. I gather this is yet another twist on e-commerce.
- Online ads and campaigns — e.g. Constant Contact.
- Sensor data — mainly one example of auto fleet management DataStax keeps mentioning.
Indeed, Netflix should probably be regarded as the single flagship Cassandra user, even ahead of Twitter (not a DataStax customer). Netflix recently wrote:
“We now have over 55 Cassandra clusters in the cloud and are moving our source of truth from our Datacenter to these Cassandra clusters.”
which compares pretty favorably to an earlier estimate of
“7 clusters in production by end of 2011”