Discussion of DataStax — formerly known as Riptano — a company founded to commercialize Cassandra.
Cassandra’s reputation in many quarters is:
- World-leading in the geo-distribution feature.
- Impressively scalable.
- Hard to use.
This has led competitors to use, and get away with, sales claims along the lines of “Well, if you really need geo-distribution and can’t wait for us to catch up — which we soon will! — you should use Cassandra. But otherwise, there are better choices.”
My friends at DataStax, naturally, don’t think that’s quite fair. And so I invited them — specifically Billy Bosworth and Patrick McFadin — to educate me. Here are some highlights of that exercise.
DataStax and Cassandra have some very impressive accounts, which don’t necessarily revolve around geo-distribution. Netflix, probably the flagship Cassandra user — since Cassandra inventor Facebook adopted HBase instead — actually hasn’t been using the geo-distribution feature. Confidential accounts include:
- A petabyte or so of data at a very prominent company, geo-distributed, with 800+ nodes, in a kind of block storage use case.
- A messaging application at a very prominent company, anticipated to grow to multiple data centers and a petabyte of so of data, across 1000s of nodes.
- A 300 terabyte single-data-center telecom account (which I can’t find on DataStax’s extensive customer list).
- A huge health records deal.
- A Fortune 10 company.
DataStax and Cassandra won’t necessarily win customer-brag wars versus MongoDB, Couchbase, or even HBase, but at least they’re strongly in the competition.
DataStax claims that simplicity is now a strength. There are two main parts to that surprising assertion. Read more
|Categories: Cassandra, Clustering, Couchbase, Data models and architecture, DataStax, Facebook, HBase, Health care, Log analysis, Market share and customer counts, MongoDB and 10gen, NoSQL, Petabyte-scale data management, Specific users||8 Comments|
The 2013 Gartner Magic Quadrant for Operational Database Management Systems is out. “Operational” seems to be Gartner’s term for what I call short-request, in each case the point being that OLTP (OnLine Transaction Processing) is a dubious term when systems omit strict consistency, and when even strictly consistent systems may lack full transactional semantics. As is usually the case with Gartner Magic Quadrants:
- I admire the raw research.
- The opinions contained are generally reasonable (especially since Merv Adrian joined the Gartner team).
- Some of the details are questionable.
- There’s generally an excessive focus on Gartner’s perception of vendors’ business skills, and on vendors’ willingness to parrot all the buzzphrases Gartner wants to hear.
- The trends Gartner highlights are similar to those I see, although our emphasis may be different, and they may leave some important ones out. (Big omission — support for lightweight analytics integrated into operational applications, one of the more genuine forms of real-time analytics.)
Anyhow: Read more
I plan to write about several NewSQL vendors soon, but first here’s an overview post. Like “NoSQL”, the term “NewSQL” has an identifiable, recent coiner — Matt Aslett in 2011 — yet a somewhat fluid meaning. Wikipedia suggests that NewSQL comprises three things:
- OLTP- (OnLine Transaction Processing)/short-request-oriented SQL DBMS that are newer than MySQL.
- Innovative MySQL engines.
- Transparent sharding systems that can be used with, for example, MySQL.
I think that’s a pretty good working definition, and will likely remain one unless or until:
- SQL-oriented and NoSQL-oriented systems blur indistinguishably.
- MySQL (or PostgreSQL) laps the field with innovative features.
To date, NewSQL adoption has been limited.
- NewSQL vendors I’ve written about in the past include Akiban, Tokutek, CodeFutures (dbShards), Clustrix, Schooner (Membrain), VoltDB, ScaleBase, and ScaleDB, with GenieDB and NuoDB coming soon.
- But I’m dubious whether, even taken together, all those vendors have as many customers or production references as any of 10gen, Couchbase, DataStax, or Cloudant.*
That said, the problem may lie more on the supply side than in demand. Developing a competitive SQL DBMS turns out to be harder than developing something in the NoSQL state of the art.
- This is a list of Monash Advantage members.
- All our vendor clients are Monash Advantage members, unless …
- … we work with them primarily in their capacity as technology users. (A large fraction of our user clients happen to be SaaS vendors.)
- We do not usually disclose our user clients.
- We do not usually disclose our venture capital clients, nor those who invest in publicly-traded securities.
- Excluded from this round of disclosure is one vendor I have never written about.
- Included in this round of disclosure is one client paying for services partly in stock. All our other clients are cash-only.
For reasons explained below, I’ll group the clients geographically. Obviously, companies often have multiple locations, but this is approximately how it works from the standpoint of their interactions with me. Read more
My last post about DataStax Enterprise and Cassandra didn’t go so well. As follow-up, I chatted for two hours with Rick Branson and Billy Bosworth of DataStax. Hopefully I can do better this time around.
For starters, let me say there are three kinds of data management nodes in DataStax Enterprise:
- Vanilla Cassandra.
- Cassandra plus Solr. Solr is a superset of the text-indexing system Lucene.
- Solr adds a lot more secondary indexing to Cassandra.
- In addition, these nodes serve as Solr emulation; you can run generic Solr apps on them.
- Cassandra plus Hadoop.
- You can use Hadoop MapReduce to manipulate generic Cassandra data.
- In addition, these nodes serve as Hadoop/HDFS (Hadoop Distributed File System) emulation; you can run generic Hadoop apps on them.
- Hadoop jobs can interweave access to the two kinds of data structure.
Cassandra, Solr, Lucene, and Hadoop are all Apache projects.
If we look at this from the standpoint of DML (Data Manipulation Language) and data access APIs:
- Cassandra is a column-group kind of NoSQL DBMS. You can get at its data programmatically.
- There’s something called CQL (Cassandra Query Language), said to be SQL-like.
- There’s a JDBC driver for CQL.
- With Hadoop MapReduce also come Hive, Pig, and Sqoop.
- With Solr and Lucene come full-text search.
In addition, it is sometimes recommended that you use “in-entity caching”, where an entire data structure (e.g. in JSON) winds up in a single Cassandra column.
The two main ways to get direct SQL* access to data in DataStax Enterprise are:
*or very SQL-like, depending on how you view things
Before going further, let’s recall some Cassandra basics: Read more
|Categories: Cassandra, DataStax, Hadoop, MapReduce, Market share and customer counts, NoSQL, Open source, Text||6 Comments|
Edit: Multiple errors in the post below have been corrected in a follow-on post about DataStax Enterprise and Cassandra.
My client DataStax is announcing DataStax Enterprise 2.0. The big point of the release is that there’s a bunch of stuff integrated together, including at least:
- Cassandra — the NoSQL DBMS, which DataStax sometimes calls “DataStax Server”. Edit: That’s not really a fair criticism of DataStax’s messaging.
- Hadoop MapReduce, which DataStax sometimes calls “Hadoop”. Edit: That is indeed fair.
- Sqoop — the general way to connect relational DBMS to Hadoop, which DataStax sometimes calls “RDBMS integration”.
- Solr — the search-centric Apache project, or big parts of it, which DataStax generally calls either “Solr” or “Solr compatibility”.
- log4j – an Apache project that has something or other to do with logging, or parts of it, which DataStax sometimes calls “log file integration”.
- DataStax OpsCenter — some management tools and so on around Cassandra and the rest of the product line.
DataStax stresses that all this runs on the same cluster, with the same administrative tools and so on. For example, on a single cluster:
- You can manage the interactive data for a web site.
- You can store the logs for that website.
- You can analyze all of the above in Hadoop.
|Categories: Cassandra, Clustering, DataStax, EAI, EII, ETL, ELT, ETLT, Games and virtual worlds, Hadoop, Log analysis, Market share and customer counts, NoSQL, Parallelization, Text, Web analytics||5 Comments|
I checked in with James Phillips for a Couchbase update, and I understand better what’s going on. In particular:
- Give or take minor tweaks, what I wrote in my August, 2010 Couchbase updates still applies.
- Couchbase now and for the foreseeable future has one product line, called Couchbase.
- Couchbase 2.0, the first version of Couchbase (the product) to use CouchDB for persistence, has slipped …
- … because more parts of CouchDB had to be rewritten for performance than Couchbase (the company) had hoped.
- Think mid-year or so for the release of Couchbase 2.0, hopefully sooner.
- In connection with the need to rewrite parts of CouchDB, Couchbase has:
- Gotten out of the single-server CouchDB business.
- Donated its proprietary single-sever CouchDB intellectual property to the Apache Foundation.
- The 150ish new customers in 2011 Couchbase brags about are real, subscription customers.
- Couchbase has 60ish people, headed to >100 over the next few months.
|Categories: Basho and Riak, Cassandra, Couchbase, CouchDB, DataStax, Market share and customer counts, MongoDB and 10gen, NoSQL, Open source, Parallelization, Web analytics, Zynga||6 Comments|
I put up 14 posts over the past week, so perhaps you haven’t had a chance yet to read them all. Highlights included:
- My most important post of the week was a general guide to IT vendor strategy. That one has already spawned discussion at many companies, from the tiny to the multi-billion-dollar.
- The best comment thread of the week was probably on my post about scale-out relational OLTP choices, in which people discussed the merits of various particular alternatives.
- I recommended that people strongly consider attending XLDB 5 in Menlo Park on October 18-19.
Most of the posts, however, were reactions to news events. In particular:
- Teradata announced that Teradata 14 will be hybrid-columnar, more in Vertica’s way than in Greenplum’s or Aster Data’s. (Pay no attention to the Wall Street Journal’s apparent belief that no other analytic DBMS is hybrid-columnar at all.)
- Aster announced the unsurprising news that there will be a Teradata Aster appliance. Also, Aster talked about greater analytic flexibility in the forthcoming Aster 5.0.
- With Oracle OpenWorld coming up, Oracle decided to get some of its announcing out of the way early. In particular, it announced the Oracle Database Appliance, which is small-business-friendly hardware for running the Oracle DBMS. However, the Oracle Database Appliance doesn’t seem to do much about the complexity of running the Oracle DBMS software.
- In a catch-all Hadoop post, I noted that:
- Oracle has now clearly said it has a Hadoop appliance coming, no doubt next week at OpenWorld.
- I still can’t see why Hadoop appliances would succeed, but a lot of smart folks seem to disagree with me.
- Greenplum announced what looks like a nice but unimportant little product upgrade.
- It’s a really good thing that previously reported plans to revamp Hadoop are underway.
- DataStax announced that it really is a Cassandra company after all. Pay no attention to previous marketing that seemed to put DataStax in the same Hadoop-alternative category as, say, MapR.
- Ingres has changed its name to Actian. The announcement seems like a confession that Ingres and VectorWise are going nowhere.
|Categories: Actian and Ingres, Aster Data, Data warehousing, DataStax, Greenplum, Hadoop, Teradata, VectorWise||Leave a Comment|
The DataStax and Cassandra stories are somewhat confusing. Unfortunately, DataStax chose to clarify them in what has turned out to be a crazy news week. I’m going to use this post just to report on the status of the DataStax product line, without going into any analysis beyond that.
Last month, I reviewed with the Aster Data folks which markets they were targeting and selling into, subsequent to acquisition by their new orange overlords. The answers aren’t what they used to be. Aster no longer focuses much on what it used to call frontline (i.e., low-latency, operational) applications; those are of course a key strength for Teradata. Rather, Aster focuses on investigative analytics — they’ve long endorsed my use of the term — and on the batch run/scoring kinds of applications that inform operational systems.
|Categories: Analytic technologies, Application areas, Aster Data, Data warehousing, DataStax, RDF and graphs, Surveillance and privacy, Teradata, Web analytics||1 Comment|