Analysis and discussion of the open source data management project Cassandra. Related subjects include:
My last post about DataStax Enterprise and Cassandra didn’t go so well. As follow-up, I chatted for two hours with Rick Branson and Billy Bosworth of DataStax. Hopefully I can do better this time around.
For starters, let me say there are three kinds of data management nodes in DataStax Enterprise:
- Vanilla Cassandra.
- Cassandra plus Solr. Solr is a superset of the text-indexing system Lucene.
- Solr adds a lot more secondary indexing to Cassandra.
- In addition, these nodes serve as Solr emulation; you can run generic Solr apps on them.
- Cassandra plus Hadoop.
- You can use Hadoop MapReduce to manipulate generic Cassandra data.
- In addition, these nodes serve as Hadoop/HDFS (Hadoop Distributed File System) emulation; you can run generic Hadoop apps on them.
- Hadoop jobs can interweave access to the two kinds of data structure.
Cassandra, Solr, Lucene, and Hadoop are all Apache projects.
If we look at this from the standpoint of DML (Data Manipulation Language) and data access APIs:
- Cassandra is a column-group kind of NoSQL DBMS. You can get at its data programmatically.
- There’s something called CQL (Cassandra Query Language), said to be SQL-like.
- There’s a JDBC driver for CQL.
- With Hadoop MapReduce also come Hive, Pig, and Sqoop.
- With Solr and Lucene come full-text search.
In addition, it is sometimes recommended that you use “in-entity caching”, where an entire data structure (e.g. in JSON) winds up in a single Cassandra column.
The two main ways to get direct SQL* access to data in DataStax Enterprise are:
*or very SQL-like, depending on how you view things
Before going further, let’s recall some Cassandra basics: Read more
|Categories: Cassandra, DataStax, Hadoop, MapReduce, Market share and customer counts, NoSQL, Open source, Text||6 Comments|
Edit: Multiple errors in the post below have been corrected in a follow-on post about DataStax Enterprise and Cassandra.
My client DataStax is announcing DataStax Enterprise 2.0. The big point of the release is that there’s a bunch of stuff integrated together, including at least:
- Cassandra — the NoSQL DBMS, which DataStax sometimes calls “DataStax Server”. Edit: That’s not really a fair criticism of DataStax’s messaging.
- Hadoop MapReduce, which DataStax sometimes calls “Hadoop”. Edit: That is indeed fair.
- Sqoop — the general way to connect relational DBMS to Hadoop, which DataStax sometimes calls “RDBMS integration”.
- Solr — the search-centric Apache project, or big parts of it, which DataStax generally calls either “Solr” or “Solr compatibility”.
- log4j – an Apache project that has something or other to do with logging, or parts of it, which DataStax sometimes calls “log file integration”.
- DataStax OpsCenter — some management tools and so on around Cassandra and the rest of the product line.
DataStax stresses that all this runs on the same cluster, with the same administrative tools and so on. For example, on a single cluster:
- You can manage the interactive data for a web site.
- You can store the logs for that website.
- You can analyze all of the above in Hadoop.
|Categories: Cassandra, Clustering, DataStax, EAI, EII, ETL, ELT, ETLT, Games and virtual worlds, Hadoop, Log analysis, Market share and customer counts, NoSQL, Parallelization, Text, Web analytics||5 Comments|
I checked in with James Phillips for a Couchbase update, and I understand better what’s going on. In particular:
- Give or take minor tweaks, what I wrote in my August, 2010 Couchbase updates still applies.
- Couchbase now and for the foreseeable future has one product line, called Couchbase.
- Couchbase 2.0, the first version of Couchbase (the product) to use CouchDB for persistence, has slipped …
- … because more parts of CouchDB had to be rewritten for performance than Couchbase (the company) had hoped.
- Think mid-year or so for the release of Couchbase 2.0, hopefully sooner.
- In connection with the need to rewrite parts of CouchDB, Couchbase has:
- Gotten out of the single-server CouchDB business.
- Donated its proprietary single-sever CouchDB intellectual property to the Apache Foundation.
- The 150ish new customers in 2011 Couchbase brags about are real, subscription customers.
- Couchbase has 60ish people, headed to >100 over the next few months.
|Categories: Basho and Riak, Cassandra, Couchbase, CouchDB, DataStax, Market share and customer counts, MongoDB, NoSQL, Open source, Parallelization, Web analytics, Zynga||6 Comments|
Recently, I observed that Big Data terminology is seriously broken. It is reasonable to reduce the subject to two quasi-dimensions:
- Bigness — Volume, Velocity, size
- Structure — Variety, Variability, Complexity
- High-velocity “big data” problems are usually high-volume as well.*
- Variety, variability, and complexity all relate to the simply-structured/poly-structured distinction.
But the conflation should stop there.
*Low-volume/high-velocity problems are commonly referred to as “event processing” and/or “streaming”.
When people claim that bigness and structure are the same issue, they oversimplify into mush. So I think we need four pieces of terminology, reflective of a 2×2 matrix of possibilities. For want of better alternatives, my suggestions are:
- Relational big data is data of high volume that fits well into a relational DBMS.
- Multi-structured big data is data of high volume that doesn’t fit well into a relational DBMS. Alternative: Poly-structured big data.
- Conventional relational data is data of not-so-high volume that fits well into a relational DBMS. Alternatives: Ordinary/normal/smaller relational data.
- Smaller poly-structured data is data for which dynamic schema capabilities are important, but which doesn’t rise to “big data” volume.
|Categories: Cassandra, Data models and architecture, Data warehousing, Exadata, Facebook, Google, Hadoop, HBase, Log analysis, Market share and customer counts, MarkLogic, NewSQL, NoSQL, Oracle, Splunk, Yahoo||10 Comments|
Last week I visited with James Phillips of Couchbase, Max Schireson and Eliot Horowitz of 10gen, and Todd Lipcon, Eric Sammer, and Omer Trajman of Cloudera. I guess it’s time for a round-up NoSQL post.
Views of the NoSQL market horse race are reasonably consistent, with perhaps some elements of “Where you stand depends upon where you sit.”
- As James tells it, NoSQL is simply a three-horse race between Couchbase, MongoDB, and Cassandra.
- Max would include HBase on the list.
- Further, Max pointed out that metrics such as job listings suggest MongoDB has the most development activity, and Couchbase/Membase/CouchDB perhaps have less.
- The Cloudera guys remarked on some serious HBase adopters.*
- Everybody I spoke with agreed that Riak had little current market presence, although some Basho guys could surely be found who’d disagree.
|Categories: Basho and Riak, Cassandra, Cloudera, Clustering, Couchbase, HBase, Market share and customer counts, MongoDB, NoSQL, Open source, Oracle, Parallelization||12 Comments|
A reporter tweeted: “Is there a simple plain English definition for NoSQL?” After reminding him of my cynical yet accurate Third Law of Commercial Semantics, I gave it a serious try, and came up with the following. More precisely, I tweeted the bolded parts of what’s below; the rest is commentary added for this post.
NoSQL is most easily defined by what it excludes: SQL, joins, strong analytic alternatives to those, and some forms of database integrity. If you leave all four out, and you have a strong scale-out story, you’re in the NoSQL mainstream. Read more
|Categories: Cassandra, dbShards and CodeFutures, MarkLogic, MySQL, Object, Open source, Petabyte-scale data management, Schooner Information Technology||7 Comments|
The DataStax and Cassandra stories are somewhat confusing. Unfortunately, DataStax chose to clarify them in what has turned out to be a crazy news week. I’m going to use this post just to report on the status of the DataStax product line, without going into any analysis beyond that.
I decided I needed some Couchbase drilldown, on business and technology alike, so I had solid chats with both CEO Bob Wiederhold and Chief Architect Dustin Sallings. Pretty much everything I wrote at the time Membase and CouchOne merged to form Couchbase (the company) still holds up. But I have more detail now.
Context for any comments on customer traction includes:
- Membase went into limited production release in October, and full release in January. Similar things are true of CouchDB.
- Hence, most sales of Couchbase’s products have been made over the past 6 months.
- Couchbase (the merged product) is at this point only in a pre-production developer’s release.
- Couchbase has both a direct sales force and a classic open-source “funnel”-based online selling model. Naturally, Couchbase’s understanding of what its customers are doing is more solid with respect to the direct sales base.
- Most of Couchbase’s revenue to date seems to have come from a limited number of big-ticket “lighthouse” accounts (as opposed to, say, the larger number of smaller deals that come in through the online funnel).
- Most Membase purchases are for new applications, as opposed to memcached migrations. However, customers are the kinds of companies that probably also are using memcached elsewhere.
- Most other Membase purchases are replacements for the Membase/MySQL combination. Bob says those are easy sales with short sales cycles.
- Pure memcached support is a small but non-zero business for Couchbase, and a fine source of upsell opportunities.
- In the pipeline but not so much yet in the customer base are SaaS vendors and the like who use and may want to replace traditional DBMS such as Oracle. Other than among those, Couchbase doesn’t compete much yet with Oracle et al.
- Pure CouchDB isn’t all that much of a business, at least relative to community size, as CouchDB is a single-server product commonly used by people who are content not to pay for support.
Membase sales are concentrated in five kinds of internet-centric companies, which in declining order are: Read more
As a follow-up to the latest Stonebraker kerfuffle, Derrick Harris asked me a bunch of smart followup questions. My responses and afterthoughts include:
- Facebook et al. are in effect Software as a Service (SaaS) vendors, not enterprise technology users. In particular:
- They have the technical chops to rewrite their code as needed.
- Unlike packaged software vendors, they’re not answerable to anybody for keeping legacy code alive after a rewrite. That makes migration a lot easier.
- If they want to write different parts of their system on different technical underpinnings, nobody can stop them. For example …
- … Facebook innovated Cassandra, and is now heavily committed to HBase.
- It makes little sense to talk of Facebook’s use of “MySQL.” Better to talk of Facebook’s use of “MySQL + memcached + non-transparent sharding.” That said:
- It’s hard to see why somebody today would use MySQL + memcached + non-transparent sharding for a new project. At least one of Couchbase or transparently-sharded MySQL is very likely a superior alternative. Other alternatives might be better yet.
- As noted above in the example of Facebook, the many major web businesses that are using MySQL + memcached + non-transparent sharding for existing projects can be presumed able to migrate away from that stack as the need arises.
Continuing with that discussion of DBMS alternatives:
- If you just want to write to the memcached API anyway, why not go with Couchbase?
- If you want to go relational, why not go with MySQL? There are many alternatives for scaling or accelerating MySQL — dbShards, Schooner, Akiban, Tokutek, ScaleBase, ScaleDB, Clustrix, and Xeround come to mind quickly, so there’s a great chance that one or more will fit your use case. (And if you don’t get the choice of MySQL flavor right the first time, porting to another one shouldn’t be all THAT awful.)
- If you really, really want to go in-memory, and don’t mind writing Java stored procedures, and don’t need to do the kinds of joins it isn’t good at, but do need to do the kinds of joins it is, VoltDB could indeed be a good alternative.
And while we’re at it — going schema-free often makes a whole lot of sense. I need to write much more about the point, but for now let’s just say that I look favorably on the Big Four schema-free/NoSQL options of MongoDB, Couchbase, HBase, and Cassandra.
We hear much these days about unstructured or semi-structured (as opposed to) structured data. Those are misnomers, however, for at least two reasons. First, it’s not really the data that people think is un-, semi-, or fully structured; it’s databases.* Relational databases are highly structured, but the data within them is unstructured — just lists of numbers or character strings, whose only significance derives from the structure that the database imposes.
*Here I’m using the term “database” literally, rather than as a concise synonym for “database management system”. But see below.
Second, a more accurate distinction is not whether a database has one structure or none – it’s whether a database has one structure or many. The easiest way to see this is for databases that have clearly-defined schemas. A relational database has one schema (even if it is just the union of various unrelated sub-schemas); an XML database, however, can have as many schemas as it contains documents.
One small terminological problem is easily handled, namely that people don’t talk about true databases very often, at least when they’re discussing generalities; rather, they talk about data and DBMS.* So let’s talk of DBMS being “structured” singly or multiply or whatever, just as the databases they’re designed to manage are.
*And they refer to the DBMS as “databases,” because they don’t have much other use for the word.
All that said — I think that single vs. multiple database structures isn’t a bright-line binary distinction; rather, it’s a spectrum. For example: Read more
|Categories: Cassandra, Couchbase, Data models and architecture, HBase, IBM and DB2, MarkLogic, MongoDB, NoSQL, Splunk, Theory and architecture||21 Comments|