I last wrote about Couchbase in November, 2012, around the time of Couchbase 2.0. One of the many new features I mentioned then was secondary indexing. Ravi Mayuram just checked in to tell me about Couchbase 4.0. One of the important new features he mentioned was what I think he said was Couchbase’s “first version” of secondary indexing. Obviously, I’m confused.
Now that you’re duly warned, let me remind you of aspects of Couchbase timeline.
- 2 corporate name changes ago, Couchbase was organized to commercialize memcached. memcached, of course, was internet companies’ default way to scale out short-request processing before the rise of NoSQL, typically backed by manually sharded MySQL.
- Couchbase’s original value proposition, under the name Membase, was to provide persistence and of course support for memcached. This later grew into a caching-oriented pitch even to customers who weren’t already memcached users.
- A merger with the makers of CouchDB ensued, with the intention of replacing Membase’s SQLite back end with CouchDB at the same time as JSON support was introduced. This went badly.
- By now, however, Couchbase sells for more than distributed cache use cases. Ravi rattled off a variety of big-name customer examples for system-of-record kinds of use cases, especially in session logging (duh) and also in travel reservations.
- Couchbase 4.0 has been in beta for a few months.
Technical notes on Couchbase 4.0 — and related riffs — start:
- There’s a new SQL-like language called N1QL (pronounced like “nickel”). I’m hearing a lot about SQL-on-NoSQL these days. More on that below.
- “Index”, “data” and “query” are three different services/tiers.
- You can run them all on the same nodes or separately. Couchbase doesn’t have enough experience yet with the technology to know which choice will wind up as a best practice.
- I’m hearing a lot about heterogeneous-node/multi-tier DBMS architectures these days, and would no longer stand by my 2009 statement that they are unusual. Other examples include Oracle Exadata, MySQL, MongoDB (now that it has pluggable storage engines), MarkLogic, and of course the whole worlds of Hadoop and Spark.
- To be clear — the secondary indexes are global, and not tied to the same nodes as the data they index.
- There’s a new back end called ForestDB, but if I understood correctly, it’s used just for the indexes, not for the underlying data.
- ForestDB represents Couchbase indexes in something that resembles b-trees, but also relies on tries. Indeed, if I’m reading the relevant poster correctly, it’s based on a trie of b-trees.
- In another increasingly common trend, Couchbase uses Bloom filters to help decide which partitions to retrieve for any particular query.
Up to a point, SQL-on-NoSQL stories can be fairly straightforward.
- You define some kind of a table,* perhaps in a SQL-like DDL (Data Description Language).
- SELECT, FROM and WHERE clauses work in the usual way.
- Hopefully, if a column is going to have a lot of WHERE clauses on it, it also has an index.
For example, I think that’s the idea behind most ODBC/JDBC drivers for NoSQL systems. I think it’s also the idea behind most “SQL-like” languages that NoSQL vendors ship.
*Nobody I talk to about this ever wants to call it a “view”, but it sure sounds like a view to me — not a materialized view, of course, but a view nonetheless.
JOIN syntax can actually be straightforward as well under these assumptions. As for JOIN execution, Couchbase pulls all the data into the relevant tier, and nested loop execution there. My new clients at SequoiaDB have a similar strategy, by the way, although in their case there’s a hash join option as well.
But if things stopped there, they would miss an important complication: NoSQL has nested data. I.e., a value can actually be an array, whose entries are arrays themselves, and so on. That said, the “turtles all the way down” joke doesn’t quite apply, because at some point there are actual scalar or string values, and those are the ones SQL wants to actually operate on.
Most approaches I know of to that problem boil down to identifying particular fields as table columns, with or without aliases/renaming; I think that’s the old Hadapt/Vertica strategy, for example. Couchbase claims to be doing something a little different however, with a SQL-extending operator called UNNEST. Truth be told, I’m finding the N1QL language reference a bit terse, and haven’t figured out what the practical differences vs. the usual approach are, if any. But it sounds like there may be some interesting ideas in there somewhere.