Couchbase technical update
My Couchbase business update with Bob Wiederhold was very interesting, but it didn’t answer much about the actual Couchbase product. For that, I talked with Dustin Sallings. We jumped around a lot, and some important parts of the Couchbase product haven’t had their designs locked down yet anyway. But here’s at least a partial explanation of what’s up.
memcached is a way to cache data in RAM across a cluster of servers and have it all look logically like a single memory pool, extremely popular among large internet companies. The Membase product — which is what Couchbase has been selling this year — adds persistence to memcached, an obvious improvement on requiring application developers to write both to memcached and to non-transparently-sharded MySQL. The main technical points in adding persistence seem to have been:
- A persistent backing store (duh), namely SQLite.
- A change to the hashing algorithm, to avoid losing data when the cluster configuration is changed.
Couchbase is essentially Membase improved by integrating CouchDB into it, with the main changes being:
- Changing the backing store to CouchDB (duh). This will be in the first Couchbase release.
- Adding cross data center replication on CouchDB’s consistency model. This will not, I believe, be in the first Couchbase release.
- Offering CouchDB’s programming and query interfaces as an option. So far as I can tell, this will be implemented straightforwardly in the first Couchbase release, with elegance planned for later down the road.
Let’s drill down a bit into Membase/Couchbase clustering and consistency. Read more
Categories: Cache, Clustering, Couchbase, memcached, Memory-centric data management, MySQL, Parallelization, Solid-state memory | 8 Comments |
Couchbase business update
I decided I needed some Couchbase drilldown, on business and technology alike, so I had solid chats with both CEO Bob Wiederhold and Chief Architect Dustin Sallings. Pretty much everything I wrote at the time Membase and CouchOne merged to form Couchbase (the company) still holds up. But I have more detail now. 😉
Context for any comments on customer traction includes:
- Membase went into limited production release in October, and full release in January. Similar things are true of CouchDB.
- Hence, most sales of Couchbase’s products have been made over the past 6 months.
- Couchbase (the merged product) is at this point only in a pre-production developer’s release.
- Couchbase has both a direct sales force and a classic open-source “funnel”-based online selling model. Naturally, Couchbase’s understanding of what its customers are doing is more solid with respect to the direct sales base.
- Most of Couchbase’s revenue to date seems to have come from a limited number of big-ticket “lighthouse” accounts (as opposed to, say, the larger number of smaller deals that come in through the online funnel).
That said,
- Most Membase purchases are for new applications, as opposed to memcached migrations. However, customers are the kinds of companies that probably also are using memcached elsewhere.
- Most other Membase purchases are replacements for the Membase/MySQL combination. Bob says those are easy sales with short sales cycles.
- Pure memcached support is a small but non-zero business for Couchbase, and a fine source of upsell opportunities.
- In the pipeline but not so much yet in the customer base are SaaS vendors and the like who use and may want to replace traditional DBMS such as Oracle. Other than among those, Couchbase doesn’t compete much yet with Oracle et al.
- Pure CouchDB isn’t all that much of a business, at least relative to community size, as CouchDB is a single-server product commonly used by people who are content not to pay for support.
Membase sales are concentrated in five kinds of internet-centric companies, which in declining order are: Read more
Terminology: Dynamic- vs. fixed-schema databases
E. F. “Ted” Codd taught the computing world that databases should have fixed logical schemas (which protect the user from having to know about physical database organization). But he may not have been as universally correct as he thought. Cases I’ve noted in which fixed schemas may be problematic include:
- “A bunch of apps in one, similar but not the same” (in my recent post on MongoDB).
- Out-of-control product catalogs (ditto).
- Analytic use cases in which one keeps enhancing the database with derived data.
And if marketing profile analysis is ever done correctly, that will be a huge example for the list.
So what do we call those DBMS — for example NoSQL, object-oriented, or XML-based systems — that bake the schema into the applications or the records themselves? In the MongoDB post I went with “schemaless,” but I wasn’t really comfortable with that, so I took the discussion to Twitter. Comments from Vlad Didenko (in particular), Ryan Prociuk, Merv Adrian, and Roland Bouman favored the idea that schemas in such systems are changeable or late-bound, rather than entirely absent. I quickly agreed.
Categories: Data models and architecture, NoSQL, Object, Structured documents | 51 Comments |
The Ted Codd guarantee
I write a lot about whether or not to use relational DBMS. For example:
- In May I surveyed relational vs. non-relational pros and cons at some length.
- Last November I mused about when it might be OK to do without joins.
- The question is implicit in a variety of posts about, say, document-oriented or object-oriented DBMS.
Before going further in that vein, I’d like to do a quick review of what E. F. “Ted” Codd was getting at with the relational model in the first place. Read more
Categories: Data models and architecture, IBM and DB2, MOLAP, NoSQL | 3 Comments |
MongoDB users and use cases
I spoke with Eliot Horowitz and Max Schierson of 10gen last month about MongoDB users and use cases. The biggest clusters they came up with weren’t much over 100 nodes, but clusters an order of magnitude bigger were under development. The 100 node one we talked the most about had 33 replica sets, each with about 100 gigabytes of data, so that’s in the 3-4 terabyte range total. In general, the largest MongoDB databases are 20-30 TB; I’d guess those really do use the bulk of available disk space. Read more
Categories: Data models and architecture, Games and virtual worlds, Log analysis, MongoDB, NoSQL, Solid-state memory, Specific users, Splunk, Telecommunications, Web analytics | 13 Comments |
Introduction to Zettaset
Zettaset is confusing, but as best I understand:
- Zettaset sells Hadoop add-on/enhancement software, with what might be called an enterprise-friendly Hadoop management focus.
- “Business intelligence” gets mentioned prominently in Zettaset’s marketing, but not in what executives now say. Apparently the BI focus is old news, predating a hard pivot.
- Zettaset’s marketing also mentions NoSQL, for little reason that I can discern, except insofar as Zettaset relies on HBase.
- CEO Brian Christian told me that Zettaset has been around since December, 2007; on the other hand, Zettaset press release boilerplate says Zettaset was founded in 2009. Apparently, the distinction is that Zettaset was founded in 2007 as a consultancy, but turned its efforts to software development in 2009.
- Zettaset has fewer than 20 people but is “hiring like mad.”
- Zettaset just did a $3 million Series A round — or maybe is just announcing it now; the latter interpretation might explain how those 20ish people are getting paid.
- Zettaset’s product was just launched and made generally available, notwithstanding that Version 2 of Zettaset’s product was shipped last year to fanfare on Zettaset’s blog.
- Zettaset’s pricing is based on how many terabytes of compressed data it is being used to manage.
- Until very recently, Zettaset was called GOTO Metrics; I imagine the name change is connected to the strategy pivot.
- Zettaset told me of one big customer — with an almost-petabyte Hadoop cluster before compression — namely Zions Bancorporation.
- Zettaset has “a number of paying customers” overall.
Categories: Hadoop, HBase, MapReduce, Market share and customer counts, Specific users, Zettaset | 4 Comments |
Remote machine-generated data
I refer often to machine-generated data, which is commonly generated inexpensively and in log-like formats, and is often best aggregated in a big bit bucket before you try to do much analysis on it. The term has caught on, to the point that perhaps it’s time to distinguish more carefully among different kinds of machine-generated data. In particular, I think it may be useful to distinguish between:
- Log-stream machine-generated data, when what you’re looking at — at least initially — is the entire output of verbose logging systems.
- Remote machine-generated data.
Here’s what I’m thinking of for the second category. I rather frequently hear of cases in which data is generated by large numbers of remote machines, which occasionally send messages home. For example: Read more
Categories: Analytic technologies, Cloud computing, Log analysis, MySQL, Netezza, Splunk, Truviso | 2 Comments |
McObject and eXtremeDB
I talked with McObject yesterday. McObject has two product lines, both of which are something like in-memory DBMS — eXtremeDB, which is the main one, and Perst. McObject has been around since at least 2003, probably has no venture capital, and probably has a very low double-digit number of employees.*
*I could be wrong in those guesses; as small companies go, McObject is unusually prone to secrecy games.
As best I understand:
- eXtremeDB is something like an in-memory object-oriented DBMS, designed to be embeddable.
- However, much as with Objectivity and other old-school OODBMS, eXtremeDB winds up being more of a toolkit with which to build DBMS than a full DBMS.
- eXtremeDB has a few indexing schemes. The main one is good old B-trees. One customer wanted Patricia tries, so they’re in there. (Perhaps not coincidentally, solidDB relies on Patricia tries.) At least one wanted R-trees, so they’re in there too.
- eXtremeDB has long had the option of persistent logs.
- eXtremeDB newly has a hybrid memory-centric option, in which you can have more data in the database than fits into RAM.
- eXtremeDB newly has multi-master two-phase-commit clustering.
My guess three years ago that eXtremeDB might emerge as an alternative to solidDB seems to have been borne out. McObject CEO Steve Graves says that the core of McObject’s business is OEMs, in sectors such as telecom equipment and defense/aerospace. That’s exactly solidDB’s traditional market, except that solidDB got acquired by IBM and deemphasized it.
I’ve said before that if I were starting a SaaS effort — and it wasn’t just focused on analytics — I’d look at using a memory-centric OODBMS. Perhaps eXtremeDB is worth looking at in such scenarios.
Categories: In-memory DBMS, McObject, Memory-centric data management, Object, Objectivity and Infinite Graph, solidDB, Telecommunications | 11 Comments |
HBase is not broken
It turns out that my impression that HBase is broken was unfounded, in at least two ways. The smaller is that something wrong with the HBase/Hadoop interface or Hadoop’s HBase support cannot necessarily be said to be wrong with HBase (especially since HBase is no longer a Hadoop subproject). The bigger reason is that, according to consensus, HBase has worked pretty well since the .90 release in January of this year.
After Michael Stack of StumbleUpon beat me up for a while,* Omer Trajman of Cloudera was kind enough to walk me through HBase usage. He is informed largely by 18 Cloudera customers, plus a handful of other well-known HBase users such as Facebook, StumbleUpon, and Yahoo. Of the 18 Cloudera customers using HBase that Omer was thinking of, 15 are in HBase production, one is in HBase “early production”, one is still doing R&D in the area of HBase, and one is a classified government customer not providing such details. Read more
Categories: Cloudera, Derived data, Facebook, Hadoop, HBase, Log analysis, Market share and customer counts, Open source, Specific users, Web analytics | 6 Comments |
Soundbites: the Facebook/MySQL/NoSQL/VoltDB/Stonebraker flap, continued
As a follow-up to the latest Stonebraker kerfuffle, Derrick Harris asked me a bunch of smart followup questions. My responses and afterthoughts include:
- Facebook et al. are in effect Software as a Service (SaaS) vendors, not enterprise technology users. In particular:
- They have the technical chops to rewrite their code as needed.
- Unlike packaged software vendors, they’re not answerable to anybody for keeping legacy code alive after a rewrite. That makes migration a lot easier.
- If they want to write different parts of their system on different technical underpinnings, nobody can stop them. For example …
- … Facebook innovated Cassandra, and is now heavily committed to HBase.
- It makes little sense to talk of Facebook’s use of “MySQL.” Better to talk of Facebook’s use of “MySQL + memcached + non-transparent sharding.” That said:
- It’s hard to see why somebody today would use MySQL + memcached + non-transparent sharding for a new project. At least one of Couchbase or transparently-sharded MySQL is very likely a superior alternative. Other alternatives might be better yet.
- As noted above in the example of Facebook, the many major web businesses that are using MySQL + memcached + non-transparent sharding for existing projects can be presumed able to migrate away from that stack as the need arises.
Continuing with that discussion of DBMS alternatives:
- If you just want to write to the memcached API anyway, why not go with Couchbase?
- If you want to go relational, why not go with MySQL? There are many alternatives for scaling or accelerating MySQL — dbShards, Schooner, Akiban, Tokutek, ScaleBase, ScaleDB, Clustrix, and Xeround come to mind quickly, so there’s a great chance that one or more will fit your use case. (And if you don’t get the choice of MySQL flavor right the first time, porting to another one shouldn’t be all THAT awful.)
- If you really, really want to go in-memory, and don’t mind writing Java stored procedures, and don’t need to do the kinds of joins it isn’t good at, but do need to do the kinds of joins it is, VoltDB could indeed be a good alternative.
And while we’re at it — going schema-free often makes a whole lot of sense. I need to write much more about the point, but for now let’s just say that I look favorably on the Big Four schema-free/NoSQL options of MongoDB, Couchbase, HBase, and Cassandra.