Hmm. I probably should have broken this out as three posts rather than one after all. Sorry about that.
Discussions of DBMS performance are always odd, for starters because:
- Workloads and use cases vary greatly.
- In particular, benchmarks such as the YCSB or TPC-H aren’t very helpful.
- It’s common for databases or at least working sets to be entirely in RAM — but it’s not always required.
- Consistency and durability models vary. What’s more, in some systems — e.g. MongoDB — there’s considerable flexibility as to which model you use.
- In particular, there’s an increasingly common choice in which data is written synchronously to RAM on 2 or more servers, then asynchronously to disk on each of them. Performance in these cases can be quite different from when all writes need to be committed to disk. Of course, you need sufficient disk I/O to keep up, so SSDs (Solid-State Drives) can come in handy.
- Many workloads are inherently single node (replication aside). Others are not.
MongoDB and 10gen
I caught up with Ron Avnur at 10gen. Technical highlights included: Read more
Perhaps the single toughest question in all database technology is: Which different purposes can a single data store serve well? — or to phrase it more technically — Which different usage patterns can a single data store support efficiently? Ted Codd was on multiple sides of that issue, first suggesting that relational DBMS could do everything and then averring they could not. Mike Stonebraker too has been on multiple sides, first introducing universal DBMS attempts with Postgres and Illustra/Informix, then more recently suggesting the world needs 9 or so kinds of database technology. As for me — well, I agreed with Mike both times.
Since this is MUCH too big a subject for a single blog post, what I’ll do in this one is simply race through some background material. To a first approximation, this whole discussion is mainly about data layouts — but only if we interpret that concept broadly enough to comprise:
- Every level of storage (disk, RAM, etc.).
- Indexes, aggregates and raw data alike.
To date, nobody has ever discovered a data layout that is efficient for all usage patterns. As a general rule, simpler data layouts are often faster to write, while fancier ones can boost query performance. Specific tradeoffs include, but hardly are limited to: Read more
In typical debates, the extremists on both sides are wrong. “SQL vs. NoSQL” is an example of that rule. For many traditional categories of database or application, it is reasonable to say:
- Relational databases are usually still a good default assumption …
- … but increasingly often, the default should be overridden with a more useful alternative.
Reasons to abandon SQL in any given area usually start:
- Creating a traditional relational schema is possible …
- … but it’s tedious or difficult …
- … especially since schema design is supposed to be done before you start coding.
Some would further say that NoSQL is cheaper, scales better, is cooler or whatever, but given the range of NewSQL alternatives, those claims are often overstated.
Sectors where these reasons kick in include but are not limited to: Read more
|Categories: Health care, Investment research and trading, Log analysis, NewSQL, NoSQL, Web analytics||8 Comments|
Two different vendors recently tried to inflict benchmarks on me. Both were YCSBs, so I decided to look up what the YCSB (Yahoo! Cloud Serving Benchmark) actually is. It turns out that the YCSB:
- Was developed by — you guessed it! — Yahoo.
- Is meant to simulate workloads that fetch web pages, including the writing portions of those workloads.
- Was developed with NoSQL data managers in mind.
- Bakes in one kind of sensitivity analysis — latency vs. throughput.
- Is implemented in extensible open source code.
That actually sounds pretty good, especially the extensibility part;* it’s likely that the YCSB can be useful in a variety of product selection scenarios. Still, as recent examples show, benchmark marketing is an annoying blight upon the database industry.
*With extensibility you can test your own workloads and do your own sensitivity analyses.
Alternate title: TokuDB updates
Tokutek turns a performance argument into a functionality one. In particular, Tokutek claims that TokuDB does a much better job than alternatives of making it practical for you to update indexes at OLTP speeds. Hence, it claims to do a much better job than alternatives of making it practical for you to write and execute queries that only make sense when indexes (or other analytic performance boosts) are in place.
That’s all been true since I first wrote about Tokutek and TokuDB in 2009. However, TokuDB’s technical details have changed. In particular, Tokutek has deemphasized the ideas that:
- Vaguely justified the “fractal” metaphor, namely …
- … the stuff in that post about having one block each sized for each power of 2, …
- … which seem to be a form of what is more ordinarily called “cache-oblivious” technology.
Rather, Tokutek’s new focus for getting the same benefits is to provide a separate buffer for each node of a b-tree. In essence, Tokutek is taking the usual “big blocks are better” story and extending it to indexes. TokuDB also uses block-level compression. Notes on that include: Read more
|Categories: Akiban, Database compression, Market share and customer counts, NewSQL, Tokutek and TokuDB||7 Comments|
NuoDB has an interesting NewSQL story. NuoDB’s core design goals seem to be:
- Very flexible topology, including:
- Local replicas.
- Remote replicas.
- Easy deployment and management.
GenieDB is one of the newer and smaller NewSQL companies. GenieDB’s story is focused on wide-area replication and uptime, coupled to claims about ease and the associated low TCO (Total Cost of Ownership).
GenieDB is in my same family of clients as Cirro.
The GenieDB product is more interesting if we conflate the existing GenieDB Version 1 and a soon-forthcoming (mid-year or so) Version 2. On that basis:
- GenieDB has three tiers.
- GenieDB’s top tier is the usual MySQL front-end.
- GenieDB’s bottom tier is either Berkeley DB or a conventional MySQL storage engine.
- GenieDB’s bottom tier stores your entire database at every node.
- If you replicate locally, GenieDB’s middle tier operates a distributed cache.
- If you replicate wide-area, GenieDB’s middle tier allows active-active/multi-master replication.
The heart of the GenieDB story is probably wide-area replication. Specifics there include: Read more
|Categories: Cache, Cloud computing, Clustering, GenieDB, Market share and customer counts, MySQL, NewSQL||4 Comments|
I plan to write about several NewSQL vendors soon, but first here’s an overview post. Like “NoSQL”, the term “NewSQL” has an identifiable, recent coiner — Matt Aslett in 2011 — yet a somewhat fluid meaning. Wikipedia suggests that NewSQL comprises three things:
- OLTP- (OnLine Transaction Processing)/short-request-oriented SQL DBMS that are newer than MySQL.
- Innovative MySQL engines.
- Transparent sharding systems that can be used with, for example, MySQL.
I think that’s a pretty good working definition, and will likely remain one unless or until:
- SQL-oriented and NoSQL-oriented systems blur indistinguishably.
- MySQL (or PostgreSQL) laps the field with innovative features.
To date, NewSQL adoption has been limited.
- NewSQL vendors I’ve written about in the past include Akiban, Tokutek, CodeFutures (dbShards), Clustrix, Schooner (Membrain), VoltDB, ScaleBase, and ScaleDB, with GenieDB and NuoDB coming soon.
- But I’m dubious whether, even taken together, all those vendors have as many customers or production references as any of 10gen, Couchbase, DataStax, or Cloudant.*
That said, the problem may lie more on the supply side than in demand. Developing a competitive SQL DBMS turns out to be harder than developing something in the NoSQL state of the art.
I’m usually annoyed by lists of year-end predictions. Still, a reporter asked me for some, and I found one kind I was comfortable making.
Trends that I think will continue in 2013 include:
Growing attention to machine-generated data. Human-generated data grows at the rate business activity does, plus 0-25%. Machine-generated data grows at the rate of Moore’s Law, also plus 0-25%, which is a much higher total. In particular, the use of remote machine-generated data is becoming increasingly real.
Hadoop adoption. Everybody has the big bit bucket use case, largely because of machine-generated data. Even today’s technology is plenty good enough for that purpose, and hence justifies initial Hadoop adoption. Development of further Hadoop technology, which I post about frequently, is rapid. And so the Hadoop trend is very real.
Application SaaS. The on-premises application software industry has hopeless problems with product complexity and rigidity. Any suite new enough to cut the Gordian Knot is or will be SaaS (Software as a Service).
Newer BI interfaces. Advanced visualization — e.g. Tableau or QlikView — and mobile BI are both hot. So, more speculatively, are “social” BI (Business Intelligence) interfaces.
Price discounts. If you buy software at 50% of list price, you’re probably doing it wrong. Even 25% can be too high.
MySQL alternatives. NoSQL and NewSQL products often are developed as MySQL alternatives. Oracle has actually done a good job on MySQL technology, but now its business practices are scaring companies away from MySQL commitments, and newer short-request SQL DBMS are ready for use.
|Categories: Business intelligence, Hadoop, MySQL, NewSQL, NoSQL, Open source, Oracle, Pricing, Software as a Service (SaaS), Surveillance and privacy||3 Comments|
I have a bunch of backlogged post subjects in or around short-request processing, based on ongoing conversations with my clients at Akiban, Cloudant, Code Futures (dbShards), DataStax (Cassandra) and others. Let’s start with Akiban. When I posted about Akiban two years ago, it was reasonable to say:
- Akiban is in the short-request DBMS business.
- MySQL compatibility is one way to access Akiban, but it’s not the whole story.
- Akiban’s main point of technical differentiation is to arrange data hierarchically on disk so that many joins are “zero-cost”.
- Walking the hierarchy isn’t a great way to get at data for every possible query; Akiban recognizes the need for other access techniques as well.
All of the above are still true. But unsurprisingly, plenty of the supporting details have changed. Read more