Clustering – DBMS 2 : DataBase Management System Services

Differentiation in data management

Curt Monash — Mon, 26 Oct 2015 19:32:34 +0000

In the previous post I broke product differentiation into 6-8 overlapping categories, which may be abbreviated as:

Scope
Accuracy
(Other) trustworthiness
Speed
User experience
Cost

and sometimes also issues in adoption and administration.

Now let’s use this framework to examine two market categories I cover: data management and, in a separate post, business intelligence.

Applying this taxonomy to data management:

Scope: Different subcategories of data management technology are suitable for different kinds of data, different scales of data, etc. To a lesser extent that may be true within a subcategory as well.
Scope: Further, products may differ in what you can do with the data, especially analytically.
Accuracy: Don’t … lose … data.
Other trustworthiness:
- Uptime, availability and so on are big deals in many data management sectors.
- Security is hugely important for data that both belongs to other people — usually your customers — and is accessible via the internet. It’s important in numerous other database use cases as well.
- Awkwardly, the CAP Theorem teaches us that there can be a bit of a trade-off between availability and (temporary) accuracy.
Speed:
- Different kinds of data management products perform differently in different use cases.
- If your use case is down the middle of what a mature data management subsector focuses on, performance may not vary much among individual leading products.
- Even then, tuning effort may be quite different for different products.
User experience:
- Users rarely interact directly with database management products.
- There can be clear differentiation in database administration UIs. (The most dramatic example was perhaps the rise of Microsoft SQL Server.)
- Data manipulation languages (DMLs) can make a huge difference in programmers’ lives.
Cost:
- License and maintenance costs can be a huge issue, especially if you’re buying from traditional vendors.
- Performance affects cost in a few ways: hardware costs for sure, tuning effort in some cases, and occasionally even vendor license/maintenance fees.
- Ongoing operations costs can vary greatly by database product in general, and by your pre-existing in-house expertise in particular.
- Ease of programming can sometimes lead to significant programming cost differences as well.
Adoption: This one is often misunderstood.
- The effort of adopting new database technology for new applications is often overrated. When projects are huge, it’s often because of what you’re doing with the technology, not because of the technology itself.
- Migration, however, is usually a bitch.

For reasons of length, I’m doing a separate post on differentiation in business intelligence.

Cassandra and privacy requirements

Curt Monash — Thu, 15 Oct 2015 15:18:26 +0000

For starters:

I’ve suggested in the past that multi-data-center capabilities are important for “data sovereignty”/geo-compliance.
The need for geo-compliance just got a lot stronger, with the abolition of the European Union’s Safe Harbour rule for the US. If you collect data in multiple countries, you should be at least thinking about geo-compliance.
Cassandra is an established leader in multi-data-center operation.

But when I made that connection and checked in accordingly with my client Patrick McFadin at DataStax, I discovered that I’d been a little confused about how multi-data-center Cassandra works. The basic idea holds water, but the details are not quite what I was envisioning.

The story starts:

Cassandra groups nodes into logical “data centers” (i.e. token rings).
As a best practice, each physical data center can contain one or more logical data centers, but not vice-versa.
There are two levels of replication — within a single logical data center, and between logical data centers.
Replication within a single data center is planned in the usual way, with the principal data center holding a database likely to have a replication factor of 3.
However, copies of the database held elsewhere may have different replication factors …
… and can indeed have different replication factors for different parts of the database.

In particular, a remote replication factor for Cassandra can = 0. When that happens, then you have data sitting in one geographical location that is absent from another geographical location; i.e., you can be in compliance with laws forbidding the export of certain data. To be clear (and this contradicts what I previously believed and hence also implied in this blog):

General multi-data-center operation is not what gives you geo-compliance, because the default case is that the whole database is replicated to each data center.
Instead, you get that effect by tweaking your specific replication settings.
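As a rough sketch of what "tweaking your specific replication settings" might look like, the Python below builds a CQL `CREATE KEYSPACE` statement using Cassandra's `NetworkTopologyStrategy`, which takes a replication factor per logical data center. The keyspace and data center names are hypothetical, and the helper `create_keyspace_cql` is my own illustration, not anything DataStax ships. It simply leaves out any data center whose factor is 0, which is one way to make sure no replicas are placed there.

```python
def create_keyspace_cql(keyspace, replication_by_dc):
    """Build a CREATE KEYSPACE statement with per-data-center replication.

    Data centers with a replication factor of 0 are omitted, so no
    replicas of this keyspace are placed there.
    """
    options = ["'class': 'NetworkTopologyStrategy'"]
    for dc, factor in sorted(replication_by_dc.items()):
        if factor < 0:
            raise ValueError(f"negative replication factor for {dc}")
        if factor > 0:
            options.append(f"'{dc}': {factor}")
    if len(options) == 1:
        raise ValueError("at least one data center needs a factor > 0")
    return (
        f"CREATE KEYSPACE {keyspace} "
        f"WITH replication = {{{', '.join(options)}}};"
    )


# Hypothetical names: EU customer data stays in the EU data center only.
eu_only = create_keyspace_cql("eu_customers", {"dc_eu": 3, "dc_us": 0})
# Less sensitive data can still be replicated everywhere.
shared = create_keyspace_cql("product_catalog", {"dc_eu": 3, "dc_us": 3})
print(eu_only)
print(shared)
```

The point of the two keyspaces is that the choice is made per part of the database, not once for the whole cluster.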

The most visible DataStax client using this strategy is apparently ING Bank.

If you have a geo-compliance issue, you’re probably also concerned about security. After all, the whole reason the issue arises is that one country’s government might want to look at another country’s citizens’ or businesses’ data. The DataStax security story is approximately:

Encryption in flight, for any Cassandra.
Encryption at rest, specifically with DataStax Enterprise.
No cell-level or row-level security until Cassandra 3.0 is introduced and established. (I didn’t actually ask whether something similar to HBase coprocessors is coming for Cassandra, but that would be my first guess.)
Various roles and permissions stuff.

While flexible, Cassandra’s multi-data-center features do add some complexity. Tunable-consistency choices are baked into Cassandra programs at each point data is accessed, and more data centers make for more choices. (Default best practice = write if you get a local quorum, running the slight risk of logical data centers being out of sync with each other.)
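To make the local-quorum trade-off concrete, here is a minimal sketch. It assumes the usual majority formula, floor(RF / 2) + 1, and the function and data center names are hypothetical. It models the decision only and does not talk to a real cluster.

```python
def quorum(replication_factor):
    """Smallest majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def local_quorum_write_ok(acks_by_dc, rf_by_dc, local_dc):
    """True if enough replicas in the local data center acknowledged."""
    return acks_by_dc.get(local_dc, 0) >= quorum(rf_by_dc[local_dc])


rf_by_dc = {"dc_eu": 3, "dc_us": 3}   # hypothetical data center names
acks = {"dc_eu": 2, "dc_us": 0}       # remote replicas have not answered yet

print(quorum(3))                                       # 2
print(local_quorum_write_ok(acks, rf_by_dc, "dc_eu"))  # True
```

The write is accepted even though `dc_us` has acknowledged nothing, which is exactly the slight out-of-sync risk mentioned above.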

One way in which the whole thing does seem nice and simple is that you can have different logical data centers running on different kinds of platforms — cloud, colocation, in-house, whatever — without Cassandra caring.

I’m not going to call the DataStax Enterprise approach to geo-compliance the “gold standard”, because some of it seems pretty clunky or otherwise feature-light. On the other hand, I’m not aware of competitors who exceed it, in features or track record, so “silver standard” seems defensible.

Basho and Riak

Curt Monash — Thu, 15 Oct 2015 15:18:05 +0000

Basho was on my (very short) blacklist of companies with whom I refuse to speak, because they have lied about the contents of previous conversations. But Tony Falco et al. are long gone from the company. So when Basho’s new management team reached out, I took the meeting.

For starters:

Basho management turned over significantly 1-2 years ago. The main survivors from the old team are 1 each in engineering, sales, and services.
Basho moved its headquarters to Bellevue, WA. (You get one guess as to where the new CEO lives.) Engineering operations are very distributed geographically.
Basho claims that it is much better at timely product shipments than it used to be. Its newest product has a planned (or at least hoped-for) 8-week cadence for point releases.
Basho’s revenue is ~90% subscription.
Basho claims >200 enterprise clients, vs. 100-120 when new management came in. Unfortunately, I forgot to ask the usual questions about divisions vs. whole organizations, OEM sell-through vs. direct, etc.
Basho claims an average contract value of >$100K, typically over 2-3 years. $9 million of that (which would be close to half the total, actually), comes from 2 particular deals of >$4 million each.

Basho’s product line has gotten a bit confusing, but as best I understand things the story is:

There’s something called Riak Core, which isn’t even a revenue-generating product. However, it’s an open source project with some big users (e.g. Goldman Sachs, Visa), and included in pretty much everything else Basho promotes.
Riak KV is the key-value store previously known as Riak. It generates the lion’s share of Basho’s revenue.
Riak S2 is an emulation of Amazon S3. Basho thinks that Riak KV loses efficiency when objects get bigger than 1 MB or so, and that’s when you might want to use Riak S2 in addition or instead.
Riak TS is for time series, and just coming out now.
Also in the mix are some (extra charge) connectors for Redis and Spark. Presumably, there are more of these to come.
There’s an umbrella marketing term of “Basho Data Platform”.

Technical notes on some of that include:

Riak Core doesn’t do data management. It just manages distributed operation of — well, whatever you want to operate. In part, Basho sees Riak Core as a better Apache ZooKeeper.
- That is the essence of the Riak/Spark pitch — something better than ZooKeeper for cluster management, and I presume some help in persisting Spark RDDs as well.
- The Riak/Redis pitch is even simpler — cluster management for Redis, and persistent backing as well.
- Basho’s criticisms of ZooKeeper start with “Cluster manager, manage thyself” claims about ZooKeeper availability, as in the PagerDuty ZooKeeper critique.
Riak KV has secondary indexing. Performance is somewhat questionable. It also has Solr indexing, which is fast.
At least in its 1.0 form, Riak TS assumes:
- There’s some kind of schema or record structure.
- There are explicit or else easily-inferred timestamps.
- Microsecond accuracy, perfect ordering and so on are not essential.
Thus, Riak TS 1.0 is not ideal for the classic Splunk use case where you text index/search on a lot of log emissions. It also is not ideal for financial tick storage.
Riak TS has range-based partitioning, where the range is in terms of time. Basho refers to this as “locality”.
Riak TS has a SQL subset. Evidently there’s decent flexibility as to which part of the database carries which schema.
Riak has a nice feature of allowing you to stage a change to network topology before you push it live.
Riak’s vector clock approach to wide-area synchronization is more controversial.
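For readers who haven't met vector clocks, the sketch below shows the generic textbook idea: each version carries a per-node counter, and two versions are either ordered or concurrent. This is not Riak's actual implementation, and the node names are made up.

```python
def increment(clock, node):
    """Return a copy of the clock with this node's counter advanced."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a, b):
    """True if clock a has seen everything clock b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    """Classify two versions as equal, ordered, or concurrent."""
    if a == b:
        return "equal"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    return "concurrent"  # neither side has seen the other's update


base = increment({}, "node_a")     # first write, handled by node_a
east = increment(base, "node_a")   # later update via node_a
west = increment(base, "node_b")   # independent update via node_b

print(compare(east, base))  # a is newer
print(compare(east, west))  # concurrent
```

The "concurrent" case is where the controversy lives, because something (the database or the application) then has to decide how to reconcile the two versions.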

Finally, notes on what Basho sees as use cases and competition include:

Riak KV is generally used to store usual-suspect stuff — log data, user/profile data and so on.
Basho thinks NoSQL is a 4-horse race — Basho/Riak KV, DataStax/Cassandra, MongoDB, Couchbase. (I would be surprised if there was much agreement with that view from, for example, MongoDB, DataStax, Aerospike, MapR or the HBase community.)
Basho competes on availability, scalability (including across geography) and so on, or in simplest terms:
- “Availability and correctness”
- Simple operation
Unsurprisingly, Basho thinks its closest competitor is DataStax. (However, DataStax tells me they don’t compete much with Basho.)
Basho thinks Riak KV has ease-of-operation advantages vs. Cassandra.
An example of a mission-critical Riak app is the UK National Health Service storing prescription information.
An example of Riak S2 and Riak KV being used together is Turner Broadcasting storing video in the former and associated metadata in the latter.
Riak TS is initially pointed at two use cases:
- “Internet of Things”
- “Metrics”, which seems to mean monitoring of system metrics.
Basho sees the competition for Riak TS as starting with HBase, Cassandra, and InfluxDB.

Couchbase 4.0 and related subjects

Curt Monash — Thu, 15 Oct 2015 15:17:44 +0000

I last wrote about Couchbase in November, 2012, around the time of Couchbase 2.0. One of the many new features I mentioned then was secondary indexing. Ravi Mayuram just checked in to tell me about Couchbase 4.0. One of the important new features he mentioned was what I think he said was Couchbase’s “first version” of secondary indexing. Obviously, I’m confused.

Now that you’re duly warned, let me remind you of aspects of the Couchbase timeline.

2 corporate name changes ago, Couchbase was organized to commercialize memcached. memcached, of course, was internet companies’ default way to scale out short-request processing before the rise of NoSQL, typically backed by manually sharded MySQL.
Couchbase’s original value proposition, under the name Membase, was to provide persistence and of course support for memcached. This later grew into a caching-oriented pitch even to customers who weren’t already memcached users.
A merger with the makers of CouchDB ensued, with the intention of replacing Membase’s SQLite back end with CouchDB at the same time as JSON support was introduced. This went badly.
By now, however, Couchbase sells for more than distributed cache use cases. Ravi rattled off a variety of big-name customer examples for system-of-record kinds of use cases, especially in session logging (duh) and also in travel reservations.
Couchbase 4.0 has been in beta for a few months.

Technical notes on Couchbase 4.0 — and related riffs — start:

There’s a new SQL-like language called N1QL (pronounced like “nickel”). I’m hearing a lot about SQL-on-NoSQL these days. More on that below.
“Index”, “data” and “query” are three different services/tiers.
- You can run them all on the same nodes or separately. Couchbase doesn’t have enough experience yet with the technology to know which choice will wind up as a best practice.
- I’m hearing a lot about heterogeneous-node/multi-tier DBMS architectures these days, and would no longer stand by my 2009 statement that they are unusual. Other examples include Oracle Exadata, MySQL, MongoDB (now that it has pluggable storage engines), MarkLogic, and of course the whole worlds of Hadoop and Spark.
To be clear — the secondary indexes are global, and not tied to the same nodes as the data they index.
There’s a new back end called ForestDB, but if I understood correctly, it’s used just for the indexes, not for the underlying data.
ForestDB represents Couchbase indexes in something that resembles b-trees, but also relies on tries. Indeed, if I’m reading the relevant poster correctly, it’s based on a trie of b-trees.
In another increasingly common trend, Couchbase uses Bloom filters to help decide which partitions to retrieve for any particular query.
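The Bloom filter idea in the last point can be sketched as follows. This is a generic illustration, not Couchbase's code: the class, the sizes, and the partition names are all assumptions. Each partition keeps a small bit array, and a query skips any partition whose filter says the key is definitely absent.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive several bit positions per key from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def partitions_to_read(filters, key):
    """Return only the partitions whose filter cannot rule the key out."""
    return [name for name, bf in filters.items() if bf.might_contain(key)]


filters = {"p0": BloomFilter(), "p1": BloomFilter()}
filters["p0"].add("user:1")
filters["p0"].add("user:2")
filters["p1"].add("user:3")

print(partitions_to_read(filters, "user:3"))
```

A Bloom filter never produces false negatives, so the partition that really holds the key is always in the result. It can occasionally include an extra partition, which costs a wasted read but not a wrong answer.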

Up to a point, SQL-on-NoSQL stories can be fairly straightforward.

You define some kind of a table,* perhaps in a SQL-like DDL (Data Description Language).
SELECT, FROM and WHERE clauses work in the usual way.
Hopefully, if a column is going to have a lot of WHERE clauses on it, it also has an index.

For example, I think that’s the idea behind most ODBC/JDBC drivers for NoSQL systems. I think it’s also the idea behind most “SQL-like” languages that NoSQL vendors ship.

*Nobody I talk to about this ever wants to call it a “view”, but it sure sounds like a view to me — not a materialized view, of course, but a view nonetheless.

JOIN syntax can actually be straightforward as well under these assumptions. As for JOIN execution, Couchbase pulls all the data into the relevant tier, and does nested loop execution there. My new clients at SequoiaDB have a similar strategy, by the way, although in their case there’s a hash join option as well.
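The difference between the two join strategies can be shown in a few lines. This is a textbook sketch over in-memory lists of documents, with made-up data, and says nothing about how either vendor actually implements it.

```python
def nested_loop_join(left, right, key):
    """Compare every left row with every right row: O(len(left) * len(right))."""
    return [
        {**l, **r}
        for l in left
        for r in right
        if l[key] == r[key]
    ]


def hash_join(left, right, key):
    """Build a hash table on one side, then probe it with the other."""
    table = {}
    for r in right:
        table.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in table.get(l[key], [])]


# Hypothetical documents already pulled into the query tier.
orders = [
    {"customer_id": 1, "sku": "a1"},
    {"customer_id": 2, "sku": "b2"},
    {"customer_id": 1, "sku": "c3"},
]
customers = [
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Bob"},
]

print(nested_loop_join(orders, customers, "customer_id"))
print(hash_join(orders, customers, "customer_id") ==
      nested_loop_join(orders, customers, "customer_id"))  # True
```

Both return the same rows. The hash join just avoids rescanning the right-hand side for every left-hand row, which matters once the inputs get large.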

But if things stopped there, they would miss an important complication: NoSQL has nested data. I.e., a value can actually be an array, whose entries are arrays themselves, and so on. That said, the “turtles all the way down” joke doesn’t quite apply, because at some point there are actual scalar or string values, and those are the ones SQL wants to actually operate on.

Most approaches I know of to that problem boil down to identifying particular fields as table columns, with or without aliases/renaming; I think that’s the old Hadapt/Vertica strategy, for example. Couchbase claims to be doing something a little different however, with a SQL-extending operator called UNNEST. Truth be told, I’m finding the N1QL language reference a bit terse, and haven’t figured out what the practical differences vs. the usual approach are, if any. But it sounds like there may be some interesting ideas in there somewhere.
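Since I haven't pinned down N1QL's exact semantics, the following is only a conceptual sketch of what "unnesting" generally means: turning one document with an embedded array into one flat row per array element, so that ordinary WHERE-style filtering can operate on the scalar values. The function name, the alias argument, and the choice to emit no rows for an empty array are all my assumptions.

```python
def unnest(docs, array_field, alias):
    """Yield one flat row per element of each document's array field."""
    for doc in docs:
        parent = {k: v for k, v in doc.items() if k != array_field}
        for element in doc.get(array_field, []):
            yield {**parent, alias: element}


docs = [
    {"name": "Ann", "orders": [{"sku": "a1", "qty": 2}, {"sku": "b2", "qty": 1}]},
    {"name": "Bob", "orders": []},
]

rows = list(unnest(docs, "orders", "order"))
big_orders = [r for r in rows if r["order"]["qty"] > 1]

print(len(rows))    # 2
print(big_orders)   # [{'name': 'Ann', 'order': {'sku': 'a1', 'qty': 2}}]
```

Arrays nested inside arrays would need the same step applied again, one level at a time.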

Notes on HBase

Curt Monash — Tue, 10 Mar 2015 18:24:40 +0000

I talked with a couple of Cloudera folks about HBase last week. Let me frame things by saying:

The closest thing to an HBase company, à la MongoDB/MongoDB or DataStax/Cassandra, is Cloudera.
Cloudera still uses a figure of 20% of its customers being HBase-centric.
HBaseCon and so on notwithstanding, that figure isn’t really reflected in Cloudera’s marketing efforts. Cloudera’s marketing commitment to HBase has never risen to nearly the level of MongoDB’s or DataStax’s push behind their respective core products.
With Cloudera’s move to “zero/one/many” pricing, Cloudera salespeople have little incentive to push HBase hard to accounts other than HBase-first buyers.

Also:

Cloudera no longer dominates HBase development, if it ever did.
- Cloudera is the single biggest contributor to HBase, by its count, but doesn’t make a majority of the contributions on its own.
- Cloudera sees Hortonworks as having become a strong HBase contributor.
- Intel is also a strong contributor, as are end user organizations such as Chinese telcos. Not coincidentally, Intel was a major Hadoop provider in China before the Intel/Cloudera deal.
As far as Cloudera is concerned, HBase is just one data storage technology of several, focused on high-volume, high-concurrency, low-latency short-request processing. Cloudera thinks this is OK because of HBase’s strong integration with the rest of the Hadoop stack.
Others who may be inclined to disagree are in several cases doing projects on top of HBase to extend its reach. (In particular, please see the discussion below about Apache Phoenix and Trafodion, both of which want to offer relational-like functionality.)

Cloudera’s views on HBase history — in response to the priorities I brought to the conversation — include:

HBase initially favored consistency over performance/availability, while Cassandra initially favored the opposite choice. Both products, however, have subsequently become more tunable in those tradeoffs.
Cloudera’s initial contributions to HBase focused on replication, disaster recovery and so on. I guess that could be summarized as “scaling”.
Hortonworks’ early HBase contributions included (but were not necessarily limited to):
- Making recovery much faster (10s of seconds or less, rather than minutes or more).
- Some of that consistency vs. availability tuning.
“Coprocessors” were added to HBase ~3 years ago, to add extensibility, with the first use being in security/permissions.
With more typical marketing-oriented version numbers:
- HBase .90, the first release that did a good job on durability, could have been 1.0.
- HBase .92 and .94, which introduced coprocessors, could have been Version 2.
- HBase .96 and .98 could have been Version 3.
- The recent HBase 1.0 could have been 4.0.

The HBase roadmap includes:

A kind of BLOB/CLOB (Binary/Character Large OBject) support.
- Intel is heavily involved in this feature.
- The initial limit is 10 megabytes or so, due to some limitations in the API (I didn’t ask why that made sense). This happens to be all the motivating Chinese customer needs for the traffic photographs it wants to store.
Various kinds of “multi-tenancy” support (multi-tenancy is one of those terms whose meaning is getting stretched beyond recognition), including:
- Mixed workload support (short-request and analytic) on the same nodes.
- Mixed workload support on different nodes in the same cluster.
- Security between different apps in the same cluster.
(Still in the design phase) Bottleneck Whack-A-Mole, with goals including but not limited to:
- Scale-out beyond the current assumed limit of ~1200 nodes.
- More predictable performance, based on smaller partition sizes.
(Possibly) Multi-data-center fail-over.

Not on the HBase roadmap per se are global/secondary indexes. Rather, we talked about projects on top of HBase which are meant to provide those. One is Apache Phoenix, which supposedly:

Makes it simple to manage compound keys. (E.g., City/State/ZipCode)
Provides global secondary indexes (but not in a fully ACID way).
Offers some very basic JOIN support.
Provides a JDBC interface.
Offers efficiencies in storage utilization, scan optimizations, and aggregate calculations.
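To illustrate why compound keys matter in a sorted key-value store such as HBase, here is a minimal sketch of packing several fields into one row key so that a scan on a leading prefix finds related rows together. This is not Phoenix's actual encoding. The separator byte, the field order (state, city, ZIP), and both helper functions are assumptions for illustration.

```python
import bisect

SEP = b"\x00"  # assumed separator; must not occur inside a field value


def row_key(*parts):
    """Join string fields into one byte key that sorts field by field."""
    encoded = [p.encode("utf-8") for p in parts]
    if any(SEP in e for e in encoded):
        raise ValueError("field values must not contain the separator")
    return SEP.join(encoded)


def prefix_scan(sorted_keys, *prefix_parts):
    """Return all keys whose leading fields equal prefix_parts."""
    prefix = row_key(*prefix_parts) + SEP
    start = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so no later key can match
        matches.append(key)
    return matches


keys = sorted([
    row_key("MA", "Boston", "02108"),
    row_key("NY", "Albany", "12207"),
    row_key("MA", "Acton", "01720"),
    row_key("MA", "Boston", "02109"),
])

print(len(prefix_scan(keys, "MA", "Boston")))  # 2
print(len(prefix_scan(keys, "MA")))            # 3
```

Doing this by hand is fiddly and easy to get wrong, which is presumably why a layer that manages it for you is attractive.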

Another such project is Trafodion — supposedly the Welsh word for “transaction” — open sourced by HP. This seems to be based on NonStop SQL and Neoview code, which counter-intuitively have always been joined at the hip.

There was a lot more to the conversation, but I’ll stop here for two reasons:

This post is pretty long already.
I’m reserving some of the discussion until after I’ve chatted with vendors of other NoSQL systems.

Related link

My July 2011 post on HBase offers context, as do the comments on it.

Quick update on Tachyon

Curt Monash — Wed, 04 Mar 2015 18:03:53 +0000

I’m on record as believing that:

Hadoop needs a memory-centric storage grid.
Tachyon is a strong candidate to fill the role.

That said:

It’s an open secret that there will be a Tachyon company. However, …
… no details have been publicized. Indeed, the open secret itself is still officially secret.
Tachyon technology, which just hit 0.6 a couple of days ago, still lacks many features I regard as essential.
As a practical matter, most Tachyon interest to date has been associated with Spark. This makes perfect sense given Tachyon’s origin and initial technical focus.
Tachyon was in 50 or more sites last year. Most of these sites were probably just experimenting with it. However …
… there are production Tachyon clusters with >100 nodes.

As a reminder of Tachyon basics:

You do I/O with Tachyon in memory.
Tachyon data can optionally be persisted.
- That “tiered storage” capability — including SSDs — was just introduced in 0.6. So in particular …
- … it’s very primitive and limited at the moment.
- I’ve heard it said that Intel was a big contributor to tiered storage/SSD support. (Solid-State Drives.)
Tachyon has some ability to understand “lineage” in the Spark sense of the term. (In essence, that amounts to knowing what operations created a set of data, and potentially replaying them.)
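The lineage idea can be sketched in a few lines. This is a toy model, not Tachyon's or Spark's API: the `Dataset` class and its methods are hypothetical. Each dataset remembers the function and the parent datasets that produced it, so a lost in-memory copy can be rebuilt by replaying those operations.

```python
class Dataset:
    def __init__(self, compute, parents=()):
        self.compute = compute    # function that rebuilds the data from parents
        self.parents = parents    # the lineage: which datasets this one came from
        self.cached = None        # in-memory copy; may be lost at any time

    def get(self):
        """Return the data, replaying the lineage if the copy is gone."""
        if self.cached is None:
            inputs = [parent.get() for parent in self.parents]
            self.cached = self.compute(*inputs)
        return self.cached

    def evict(self):
        """Simulate losing the in-memory copy (e.g., a node failure)."""
        self.cached = None


raw = Dataset(lambda: [1, 2, 3, 4])
squares = Dataset(lambda xs: [x * x for x in xs], parents=(raw,))
total = Dataset(lambda xs: sum(xs), parents=(squares,))

print(total.get())   # 30
squares.evict()
total.evict()
print(total.get())   # 30, recomputed from raw via the recorded operations
```

Only the evicted datasets are recomputed. `raw` still has its copy, so the replay starts from there.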

Beyond that, my impressions are:

Synchronous write-through from Tachyon to persistent storage is extremely primitive right now — but even so I am told it is being used in production by multiple companies already.
Asynchronous write-through, relying on lineage tracking to recreate any data that gets lost, is slightly further along.
One benefit of adding Tachyon to your Spark installation is a reduction in garbage collection issues.

And with that I have little more to say than my bottom lines:

If you’re writing your own caching layer for some project you should seriously consider adapting Tachyon instead.
If you’re using Spark you should seriously consider using Tachyon as well.
I think Tachyon will be a big deal, but it’s far too early to be sure.

Actian Vector Hadoop Edition

Curt Monash — Thu, 07 Aug 2014 11:12:35 +0000

I have a small blacklist of companies I won’t talk with because of their particularly unethical past behavior. Actian is one such; they evidently made stuff up about me that Josh Berkus gullibly posted for them, and I don’t want to have conversations that could be dishonestly used against me.

That said, Peter Boncz isn’t exactly an Actian employee. Rather, he’s the professor who supervised Marcin Zukowski’s PhD thesis that became Vectorwise, and I chatted with Peter by Skype while he was at home in Amsterdam. I believe his assurances that no Actian personnel sat in on the call.

In other news, Peter is currently working on and optimistic about HyPer. But we literally spent less than a minute talking about that.

Before I get to the substance, there’s been a lot of renaming at Actian. To quote Andrew Brust,

… the ParAccel, Pervasive and Vectorwise technologies are being unified under the Actian Analytics Platform brand. Specifically, the ParAccel technology … is being re-branded Actian Matrix; Pervasive’s technologies are rechristened Actian DataFlow and Actian DataConnect; and Vectorwise becomes Actian Vector.

and

Actian … is now “one company, with one voice and one platform” according to its John Santaferraro

The bolded part of the latter quote is untrue — at least in the ordinary sense of the word “one” — but the rest can presumably be taken as company gospel.

All this is by way of preamble to saying that Peter reached out to me about Actian’s new Vector Hadoop Edition when he blogged about it last June, and we finally talked this week. Highlights include:

Vectorwise, while being proudly multi-core, was previously single-server. The new Vector Hadoop Edition is the first version with node parallelism.
Actian’s Vector Hadoop edition uses HDFS (Hadoop Distributed File System) and YARN to manage an Actian-proprietary file format. There is currently no interoperability whereby Hadoop jobs can read these files. However …
… Actian’s Vector Hadoop edition relies on Hadoop for cluster management, workload management and so on.
Peter thinks there are two paying customers, both too recent to be in production, who between them paid what I’d call a remarkable amount of money.*
Roadmap futures* include:
- Being able to update and indeed trickle-update data. Peter is very proud of Vectorwise’s Positional Delta Tree updating.
- Some elasticity they’re proud of, both in terms of nodes (generally limited to the replication factor of 3) and cores (not so limited).
- Better interoperability with Hadoop.

Actian actually bundles Vector Hadoop Edition with DataFlow — the old Pervasive DataRush — into what it calls “Actian Analytics Platform – Hadoop SQL Edition”. DataFlow/DataRush has been working over Hadoop since the latter part of 2012, based on a visit with my then clients at Pervasive that December.

*Peter gave me details about revenue, pipeline, roadmap timetables etc. that I’m redacting in case Actian wouldn’t like them shared. I should say that the timetable for some — not all — of the roadmap items was quite near-term; however, pay no attention to any phrasing in Peter’s blog post that suggests the roadmap features are already shipping.

The Actian Vector Hadoop Edition optimizer and query-planning story goes something like this:

Vectorwise started with the open-source Ingres optimizer. After a query is optimized, it is rewritten to reflect Vectorwise’s columnar architecture. Peter notes that these rewrites rarely change operator ordering; they just add column-specific optimizations, whatever that means.
Now there are rewrites for parallelism as well.
These rewrites all seem to be heuristic/rule-based rather than cost-based.
Once Vectorwise became part of the Ingres company (later renamed to Actian), they had help from Ingres engineers, who helped them modify the base optimizer so that it wasn’t just the “stock” Ingres one.

As with most modern MPP (Massively Parallel Processing) analytic RDBMS, there doesn’t seem to be any concept of a head-node to which intermediate results need to be shipped. This is good, because head nodes in early MPP analytic RDBMS were dreadful bottlenecks.

Peter and I also talked a bit about SQL-oriented HDFS file formats, such as Parquet and ORC. He doesn’t like their lack of support for columnar compression. Further, in Parquet there seems to be a requirement to read the whole file, to an extent that interferes with Vectorwise’s form of data skipping, which it calls “min-max indexing”.
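For anyone unfamiliar with this form of data skipping, here is a generic sketch of a min-max index: keep the minimum and maximum value for each block of a column, and skip any block whose range cannot overlap the query predicate. The block size and function names are assumptions, and this is not Vectorwise's implementation.

```python
def build_min_max_index(values, block_size):
    """Record (start offset, min, max) for each block of the column."""
    index = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        index.append((start, min(block), max(block)))
    return index


def range_query(values, index, block_size, low, high):
    """Return matching values and how many blocks actually had to be read."""
    matches, blocks_read = [], 0
    for start, block_min, block_max in index:
        if block_max < low or block_min > high:
            continue  # the whole block is outside the range: skip it
        blocks_read += 1
        block = values[start:start + block_size]
        matches.extend(v for v in block if low <= v <= high)
    return matches, blocks_read


# A column that is roughly ordered, like timestamps in load order.
column = list(range(100))
index = build_min_max_index(column, block_size=10)

matches, blocks_read = range_query(column, index, 10, low=42, high=47)
print(matches)       # [42, 43, 44, 45, 46, 47]
print(blocks_read)   # 1 of 10 blocks
```

The technique only pays off if blocks can be read independently, which is why a format that forces you to read the whole file gets in the way.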

Frankly, I don’t think the architectural choice “uses Hadoop for workload management and administration” provides a lot of customer benefit in this case. Given that, I don’t know that the world needs another immature MPP analytic RDBMS. I also note with concern that Actian has two different MPP analytic RDBMS products. Still, Vectorwise and indeed all the stuff that comes out of Martin Kersten and Peter’s group in Amsterdam has always been interesting technology. So the Actian Vector Hadoop Edition might be worth taking a look at before you redirect your attention to products with more convincing track records and futures.

Optimism, pessimism, and fatalism — fault-tolerance, Part 2

Curt Monash — Sun, 08 Jun 2014 16:58:35 +0000

The pessimist thinks the glass is half-empty.
The optimist thinks the glass is half-full.
The engineer thinks the glass was poorly designed.

Most of what I wrote in Part 1 of this post was already true 15 years ago. But much gets added in the modern era, considering that:

Clusters will have node hiccups more often than single nodes will. (Duh.)
Networks are relatively slow even when uncongested, and furthermore congest unpredictably.
In many applications, it’s OK to sacrifice even basic-seeming database functionality.

And so there’s been innovation in numerous cluster-related subjects, two of which are:

Distributed query and update. When a database is distributed among many nodes, how does a request access multiple nodes at once?
Fault-tolerance in long-running jobs. When a job is expected to run on many nodes for a long time, how can it deal with failures or slowdowns, other than through the distressing alternatives:
- Start over from the beginning?
- Keep (a lot of) the whole cluster’s resources tied up, waiting for things to be set right?

Distributed database consistency

When a distributed database lives up to the same consistency standards as a single-node one, distributed query is straightforward. Performance may be an issue, however, which is why we have seen a lot of:

Analytic RDBMS innovation.
Short-request applications designed to avoid distributed joins.
Short-request clustered RDBMS that don’t allow fully-general distributed joins in the first place.

But in workloads with low-latency writes, living up to those standards is hard. The 1980s approach to distributed writing was two-phase commit (2PC), which may be summarized as:

A write is planned and parceled out to occur on all the different nodes where the data needs to be placed.
Each node decides it’s ready to commit the write.
Each node informs the others of its readiness.
Each node actually commits.
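The steps above can be sketched as a toy, single-process simulation. Note the assumptions: this is the common coordinator-based variant, in which a coordinator collects the readiness votes instead of the nodes informing each other directly, and the `Participant` class and its methods are hypothetical. In a real system every call below would be a network message.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit   # simulates a node that is (un)able to write
        self.staged = None
        self.committed = {}

    def prepare(self, key, value):
        """Phase 1: stage the write and vote on whether it can be committed."""
        if self.can_commit:
            self.staged = (key, value)
        return self.can_commit

    def commit(self):
        """Phase 2: make the staged write permanent."""
        key, value = self.staged
        self.committed[key] = value
        self.staged = None

    def abort(self):
        self.staged = None


def two_phase_commit(participants, key, value):
    # Phase 1: every node must report that it is ready.
    votes = [p.prepare(key, value) for p in participants]
    # Phase 2: commit everywhere only if all votes were yes.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False


healthy = [Participant("a"), Participant("b"), Participant("c")]
print(two_phase_commit(healthy, "balance", 100))   # True

flaky = [Participant("a"), Participant("b", can_commit=False)]
print(two_phase_commit(flaky, "balance", 100))     # False
print(flaky[0].committed)                          # {}
```

Nothing is committed until every vote is in. In this simulation that is instantaneous, but over a network a single slow vote holds up the entire write.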

Unfortunately, if any of the various messages in the 2PC process is delayed, so is the write. This creates way too much likelihood of work being blocked. And so modern approaches to distributed data writing are more … well, if I may repurpose the famous Facebook slogan, they tend to be along the lines of “Move fast and break things”,* with varying tradeoffs among consistency, other accuracy, reliability, functionality, manageability, and performance.

*By the way, Facebook recently renounced that motto, in favor of “Move fast with stable infrastructure.” Hmm …

Back in 2010, I wrote about various approaches to consistency, with the punch line being:

A conventional relational DBMS will almost always feature RYW consistency. Some NoSQL systems feature tunable consistency, in which — depending on your settings — RYW consistency may or may not be assured.

The core ideas of RYW consistency, as implemented in various NoSQL systems, are:

Let N = the number of copies of each record distributed across nodes of a parallel system.

Let W = the number of nodes that must successfully acknowledge a write for it to be successfully committed. By definition, W <= N.

Let R = the number of nodes that must send back the same value of a unit of data for it to be accepted as read by the system. By definition, R <= N.

The greater N-R and N-W are, the more node or network failures you can typically tolerate without blocking work.

As long as R + W > N, you are assured of RYW consistency.

That bolded part is the key point, and I suggest that you stop and convince yourself of it before reading further.
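If you'd rather let a computer do the convincing, the brute-force check below enumerates every possible set of W written nodes and R read nodes for small clusters, and confirms that the two sets always share at least one node exactly when R + W > N. The function name is mine; the logic is just the pigeonhole argument made mechanical.

```python
from itertools import combinations


def always_overlap(n, r, w):
    """True if every choice of r read nodes intersects every choice of w write nodes."""
    nodes = range(n)
    return all(
        set(read_set) & set(write_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


print(always_overlap(3, 2, 2))  # True:  R + W = 4 > 3
print(always_overlap(3, 1, 1))  # False: R + W = 2 <= 3
print(always_overlap(3, 1, 3))  # True:  write everywhere, read anywhere

# Exhaustive check for all small clusters.
for n in range(1, 6):
    for r in range(1, n + 1):
        for w in range(1, n + 1):
            assert always_overlap(n, r, w) == (r + w > n)
```

The overlapping node is the one that has both acknowledged the latest write and been consulted by the read, which is why the read is guaranteed to see your write.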

Eventually :), Dan Abadi claimed that the key distinction is synchronous/asynchronous — is anything blocked while waiting for acknowledgements? From many people, that would simply be an argument for optimistic locking, in which all writes go through, and conflicts — of the sort that locks are designed to prevent — cause them to be rolled back after-the-fact. But Dan isn’t most people, so I’m not sure — especially since the first time I met Dan was to discuss VoltDB predecessor H-Store, which favors application designs that avoid distributed transactions in the first place.

One idea that’s recently gained popularity is a kind of semi-synchronicity. Writes are acknowledged as soon as they arrive at a remote node (that’s the synchronous part). Each node then updates local permanent storage on its own, with no further confirmation. I first heard about this in the context of replication, and generally it seems designed for replication-oriented scenarios.
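A toy Python sketch of that semi-synchronous pattern follows. It assumes the acknowledgement means only "the record arrived in the remote node's memory"; the Replica class and its method names are invented for illustration, and a real system would flush in the background rather than on demand.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.received = []  # acknowledged, but so far only in memory
        self.durable = []   # stand-in for local permanent storage

    def receive(self, record) -> bool:
        """The synchronous part: acknowledge as soon as the record arrives."""
        self.received.append(record)
        return True

    def flush(self) -> int:
        """The asynchronous part: persist locally, with no further
        confirmation sent back to the writer. Returns records flushed."""
        count = len(self.received)
        self.durable.extend(self.received)
        self.received.clear()
        return count


def replicate(record, replicas) -> bool:
    """The write counts as done once every replica has acknowledged receipt."""
    return all(replica.receive(record) for replica in replicas)


if __name__ == "__main__":
    nodes = [Replica("r1"), Replica("r2")]
    print(replicate({"id": 1}, nodes))   # acknowledged ...
    print([n.durable for n in nodes])    # ... but nothing is durable yet
    for n in nodes:
        n.flush()
    print([n.durable for n in nodes])
```

The window between receive and flush is where the tradeoff lives: a node that dies in that window has acknowledged a write it never stored.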

Single-job fault-tolerance

Finally, let’s consider fault-tolerance within a single long-running job, whether that’s a big query or some other kind of analytic task. In most systems, if there’s a failure partway through a job, they just say “Oops!” and start it over again. And in non-extreme cases, that strategy is often good enough.

Still, there are a lot of extreme workloads these days, so it’s nice to absorb a partial failure without entirely starting over.

Hadoop MapReduce, which stores intermediate results anyway, finds it easy to replay just the parts of the job that went awry.
Spark, which is more flexible in execution graph and data structures alike, has a similar capability.

Additionally, both Hadoop and Spark support speculative execution, in which several clones of a processing step are executed at once (presumably on different nodes), to hedge against the risk that any one copy of the process runs slowly or fails outright. According to my notes, speculative execution is a major part of NuoDB’s architecture as well.
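The idea of speculative execution can be sketched on a single machine with Python's standard concurrent.futures module. This is only an illustration, with threads standing in for nodes and a made-up simulated_step function standing in for the processing step; it is not how Hadoop, Spark or NuoDB schedule their work.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def simulated_step(node, delay, fail):
    """Hypothetical processing step: sleeps, then succeeds or fails."""
    time.sleep(delay)
    if fail:
        raise RuntimeError(f"{node} failed")
    return f"result from {node}"


def run_speculatively(task, clone_args):
    """Start one clone of `task` per entry in `clone_args` and return the
    first successful result. If every clone fails, re-raise the last error."""
    if not clone_args:
        raise ValueError("need at least one clone")
    last_error = None
    with ThreadPoolExecutor(max_workers=len(clone_args)) as pool:
        futures = [pool.submit(task, *args) for args in clone_args]
        for future in as_completed(futures):
            try:
                result = future.result()
            except Exception as exc:
                last_error = exc  # this clone failed; wait for another
                continue
            for other in futures:
                other.cancel()  # only stops clones that have not started yet
            # Leaving the `with` block still waits for clones that are
            # already running; a real scheduler would kill them instead.
            return result
    raise last_error


if __name__ == "__main__":
    clones = [("slow-node", 0.3, False), ("fast-node", 0.05, False), ("bad-node", 0.0, True)]
    print(run_speculatively(simulated_step, clones))
```

The slow clone and the failed clone cost some wasted work, which is the price of not caring which one they turn out to be.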

Further topics

I’ve rambled on for two long posts, which seems like plenty — but this survey is in no way complete. Other subjects I could have covered include but are hardly limited to:

Occasionally-connected operation, which for example is a design point of CouchDB, SQL Anywhere (sort of), and most kinds of mobile business intelligence.
Avoiding planned downtime — i.e., operating despite self-inflicted wounds.
Data cleaning and master data management, both of which exist in large part to fix errors people have made in the past.

Related links

Uninterrupted DBMS operation (September, 2012)
The cardinal rules of DBMS development (March, 2013)
Bottleneck Whack-A-Mole (August, 2009)

MemSQL update

Curt Monash — Fri, 02 May 2014 03:40:39 +0000

I stopped by MemSQL last week, and got a range of new or clarified information. For starters:

Even though MemSQL (the product) was originally designed for OLTP (OnLine Transaction Processing), MemSQL (the company) is now focused on analytic use cases …
… which was the point of introducing MemSQL’s flash-based columnar option.
One MemSQL customer has a 100 TB “data warehouse” installation on Amazon.
Another has “dozens” of terabytes of data spread across 500 machines, which aggregate 36 TB of RAM.
At customer Shutterstock, 1000s of non-MemSQL nodes are monitored by 4 MemSQL machines.
A couple of MemSQL’s top references are also Vertica flagship customers; one of course is Zynga.
MemSQL reports encountering Clustrix and VoltDB in a few competitive situations, but not NuoDB. MemSQL believes that VoltDB is still hampered by its traditional issues — Java, reliance on stored procedures, etc.

On the more technical side:

Some MemSQL users are running 7- or 8-way joins and other long-ish SQL statements.
But MemSQL doesn’t yet have fully peer-to-peer data redistribution.
- MemSQL “leaves” only talk to MemSQL “aggregator nodes,” not each other …
- … but note the plural on “aggregator nodes”, which should immunize MemSQL from the worst of “fat head” bottlenecks.
- Of course, you can sometimes get join locality by sharding multiple tables on the same key …
- … or by broadcast-replicating tables that are sufficiently small.
Better SQL coverage — e.g. SQL Windowing — is coming soon.
MemSQL believes it has an aggressive data skipping story.
MemSQL doesn’t yet have a true workload management story; they’re still at the stage “Our queries run so fast not many of them have to be active at once, and if things nevertheless get too busy we have some throttling capabilities.” But MemSQL at least sounds aware of the difference between that and true workload management, which puts them ahead of some other vendors I talk with.
MemSQL doesn’t have stored procedures. In particular, since MemSQL (the product) generates code on the fly, MemSQL (the company) doesn’t think the performance benefits of stored procedure pre-compilation are needed.
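To illustrate the two join-locality tricks mentioned above (sharding multiple tables on the same key, and broadcast-replicating small tables), here is a rough Python sketch. Leaves are just dictionary entries, rows are dicts, and all function names are invented; it says nothing about how MemSQL actually implements either technique.

```python
from collections import defaultdict


def shard(rows, key, num_leaves):
    """Assign each row to a leaf by hashing its shard key."""
    leaves = defaultdict(list)
    for row in rows:
        leaves[hash(row[key]) % num_leaves].append(row)
    return leaves


def local_join(left, right, key):
    """Plain hash join that runs entirely inside one leaf."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def colocated_join(left_rows, right_rows, key, num_leaves):
    """Both tables are sharded on the join key, so matching rows always
    land on the same leaf and no leaf-to-leaf traffic is needed."""
    left_shards = shard(left_rows, key, num_leaves)
    right_shards = shard(right_rows, key, num_leaves)
    out = []
    for leaf in range(num_leaves):
        out.extend(local_join(left_shards.get(leaf, []), right_shards.get(leaf, []), key))
    return out


def broadcast_join(big_rows, small_rows, shard_key, join_key, num_leaves):
    """The big table is sharded on some unrelated key; every leaf holds a
    full copy of the small table, so the join is still local."""
    big_shards = shard(big_rows, shard_key, num_leaves)
    out = []
    for leaf in range(num_leaves):
        out.extend(local_join(big_shards.get(leaf, []), small_rows, join_key))
    return out
```

Either way each leaf can finish its share of the join alone, and the aggregator only has to concatenate the results.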

And finally, MemSQL’s column-store compression story — which I mangled in a previous post — goes like this:

There are numerous compression algorithm choices, both columnar (e.g. dictionary/tokenization, run-length encoding) and block (Lempel-Ziv, I presume in multiple variations).
Compression is block-by-block, something I hear more commonly these days than Vertica’s alternative of global compression choices.
The choice of compression scheme is automagic for each block, unless you give explicit hints.
Default block size for the columnar store is 10 million rows.
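As a rough picture of what automatic block-by-block selection could look like, the Python sketch below estimates the encoded size of each block under run-length, dictionary and Lempel-Ziv encoding and picks the smallest, unless a hint is given. The size formulas are crude assumptions of mine (4-byte run counts, 1- or 4-byte dictionary codes), zlib stands in for whatever Lempel-Ziv variants MemSQL uses, and the blocks are tiny compared with the default mentioned above.

```python
import zlib


def estimated_sizes(block):
    """Rough encoded size, in bytes, of one block of string values."""
    raw = "\n".join(block).encode("utf-8")
    # Run-length: one (value, 4-byte count) pair per run of equal neighbors.
    run_starts = [v for i, v in enumerate(block) if i == 0 or v != block[i - 1]]
    rle = sum(len(v.encode("utf-8")) + 4 for v in run_starts)
    # Dictionary: each distinct value stored once, plus a fixed-width code per row.
    distinct = set(block)
    code_width = 1 if len(distinct) <= 256 else 4
    dictionary = sum(len(v.encode("utf-8")) for v in distinct) + code_width * len(block)
    return {
        "raw": len(raw),
        "rle": rle,
        "dictionary": dictionary,
        "lz": len(zlib.compress(raw)),
    }


def choose_encoding(block, hint=None):
    """Pick the smallest estimated encoding, unless an explicit hint is given."""
    if hint is not None:
        return hint
    sizes = estimated_sizes(block)
    return min(sizes, key=sizes.get)


def encode_column(values, block_rows):
    """Make the choice independently for each block of `block_rows` rows."""
    return [
        choose_encoding(values[i:i + block_rows])
        for i in range(0, len(values), block_rows)
    ]


if __name__ == "__main__":
    column = ["a"] * 1000 + [f"user-{i}" for i in range(1000)]
    print(encode_column(column, block_rows=1000))
```

The point of choosing per block is visible even in this toy: the first block is one long run, the second has no repeats at all, and a single global choice would serve one of them badly.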

Wants vs. needs

Curt Monash — Sun, 23 Mar 2014 11:51:54 +0000

In 1981, Gerry Chichester and Vaughan Merlyn did a user-survey-based report about transaction-oriented fourth-generation languages, the leading application development technology of their day. The report included top-ten lists of important features during the buying cycle and after implementation. The items on each list were very similar — but the order of the items was completely different. And so the report highlighted what I regard as an eternal truth of the enterprise software industry:

What users value in the product-buying process is quite different from what they value once a product is (being) put into use.

Here are some thoughts about how that comes into play today.

Wants outrunning needs

1. For decades, BI tools have been sold in large part via demos of snazzy features the CEO would like to have on his desk. First it was pretty colors; then it was maps; now sometimes it’s “real-time” changing displays. Other BI features, however, are likely to be more important in practice.

2. In general, the need for “real-time” BI data freshness is often exaggerated. If you’re a human being doing a job that’s also often automated at high speed — for example network monitoring or stock trading — there’s a good chance you need fully human real-time BI. Otherwise, how much does a 5-15 minute delay hurt? Even if you’re monitoring website sell-through — are your business volumes really high enough that 5 minutes matters much? eBay answered “yes” to that question many years ago, but few of us work for businesses anywhere near eBay’s scale.

Even so, the want for speed keeps growing stronger.

3. Similarly, some desires for elastic scale-out are excessive. Your website selling koi pond accessories should always run well on a single server. If you diversify your business to the point that that’s not true, you’ll probably rewrite your app by then as well.

4. Some developers want to play with cool new tools. That doesn’t mean those tools are the best choice for the job. In particular, boring old SQL has merits — such as joins! — that shiny NoSQL hasn’t yet replicated.

5. Some developers, on the other hand, want to keep using their old tools, on which they are their employers’ greatest experts. That doesn’t mean those tools are the best choice for the job either.

6. More generally, some enterprises insist on brand labels that add little value but lots of expense. Yes, there are many benefits to vendor consolidation, and you may avoid many headaches if you stick with not-so-cutting-edge technology. But “enterprise-grade” hardware failure rates may not differ enough from “consumer-grade” ones to be worth paying for.

7. Some enterprises still insist on keeping their IT operations on-premises. In a number of cases, that perceived need is hard to justify.

8. Conversely, I’ve steered clients away from data warehouse appliances and toward, say, Vertica, because they had a clear desire to be cloud-ready. However, I’m not aware that any of those companies ever actually deployed Vertica in the cloud.

Needs ahead of wants

1. Enterprises often don’t realize how much their lives can be improved via a technology upgrade. Those queries that take 6 hours on your current systems, but only 6 minutes on the gear you’re testing? They’d probably take 15 minutes or less on any competitive product as well. Just get something reasonably modern, please!

2. Every application SaaS vendor should offer decent BI. Despite their limited scope, dashboards specific to the SaaS application will likely provide customer value. As a bonus, they’re also apt to demo well.

3. If your customer personal-identity data that resides on internet-facing systems isn’t encrypted — why not? And please don’t get me started on passwords that are stored and mailed around in plain text.

4. Notwithstanding what I said above about elasticity being overrated, buyers often either underrate their needs for concurrent usage, or else don’t do a good job of testing concurrency. A lot of performance disappointments are really problems with concurrency.

5. As noted above, it’s possible to underrate one’s need for boring old SQL goodness.

Wants and needs in balance

1. Twenty years ago, I thought security concerns were overwrought. But in an internet-connected world, with customer data privacy and various forms of regulatory compliance in play, wants and needs for security seem pretty well aligned.

2. There also was a time when ease of set-up and installation was underrated. Not any more, however; people generally understand its great importance.