I’ve suggested in the past that multi-data-center capabilities are important for “data sovereignty”/geo-compliance.
The need for geo-compliance just got a lot stronger, with the abolition of the European Union’s Safe Harbour rule for the US. If you collect data in multiple countries, you should be at least thinking about geo-compliance.
Cassandra is an established leader in multi-data-center operation.

But when I made that connection and checked in accordingly with my client Patrick McFadin at DataStax, I discovered that I’d been a little confused about how multi-data-center Cassandra works. The basic idea holds water, but the details are not quite what I was envisioning.

The story starts:

Cassandra groups nodes into logical “data centers” (i.e. token rings).
As a best practice, each physical data center can contain one or more logical data center, but not vice-versa.
There are two levels of replication — within a single logical data center, and between logical data centers.
Replication within a single data center is planned in the usual way, with the principal data center holding a database likely to have a replication factor of 3.
However, copies of the database held elsewhere may have different replication factors …
… and can indeed have different replication factors for different parts of the database.

In particular, a remote replication factor for Cassandra can = 0. When that happens, then you have data sitting in one geographical location that is absent from another geographical location; i.e., you can be in compliance with laws forbidding the export of certain data. To be clear (and this contradicts what I previously believed and hence also implied in this blog):

General multi-data-center operation is not what gives you geo-compliance, because the default case is that the whole database is replicated to each data center.
Instead, you get that effect by tweaking your specific replication settings.

Categories: Cassandra, Clustering, DataStax, HBase, NoSQL, Open source, Specific users, Surveillance and privacy

3 Comments

October 15, 2015

Basho and Riak

Basho was on my (very short) blacklist of companies with whom I refuse to speak, because they have lied about the contents of previous conversations. But Tony Falco et al. are long gone from the company. So when Basho’s new management team reached out, I took the meeting.

For starters:

Basho management turned over significantly 1-2 years ago. The main survivors from the old team are 1 each in engineering, sales, and services.
Basho moved its headquarters to Bellevue, WA. (You get one guess as to where the new CEO lives.) Engineering operations are very distributed geographically.
Basho claims that it is much better at timely product shipments than it used to be. Its newest product has a planned (or at least hoped-for) 8-week cadence for point releases.
Basho’s revenue is ~90% subscription.
Basho claims >200 enterprise clients, vs. 100-120 when new management came in. Unfortunately, I forgot to ask the usual questions about divisions vs. whole organizations, OEM sell-through vs. direct, etc.
Basho claims an average contract value of >$100K, typically over 2-3 years. $9 million of that (which would be close to half the total, actually), comes from 2 particular deals of >$4 million each.

Basho’s product line has gotten a bit confusing, but as best I understand things the story is:

There’s something called Riak Core, which isn’t even a revenue-generating product. However, it’s an open source project with some big users (e.g. Goldman Sachs, Visa), and included in pretty much everything else Basho promotes.
Riak KV is the key-value store previously known as Riak. It generates the lion’s share of Basho’s revenue.
Riak S2 is an emulation of Amazon S3. Basho thinks that Riak KV loses efficiency when objects get bigger than 1 MB or so, and that’s when you might want to use Riak S2 in addition or instead.
Riak TS is for time series, and just coming out now.
Also in the mix are some (extra charge) connectors for Redis and Spark. Presumably, there are more of these to come.
There’s an umbrella marketing term of “Basho Data Platform”.

Technical notes on some of that include: Read more

Categories: Aerospike, Basho and Riak, Cassandra, Clustering, Couchbase, Databricks, Spark and BDAS, DataStax, HBase, Health care, Log analysis, MapR, Market share and customer counts, MongoDB, NoSQL, Pricing, Specific users, Splunk

1 Comment

October 15, 2015

Couchbase 4.0 and related subjects

I last wrote about Couchbase in November, 2012, around the time of Couchbase 2.0. One of the many new features I mentioned then was secondary indexing. Ravi Mayuram just checked in to tell me about Couchbase 4.0. One of the important new features he mentioned was what I think he said was Couchbase’s “first version” of secondary indexing. Obviously, I’m confused.

Now that you’re duly warned, let me remind you of aspects of Couchbase timeline.

2 corporate name changes ago, Couchbase was organized to commercialize memcached. memcached, of course, was internet companies’ default way to scale out short-request processing before the rise of NoSQL, typically backed by manually sharded MySQL.
Couchbase’s original value proposition, under the name Membase, was to provide persistence and of course support for memcached. This later grew into a caching-oriented pitch even to customers who weren’t already memcached users.
A merger with the makers of CouchDB ensued, with the intention of replacing Membase’s SQLite back end with CouchDB at the same time as JSON support was introduced. This went badly.
By now, however, Couchbase sells for more than distributed cache use cases. Ravi rattled off a variety of big-name customer examples for system-of-record kinds of use cases, especially in session logging (duh) and also in travel reservations.
Couchbase 4.0 has been in beta for a few months.

Technical notes on Couchbase 4.0 — and related riffs 🙂 — start: Read more

Categories: Cache, Clustering, Couchbase, Data models and architecture, Databricks, Spark and BDAS, Exadata, Hadoop, MarkLogic, MongoDB, MySQL, NoSQL, Open source, Schema on need, Structured documents, Web analytics

1 Comment

March 10, 2015

Notes on HBase

I talked with a couple of Cloudera folks about HBase last week. Let me frame things by saying:

The closest thing to an HBase company, ala MongoDB/MongoDB or DataStax/Cassandra, is Cloudera.
Cloudera still uses a figure of 20% of its customers being HBase-centric.
HBaseCon and so on notwithstanding, that figure isn’t really reflected in Cloudera’s marketing efforts. Cloudera’s marketing commitment to HBase has never risen to nearly the level of MongoDB’s or DataStax’s push behind their respective core products.
With Cloudera’s move to “zero/one/many” pricing, Cloudera salespeople have little incentive to push HBase hard to accounts other than HBase-first buyers.

Also:

Cloudera no longer dominates HBase development, if it ever did.
- Cloudera is the single biggest contributor to HBase, by its count, but doesn’t make a majority of the contributions on its own.
- Cloudera sees Hortonworks as having become a strong HBase contributor.
- Intel is also a strong contributor, as are end user organizations such as Chinese telcos. Not coincidentally, Intel was a major Hadoop provider in China before the Intel/Cloudera deal.
As far as Cloudera is concerned, HBase is just one data storage technology of several, focused on high-volume, high-concurrency, low-latency short-request processing. Cloudera thinks this is OK because of HBase’s strong integration with the rest of the Hadoop stack.
Others who may be inclined to disagree are in several cases doing projects on top of HBase to extend its reach. (In particular, please see the discussion below about Apache Phoenix and Trafodion, both of which want to offer relational-like functionality.)

Categories: Cloudera, Clustering, Data models and architecture, Database diversity, Hadoop, HBase, Hortonworks, HP and Neoview, Intel, Market share and customer counts, NoSQL, Open source

4 Comments

March 4, 2015

Quick update on Tachyon

I’m on record as believing that:

Hadoop needs a memory-centric storage grid.
Tachyon is a strong candidate to fill the role.

That said:

It’s an open secret that there will be a Tachyon company. However, …
… no details have been publicized. Indeed, the open secret itself is still officially secret.
Tachyon technology, which just hit 0.6 a couple of days ago, still lacks many features I regard as essential.
As a practical matter, most Tachyon interest to date has been associated with Spark. This makes perfect sense given Tachyon’s origin and initial technical focus.
Tachyon was in 50 or more sites last year. Most of these sites were probably just experimenting with it. However …
… there are production Tachyon clusters with >100 nodes.

As a reminder of Tachyon basics: Read more

Categories: Clustering, Databricks, Spark and BDAS, Hadoop, Memory-centric data management

3 Comments

August 7, 2014

Actian Vector Hadoop Edition

I have a small blacklist of companies I won’t talk with because of their particularly unethical past behavior. Actian is one such; they evidently made stuff up about me that Josh Berkus gullibly posted for them, and I don’t want to have conversations that could be dishonestly used against me.

That said, Peter Boncz isn’t exactly an Actian employee. Rather, he’s the professor who supervised Marcin Zukowski’s PhD thesis that became Vectorwise, and I chatted with Peter by Skype while he was at home in Amsterdam. I believe his assurances that no Actian personnel sat in on the call. 🙂

In other news, Peter is currently working on and optimistic about HyPer. But we literally spent less than a minute talking about that

Before I get to the substance, there’s been a lot of renaming at Actian. To quote Andrew Brust,

… the ParAccel, Pervasive and Vectorwise technologies are being unified under the Actian Analytics Platform brand. Specifically, the ParAccel technology … is being re-branded Actian Matrix; Pervasive’s technologies are rechristened Actian DataFlow and Actian DataConnect; and Vectorwise becomes Actian Vector.

and

Actian … is now “one company, with one voice and one platform” according to its John Santaferraro

The bolded part of the latter quote is untrue — at least in the ordinary sense of the word “one” — but the rest can presumably be taken as company gospel.

All this is by way of preamble to saying that Peter reached out to me about Actian’s new Vector Hadoop Edition when he blogged about it last June, and we finally talked this week. Highlights include: Read more

Categories: Actian and Ingres, Clustering, Database compression, Hadoop, ParAccel, Pervasive Software, SQL/Hadoop integration, VectorWise, Workload management

4 Comments

June 8, 2014

Optimism, pessimism, and fatalism — fault-tolerance, Part 2

The pessimist thinks the glass is half-empty.
The optimist thinks the glass is half-full.
The engineer thinks the glass was poorly designed.

Most of what I wrote in Part 1 of this post was already true 15 years ago. But much gets added in the modern era, considering that:

Clusters will have node hiccups more often than single nodes will. (Duh.)
Networks are relatively slow even when uncongested, and furthermore congest unpredictably.
In many applications, it’s OK to sacrifice even basic-seeming database functionality.

And so there’s been innovation in numerous cluster-related subjects, two of which are:

Distributed query and update. When a database is distributed among many modes, how does a request access multiple nodes at once?
Fault-tolerance in long-running jobs.When a job is expected to run on many nodes for a long time, how can it deal with failures or slowdowns, other than through the distressing alternatives:
- Start over from the beginning?
- Keep (a lot of) the whole cluster’s resources tied up, waiting for things to be set right?

Distributed database consistency

When a distributed database lives up to the same consistency standards as a single-node one, distributed query is straightforward. Performance may be an issue, however, which is why we have seen a lot of:

Analytic RDBMS innovation.
Short-request applications designed to avoid distributed joins.
Short-request clustered RDBMS that don’t allow fully-general distributed joins in the first place.

But in workloads with low-latency writes, living up to those standards is hard. The 1980s approach to distributed writing was two-phase commit (2PC), which may be summarized as: Read more

Categories: Clustering, CouchDB, Data warehousing, Databricks, Spark and BDAS, Facebook, Hadoop, MapReduce, Sybase, Theory and architecture, VoltDB and H-Store

1 Comment

May 1, 2014

MemSQL update

I stopped by MemSQL last week, and got a range of new or clarified information. For starters:

Even though MemSQL (the product) was originally designed for OLTP (OnLine Transaction Processing), MemSQL (the company) is now focused on analytic use cases …
… which was the point of introducing MemSQL’s flash-based columnar option.
One MemSQL customer has a 100 TB “data warehouse” installation on Amazon.
Another has “dozens” of terabytes of data spread across 500 machines, which aggregate 36 TB of RAM.
At customer Shutterstock, 1000s of non-MemSQL nodes are monitored by 4 MemSQL machines.
A couple of MemSQL’s top references are also Vertica flagship customers; one of course is Zynga.
MemSQL reports encountering Clustrix and VoltDB in a few competitive situations, but not NuoDB. MemSQL believes that VoltDB is still hampered by its traditional issues — Java, reliance on stored procedures, etc.

On the more technical side: Read more

Categories: Clustering, Clustrix, Columnar database management, Data warehousing, Database compression, In-memory DBMS, MemSQL, NewSQL, NuoDB, Specific users, Vertica Systems, VoltDB and H-Store, Workload management, Zynga

18 Comments

March 23, 2014

Wants vs. needs

In 1981, Gerry Chichester and Vaughan Merlyn did a user-survey-based report about transaction-oriented fourth-generation languages, the leading application development technology of their day. The report included top-ten lists of important features during the buying cycle and after implementation. The items on each list were very similar — but the order of the items was completely different. And so the report highlighted what I regard as an eternal truth of the enterprise software industry:

What users value in the product-buying process is quite different from what they value once a product is (being) put into use.

Here are some thoughts about how that comes into play today.

Wants outrunning needs

1. For decades, BI tools have been sold in large part via demos of snazzy features the CEO would like to have on his desk. First it was pretty colors; then it was maps; now sometimes it’s “real-time” changing displays. Other BI features, however, are likely to be more important in practice.

2. In general, the need for “real-time” BI data freshness is often exaggerated. If you’re a human being doing a job that’s also often automated at high speed — for example network monitoring or stock trading — there’s a good chance you need fully human real-time BI. Otherwise, how much does a 5-15 minute delay hurt? Even if you’re monitoring website sell-through — are your business volumes really high enough that 5 minutes matters much? eBay answered “yes” to that question many years ago, but few of us work for businesses anywhere near eBay’s scale.

Even so, the want for speed keeps growing stronger. 🙂

3. Similarly, some desires for elastic scale-out are excessive. Your website selling koi pond accessories should always run well on a single server. If you diversify your business to the point that that’s not true, you’ll probably rewrite your app by then as well.

4. Some developers want to play with cool new tools. That doesn’t mean those tools are the best choice for the job. In particular, boring old SQL has merits — such as joins! — that shiny NoSQL hasn’t yet replicated.

5. Some developers, on the other hand, want to keep using their old tools, on which they are their employers’ greatest experts. That doesn’t mean those tools are the best choice for the job either.

6. More generally, some enterprises insist on brand labels that add little value but lots of expense. Yes, there are many benefits to vendor consolidation, and you may avoid many headaches if you stick with not-so-cutting-edge technology. But “enterprise-grade” hardware failure rates may not differ enough from “consumer-grade” ones to be worth paying for.

Categories: Benchmarks and POCs, Business intelligence, Cloud computing, Clustering, Data models and architecture, Data warehousing, NoSQL, Software as a Service (SaaS), Vertica Systems

3 Comments

Monash Research blogs

DBMS 2 covers database management, analytics, and related technologies.
Text Technologies covers text mining, search, and social software.
Strategic Messaging analyzes marketing and messaging strategy.
The Monash Report examines technology and public policy issues.
Software Memories recounts the history of the software industry.

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.

Links
- Monash Research
- White Papers
Admin
- Log in

Clustering

Differentiation in data management

Cassandra and privacy requirements

Basho and Riak

Couchbase 4.0 and related subjects

Notes on HBase

Quick update on Tachyon

Actian Vector Hadoop Edition

Optimism, pessimism, and fatalism — fault-tolerance, Part 2

MemSQL update

Wants vs. needs

Search our blogs and white papers

Monash Research blogs

User consulting

Vendor advisory

Monash Research highlights

Recent posts

Categories

Date archives

Admin