NewSQL

Discussion of NewSQL products and vendors such as Akiban, Tokutek, VoltDB and dbShards. See also transparent sharding.

July 31, 2013

“Disruption” in the software industry

I lampoon the word “disruptive” for being badly overused. On the other hand, I often refer to the concept myself. Perhaps I should clarify. 🙂

You probably know that the modern concept of disruption comes from Clayton Christensen, specifically in The Innovator’s Dilemma and its sequel, The Innovator’s Solution. The basic ideas are:

Market leaders serve high-end customers with complex, high-end products and services, often distributed through a costly sales channel.
Upstarts serve a different market segment, often cheaply and/or simply, perhaps with a different business model (e.g. a different sales channel).
Upstarts expand their offerings, and eventually attack the leaders in their core markets.

In response (this is the Innovator’s Solution part):

Leaders expand their product lines, increasing the value of their offerings in their core markets.
In particular, leaders expand into adjacent market segments, capturing margins and value even if their historical core businesses are commoditized.
Leaders may also diversify into direct competition with the upstarts, but that generally works only if it’s via a separate division, perhaps acquired, that has permission to compete hard with the main business.

But not all cleverness is “disruption”.

Routine product advancement by leaders — even when it’s admirably clever — is “sustaining” innovation, as opposed to the disruptive stuff.
Innovative new technology from small companies is not, in itself, disruption either.

Here are some of the examples that make me think of the whole subject. Read more

Categories: Business intelligence, Data warehousing, Hadoop, Microsoft and SQL*Server, MongoDB, MySQL, Netezza, NewSQL, NoSQL, Oracle, Predictive modeling and advanced analytics, QlikTech and QlikView, Tableau Software

July 20, 2013

The refactoring of everything

I’ll start with three observations:

Computer systems can’t be entirely tightly coupled — nothing would ever get developed or tested.
Computer systems can’t be entirely loosely coupled — nothing would ever get optimized, in performance and functionality alike.
In an ongoing trend, there is and will be dramatic refactoring as to which connections wind up being loose or tight.

As written, that’s probably pretty obvious. Even so, it’s easy to forget just how pervasive the refactoring is and is likely to be. Let’s survey some examples first, and then speculate about consequences. Read more

Categories: Business intelligence, Cloud computing, Clustering, Data models and architecture, Exadata, IBM and DB2, In-memory DBMS, Memory-centric data management, Microsoft and SQL*Server, NewSQL, NoSQL, Oracle, SAP AG, Software as a Service (SaaS), Telecommunications, Teradata, Workday

June 16, 2013

Webinar Wednesday, June 26, 1 pm EST — Real-Time Analytics

I’m doing a webinar Wednesday, June 26, at 1 pm EST/10 am PST called:

Real-Time Analytics in the Real World

The sponsor is MemSQL, one of my numerous clients to have recently adopted some version of a “real-time analytics” positioning. The webinar sign-up form has an abstract that I reviewed and approved … albeit before I started actually outlining the talk. 😉

Our plan is:

I’ll review the multiple technologies and use cases that various companies call “real-time analytics”. I’m not planning for this part to be at all MemSQL-focused.*
MemSQL will review some specific use cases they feel their product — memory-centric scale-out RDBMS — has proven it supports.

*MemSQL is debuting pretty high in my rankings of content sponsors who are cool with vendor neutrality. I sent them a draft of my slides mentioning other tech vendors and not them, and they didn’t blink.

In other news, I’ll be in California over the next week. Mainly I’ll be visiting clients — and 2 non-clients and some family — 10:00 am through dinner, but I did set aside time to stop by GigaOm Structure on Wednesday. I have sniffles/cough/other stuff even before I go. So please don’t expect a lot of posts until I’ve returned, rested up a bit, and also prepared my webinar deck.

Categories: Analytic technologies, In-memory DBMS, MemSQL, NewSQL, Parallelization

April 23, 2013

MemSQL scales out

The third of my three MySQL-oriented clients I alluded to yesterday is MemSQL. When I wrote about MemSQL last June, the product was an in-memory single-server MySQL workalike. Now scale-out has been added, with general availability today.

MemSQL’s flagship reference is Zynga, across 100s of servers. Beyond that, the company claims (to quote a late draft of the press release):

Enterprises are already using distributed MemSQL in production for operational analytics, network security, real-time recommendations, and risk management.

All four of those use cases fit MemSQL’s positioning in “real-time analytics”. Besides Zynga, MemSQL cites penetration into traditional low-latency markets — financial services (various subsectors) and ad-tech.

Highlights of MemSQL’s new distributed architecture start: Read more

Categories: Clustering, Database compression, Emulation, transparency, portability, Games and virtual worlds, Investment research and trading, Log analysis, MemSQL, MySQL, NewSQL, Transparent sharding, Zynga

April 22, 2013

Notes on TokuDB and GenieDB

Last week, I edited press releases back-to-back-to-back for three clients, all with announcements at this week’s Percona Live. The ones with embargoes ending today are Tokutek and GenieDB.

Tokutek’s news is that they’re open sourcing much of TokuDB, but holding back hot backup for their paid version. I approve of this strategy — “doesn’t lose data” is an important feature, and well worth paying for.

I kid, I kid. Any system has at least a bad way to do backups — e.g. one that involves slowing performance, or perhaps even requires taking applications offline altogether. So the real points of good backup technology are:

To keep performance steady.
To make the whole thing as easy to manage as possible.

GenieDB is announcing a Version 2, which is basically a performance release. So in lieu of pretending to have much article-worthy news, GenieDB is taking the opportunity to remind folks of its core marketing messages, with catchphrases such as “multi-regional self-healing MySQL”. Good choice; indeed, I wish more vendors would adopt that marketing tactic.

Along the way, I did learn a bit more about GenieDB. In particular:

GenieDB is now just backed by a hacked version of InnoDB (no more Berkeley DB Java Edition).
Why hacked? Because GenieDB appends a Lamport timestamp to every row, which somehow leads to a need to modify how indexes and caching work.
Benefits of the change include performance and simpler (for the vendor) development.
An arguable disadvantage of the switch is that GenieDB no longer can use Berkeley DB’s key-value interface — but MySQL now has one of those too.
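
For readers who haven't met Lamport timestamps, here is a minimal sketch of the general idea of stamping every row with one. It is not GenieDB's implementation: the names (`LamportClock`, `Row`, `write_row`, `merge_row`) are hypothetical, and the "newest timestamp wins" merge rule is an assumption made purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Timestamp = Tuple[int, int]  # (logical counter, node id); node id breaks ties


class LamportClock:
    """Logical clock: a per-node counter that only moves forward."""

    def __init__(self, node_id: int) -> None:
        self.node_id = node_id
        self.counter = 0

    def tick(self) -> Timestamp:
        """Advance for a local event (here, a write) and return its timestamp."""
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, remote: Timestamp) -> None:
        """On receiving a remote timestamp, move past it if it is ahead."""
        self.counter = max(self.counter, remote[0]) + 1


@dataclass
class Row:
    key: str
    value: str
    ts: Timestamp  # the timestamp appended to the row


def write_row(clock: LamportClock, table: Dict[str, Row], key: str, value: str) -> Row:
    """Local write: stamp the row with a fresh timestamp and store it."""
    row = Row(key, value, clock.tick())
    table[key] = row
    return row


def merge_row(clock: LamportClock, table: Dict[str, Row], incoming: Row) -> None:
    """Apply a replicated row. ASSUMPTION: the row with the larger timestamp wins."""
    clock.observe(incoming.ts)
    current = table.get(incoming.key)
    if current is None or incoming.ts > current.ts:
        table[incoming.key] = incoming
```

If node 1 writes a row and ships it to node 2, and node 2 then overwrites the same key, node 2's row carries the larger timestamp. Under the assumed rule, both nodes end up keeping it regardless of the order in which the replicated rows arrive.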

I also picked up some GenieDB company stats I didn’t know before — 9 employees and 2 paying customers.

Related links

Recent posts about TokuDB and GenieDB

Categories: GenieDB, Market share and customer counts, MySQL, NewSQL, Open source, Tokutek and TokuDB

April 14, 2013

Introduction to Deep Information Sciences and DeepDB

I talked Friday with Deep Information Sciences, makers of DeepDB. Much like TokuDB — albeit with different technical strategies — DeepDB is a single-server DBMS in the form of a MySQL engine, whose technology is concentrated around writing indexes quickly. That said:

DeepDB’s indexes can help you with analytic queries; hence, DeepDB is marketed as supporting OLTP (OnLine Transaction Processing) and analytics in the same system.
DeepDB is marketed as “designed for big data and the cloud”, with reference to “Volume, Velocity, and Variety”. What I could discern in support of that is mainly:
- DeepDB has been tested at up to 3 terabytes at customer sites and up to 1 billion rows internally.
- Like most other NewSQL and NoSQL DBMS, DeepDB is append-only, and hence could be said to “stream” data to disk.
- DeepDB’s indexes could at some point in the future be made to work well with non-tabular data.*
- The Deep guys have plans and designs for scale-out — transparent sharding and so on.

*For reasons that do not seem closely related to product reality, DeepDB is marketed as if it supports “unstructured” data today.

Other NewSQL DBMS seem “designed for big data and the cloud” to at least the same extent DeepDB is. However, if we’re interpreting “big data” to include multi-structured data support — well, only half or so of the NewSQL products and companies I know of share Deep’s interest in branching out. In particular:

Akiban definitely does. (Note: Stay tuned for some next-steps company news about Akiban.)
Tokutek has planted a small stake there too.
Key-value-store-backed NuoDB and GenieDB probably lean that way. (And SanDisk evidently shut down Schooner’s RDBMS while keeping its key-value store.)
VoltDB, Clustrix, ScaleDB and MemSQL seem more strictly tabular, except insofar as text search is a requirement for everybody. (Edit: Oops; I forgot about Clustrix’s approach to JSON support.)

Edit: MySQL has some sort of an optional NoSQL interface, and hence presumably so do MySQL-compatible TokuDB, GenieDB, Clustrix, and MemSQL.

Also, some of those products do not today have the transparent scale-out that Deep plans to offer in the future.

Categories: Akiban, Cloud computing, Clustrix, Columnar database management, Data models and architecture, Database compression, GenieDB, Market share and customer counts, Memory-centric data management, MySQL, NewSQL, NoSQL, NuoDB, OLTP, Oracle, ScaleDB, Schooner Information Technology, Tokutek and TokuDB, Transparent sharding, VoltDB and H-Store

March 31, 2013

Some notes on new-era data management

Hmm. I probably should have broken this out as three posts rather than one after all. Sorry about that.

Performance confusion

Discussions of DBMS performance are always odd, for starters because:

Workloads and use cases vary greatly.
In particular, benchmarks such as the YCSB or TPC-H aren’t very helpful.

But in NoSQL/NewSQL short-request processing, performance claims seem particularly confused. Reasons include but are not limited to:

It’s common for databases or at least working sets to be entirely in RAM — but it’s not always required.
Consistency and durability models vary. What’s more, in some systems — e.g. MongoDB — there’s considerable flexibility as to which model you use.
In particular, there’s an increasingly common choice in which data is written synchronously to RAM on 2 or more servers, then asynchronously to disk on each of them. Performance in these cases can be quite different from when all writes need to be committed to disk. Of course, you need sufficient disk I/O to keep up, so SSDs (Solid-State Drives) can come in handy.
Many workloads are inherently single node (replication aside). Others are not.
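
To make the "synchronous to RAM on 2 or more servers, asynchronous to disk" pattern concrete, here is a toy sketch. It is not modeled on any particular product. `Replica`, `replicated_write` and `flush` are hypothetical names, the "disk" is just a second dictionary, and the background flusher is represented by an explicit `flush()` call so that the behavior is deterministic.

```python
from typing import Dict, List, Tuple


class Replica:
    """One server: a RAM copy plus a simulated disk that lags behind it."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.ram: Dict[str, str] = {}
        self.disk: Dict[str, str] = {}  # stand-in for durable storage
        self.unflushed: List[Tuple[str, str]] = []

    def apply_in_ram(self, key: str, value: str) -> None:
        """Synchronous part: update RAM and remember what still needs flushing."""
        self.ram[key] = value
        self.unflushed.append((key, value))

    def flush(self) -> int:
        """Asynchronous part (run later); returns the number of writes persisted."""
        persisted = len(self.unflushed)
        for key, value in self.unflushed:
            self.disk[key] = value
        self.unflushed.clear()
        return persisted


def replicated_write(replicas: List[Replica], key: str, value: str,
                     min_copies: int = 2) -> int:
    """Acknowledge once the write is in RAM on every replica.

    ASSUMPTION: all replicas are reachable; a real system would also need
    timeouts, failure handling and recovery.
    """
    if len(replicas) < min_copies:
        raise ValueError("not enough replicas to acknowledge the write safely")
    for replica in replicas:
        replica.apply_in_ram(key, value)
    return len(replicas)  # number of in-memory copies at acknowledgement time
```

The write is acknowledged before anything reaches disk, which is where the latency difference comes from. The `unflushed` list is also why disk I/O has to keep up: if flushing is slower than the incoming write rate, that backlog grows without bound.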

MongoDB and 10gen

I caught up with Ron Avnur at 10gen. Technical highlights included: Read more

Categories: Benchmarks and POCs, Cassandra, Clustering, Couchbase, Data models and architecture, Data warehousing, EAI, EII, ETL, ELT, ETLT, Hadoop, HBase, In-memory DBMS, Investment research and trading, Market share and customer counts, MarkLogic, Memory-centric data management, MongoDB, NewSQL, NoSQL, Tokutek and TokuDB

February 21, 2013

One database to rule them all?

Perhaps the single toughest question in all database technology is: Which different purposes can a single data store serve well? — or to phrase it more technically — Which different usage patterns can a single data store support efficiently? Ted Codd was on multiple sides of that issue, first suggesting that relational DBMS could do everything and then averring they could not. Mike Stonebraker too has been on multiple sides, first introducing universal DBMS attempts with Postgres and Illustra/Informix, then more recently suggesting the world needs 9 or so kinds of database technology. As for me — well, I agreed with Mike both times. 🙂

Since this is MUCH too big a subject for a single blog post, what I’ll do in this one is simply race through some background material. To a first approximation, this whole discussion is mainly about data layouts — but only if we interpret that concept broadly enough to comprise:

Every level of storage (disk, RAM, etc.).
Indexes, aggregates and raw data alike.

To date, nobody has ever discovered a data layout that is efficient for all usage patterns. As a general rule, simpler data layouts are often faster to write, while fancier ones can boost query performance. Specific tradeoffs include, but hardly are limited to: Read more

Categories: Columnar database management, Data models and architecture, Data warehousing, Database compression, Database diversity, GenieDB, GIS and geospatial, Hadoop, IBM and DB2, MarkLogic, Michael Stonebraker, Microsoft and SQL*Server, NewSQL, NoSQL, Oracle, PostgreSQL, SAP AG, Solid-state memory, Storage, Structured documents, Text, Theory and architecture, Tokutek and TokuDB

January 28, 2013

Attack of the Frankenschemas

In typical debates, the extremists on both sides are wrong. “SQL vs. NoSQL” is an example of that rule. For many traditional categories of database or application, it is reasonable to say:

Relational databases are usually still a good default assumption …
… but increasingly often, the default should be overridden with a more useful alternative.

Reasons to abandon SQL in any given area usually start:

Creating a traditional relational schema is possible …
… but it’s tedious or difficult …
… especially since schema design is supposed to be done before you start coding.

Some would further say that NoSQL is cheaper, scales better, is cooler or whatever, but given the range of NewSQL alternatives, those claims are often overstated.

Sectors where these reasons kick in include but are not limited to: Read more

Categories: Health care, Investment research and trading, Log analysis, NewSQL, NoSQL, Web analytics

January 17, 2013

YCSB benchmark notes

Two different vendors recently tried to inflict benchmarks on me. Both were YCSBs, so I decided to look up what the YCSB (Yahoo! Cloud Serving Benchmark) actually is. It turns out that the YCSB:

Was developed by — you guessed it! — Yahoo.
Is meant to simulate workloads that fetch web pages, including the writing portions of those workloads.
Was developed with NoSQL data managers in mind.
Bakes in one kind of sensitivity analysis — latency vs. throughput.
Is implemented in extensible open source code.

That actually sounds pretty good, especially the extensibility part;* it’s likely that the YCSB can be useful in a variety of product selection scenarios. Still, as recent examples show, benchmark marketing is an annoying blight upon the database industry.

*With extensibility you can test your own workloads and do your own sensitivity analyses.
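
To illustrate what a latency vs. throughput sensitivity analysis looks like in practice, here is a toy sketch in Python. It is not the YCSB code or its API (YCSB itself is written in Java). `run_mix` and `sweep` are hypothetical names, the "data store" is an in-process dictionary, and the 95% read fraction is an arbitrary assumption.

```python
import random
import time
from typing import Dict, List


def run_mix(store: Dict[str, str], n_ops: int, read_fraction: float,
            target_ops_per_sec: float, seed: int = 0) -> List[float]:
    """Run a read/update mix at a throttled rate; return per-op latencies in seconds."""
    rng = random.Random(seed)
    keys = list(store)  # ASSUMPTION: the store is pre-loaded and non-empty
    interval = 1.0 / target_ops_per_sec
    latencies: List[float] = []
    next_start = time.perf_counter()
    for _ in range(n_ops):
        # Throttle: wait until this operation's scheduled start time.
        delay = next_start - time.perf_counter()
        if delay > 0:
            time.sleep(delay)
        key = rng.choice(keys)
        start = time.perf_counter()
        if rng.random() < read_fraction:
            _ = store[key]          # read
        else:
            store[key] = "updated"  # update
        latencies.append(time.perf_counter() - start)
        next_start += interval
    return latencies


def sweep(store: Dict[str, str], targets: List[float], n_ops: int = 200,
          read_fraction: float = 0.95) -> Dict[float, float]:
    """Map each target throughput (ops/sec) to the mean latency observed at it."""
    results: Dict[float, float] = {}
    for target in targets:
        latencies = run_mix(store, n_ops, read_fraction, target)
        results[target] = sum(latencies) / len(latencies)
    return results
```

Against a real data store, you would replace the dictionary operations with client calls and plot the output of `sweep`. The interesting part of the curve is usually where latency starts climbing as the target rate approaches what the system can sustain.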

A YCSB overview page features links both to the code and to the original explanatory paper. The clearest explanation of the YCSB I found there was: Read more