Theory and architecture

Analysis of design choices in databases and database management systems. Related subjects include:

June 8, 2014

Optimism, pessimism and fatalism — fault-tolerance, Part 1

Writing data management or analysis software is hard. This post and its sequel are about some of the reasons why.

When systems work as intended, writing and reading data is easy. Much of what’s hard about data management is dealing with the possibility — really the inevitability — of failure. So it might be interesting to survey some of the many ways that considerations of failure come into play. Some have been major parts of IT for decades; others, if not new, are at least newly popular in this cluster-oriented, RAM-crazy era. In this post I’ll focus on topics that apply to single-node systems; in the sequel I’ll emphasize topics that are clustering-specific.

Major areas of failure-aware design — and these overlap greatly — include:

Long-standing basics

In a single-server, disk-based configuration, techniques for database fault-tolerance start: Read more

May 6, 2014

Notes and comments, May 6, 2014

After visiting California recently, I made a flurry of posts, several of which generated considerable discussion.

Here is a catch-all post to complete the set.  Read more

May 2, 2014

Introduction to CitusDB

One of my lesser-known clients is Citus Data, a largely Turkish company headquartered in San Francisco. They make CitusDB, which puts a scale-out layer over a collection of fully functional PostgreSQL nodes, much like Greenplum and Aster Data before it. However, in contrast to those and other Postgres-based analytic MPP (Massively Parallel Processing) DBMS:

*One benefit to this strategy, besides the usual elasticity and recovery stuff, is that while PostgreSQL may be single-core for any given query, a CitusDB query can use multiple cores by virtue of hitting multiple PostgreSQL tables on each node.
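To make the multi-core point concrete, here is a minimal sketch of the general scatter-gather idea, not CitusDB's actual implementation. Everything in it is an assumption for illustration: the `SHARDS` lists stand in for PostgreSQL tables on a node, and `partial_aggregate` and `distributed_average` are hypothetical names. Each per-shard query runs on its own, and a coordinator merges the partial results.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each list stands in for one PostgreSQL table on a node.
SHARDS = [
    [10, 20, 30],
    [40, 50],
    [60],
]


def partial_aggregate(rows):
    """Per-shard work: return (sum, count), as one single-core query might."""
    return sum(rows), len(rows)


def distributed_average(shards):
    """Fan the per-shard query out, then merge the partial results."""
    # The threads only model the coordinator waiting on several backends at
    # once; in a real deployment each shard query would run in its own
    # PostgreSQL process, which is where the extra cores come in.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(partial_aggregate, shards))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


if __name__ == "__main__":
    print(distributed_average(SHARDS))
```

The average is computed from per-shard sums and counts rather than per-shard averages, because averaging the averages would give the wrong answer when shards differ in size.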

Citus has thrown a few things against the wall; for example, there are two versions of its product, one of which involves HDFS (Hadoop Distributed File System) and one of which doesn’t. But I think Citus’ focus will be scale-out PostgreSQL for at least the medium-term future. Citus does have actual customers, and they weren’t all PostgreSQL users previously. Still, the main hope, at least until the product is more built-out, is that existing PostgreSQL users will find CitusDB easy to adopt, in technology and price alike.

Read more

May 1, 2014

MemSQL update

I stopped by MemSQL last week, and got a range of new or clarified information. For starters:

On the more technical side: Read more

April 30, 2014

Cloudera, Impala, data warehousing and Hive

There’s much confusion about Cloudera’s SQL plans and beliefs, and the company has mainly itself to blame. That said, here’s what I think is going on.

And of course, as vendors so often do, Cloudera generally overrates both the relative maturity of Impala and the relative importance of the use cases in which its offerings – Impala or otherwise – shine.

Related links

April 30, 2014

Spark on fire

Spark is on the rise, to an even greater degree than I thought last month.

*Yes, my fingerprints are showing again.

The most official description of what Spark now contains is probably the “Spark ecosystem” diagram from Databricks. However, at the time of this writing it is slightly out of date, as per some email from Databricks CEO Ion Stoica (quoted with permission):

… but if I were to redraw it, SparkSQL will replace Shark, and Shark will eventually become a thin layer above SparkSQL and below BlinkDB.

With this change, all the modules on top of Spark (i.e., SparkStreaming, SparkSQL, GraphX, and MLlib) are part of the Spark distribution. You can think of these modules as libraries that come with Spark.

Read more

April 19, 2014

Necessary complexity

When I’m asked to talk to academics, the requested subject is usually a version of “What should we know about what’s happening in the actual market/real world?” I then try to figure out what the scholars could stand to hear that they perhaps don’t already know.

In the current case (Berkeley next Tuesday), I’m using the title “Necessary complexity”. I actually mean three different but related things by that, namely:

  1. No matter how cool an improvement you have in some particular area of technology, it’s not very useful until you add a whole bunch of me-too features and capabilities as well.
  2. Even beyond that, however, the simple(r) stuff has already been built. Most new opportunities are in the creation of complex integrated stacks, in part because …
  3. … users are doing ever more complex things.

While everybody on some level already knows all this, I think it bears calling out even so.

I previously encapsulated the first point in the cardinal rules of DBMS development:

Rule 1: Developing a good DBMS requires 5-7 years and tens of millions of dollars.

That’s if things go extremely well.

Rule 2: You aren’t an exception to Rule 1. 

In particular:

  • Concurrent workloads benchmarked in the lab are poor predictors of concurrent performance in real life.
  • Mixed workload management is harder than you’re assuming it is.
  • Those minor edge cases in which your Version 1 product works poorly aren’t minor after all.

My recent post about MongoDB is just one example of same.

Examples of the second point include but are hardly limited to: Read more

April 17, 2014

MongoDB is growing up

I caught up with my clients at MongoDB to discuss the recent MongoDB 2.6, along with some new statements of direction. The biggest takeaway is that the MongoDB product, along with the associated MMS (MongoDB Management Service), is growing up. Aspects include:

Read more

March 28, 2014

NoSQL vs. NewSQL vs. traditional RDBMS

I am frequently asked questions that boil down to:

The details vary with context — e.g. sometimes MySQL is a traditional RDBMS and sometimes it is a new kid — but the general class of questions keeps coming. And that’s just for short-request use cases; similar questions for analytic systems arise even more often.

My general answers start:

In particular, migration away from legacy DBMS raises many issues:  Read more

March 23, 2014

DBMS2 revisited

The name of this blog comes from an August, 2005 column. 8 1/2 years later, that analysis holds up pretty well. Indeed, I’d keep the first two precepts exactly as I proposed back then:

I’d also keep the general sense of the third precept, namely appropriately-capable data integration, but for that one the specifics do need some serious rework.

For starters, let me say: Read more
