Michael Stonebraker – DBMS 2 : DataBase Management System Services

Readings in Database Systems

Curt Monash — Thu, 10 Dec 2015 12:26:40 +0000

Mike Stonebraker and Larry Ellison have numerous things in common. If nothing else:

They’re both titanic figures in the database industry.
They both gave me testimonials on the home page of my business website.
They both have been known to use the present tense when the future tense would be more accurate.

I mention the latter because there’s a new edition of Readings in Database Systems, aka the Red Book, available online, courtesy of Mike, Joe Hellerstein and Peter Bailis. Besides the recommended-reading academic papers themselves, there are 12 survey articles by the editors, and an occasional response where, for example, editors disagree. Whether or not one chooses to tackle the papers themselves — and I in fact have not dived into them — the commentary is of great interest.

But I would not take every word as the gospel truth, especially when academics describe what they see as commercial market realities. In particular, as per my quip in the first paragraph, the data warehouse market has not yet gone to the extremes that Mike suggests,* if indeed it ever will. And while Joe is close to correct when he says that the company Essbase was acquired by Oracle, what actually happened is that Arbor Software, which made Essbase, merged with Hyperion Software, and the latter was eventually indeed bought by the giant of Redwood Shores.**

*When it comes to data warehouse market assessment, Mike seems to often be ahead of the trend.

**Let me interrupt my tweaking of very smart people to confess that my own commentary on the Oracle/Hyperion deal was not, in retrospect, especially prescient.

Mike pretty much opened the discussion with a blistering attack against hierarchical data models such as JSON or XML. To a first approximation, his views might be summarized as:

Logical hierarchical models can be OK in certain cases. In particular, JSON could be a somewhat useful datatype in an RDBMS.
Physical hierarchical models are horrible.
Rather, you should implement the logical hierarchical model over a columnar RDBMS.

My responses start:

Nested data structures are more important than Mike’s discussion seems to suggest.
Native XML and JSON stores are apt to have an index on every field. If you squint, that index looks a lot like a column store.
Even NoSQL stores should, and I think in most cases will, have some kind of SQL-like DML (Data Manipulation Language). In particular, there should be some ability to do joins, because total denormalization is not always a good choice.
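Here is a minimal Python sketch of the squinting, assuming flat JSON-like documents with scalar top-level values. The function names (`index_every_field`, `as_column`) are hypothetical, and this is not how MarkLogic or any other product actually builds its indexes. It only shows that a per-field value-to-document-ID index carries the same information as a column laid out by document position.

```python
from collections import defaultdict


def index_every_field(docs):
    """Build an inverted index for every top-level field: field -> value -> doc ids."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, doc in enumerate(docs):
        for field, value in doc.items():
            index[field][value].append(doc_id)  # assumes scalar, hashable values
    return index


def as_column(index, field, n_docs):
    """Reassemble one field's index into a column whose position is the doc id."""
    column = [None] * n_docs  # None marks documents that lack the field
    for value, doc_ids in index[field].items():
        for doc_id in doc_ids:
            column[doc_id] = value
    return column


docs = [
    {"name": "Ann", "city": "Boston"},
    {"name": "Bob", "city": "Austin"},
    {"name": "Cy", "city": "Boston", "age": 40},
]
idx = index_every_field(docs)
print(as_column(idx, "city", len(docs)))  # ['Boston', 'Austin', 'Boston']
print(dict(idx["city"]))                  # {'Boston': [0, 2], 'Austin': [1]}
```

Going from index to column is a pure reshuffle, which is why an index-on-every-field store and a column store end up with similar scan and storage characteristics for single-field queries.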

In no particular order, here are some other thoughts about or inspired by the survey articles in Readings in Database Systems, 5th Edition.

I agree that OLTP (OnLine Transaction Processing) is transitioning to main memory.
I agree with the emphasis on “data in motion”.
While I needle him for overstating the speed of the transition, Mike is right that columnar architectures are winning for analytics. (Or you could say they’ve won, if you recognize that mop-up from the victory will still take 1 or 2 decades.)
The guys seem to really hate MapReduce, which is an old story for Mike, but a bit of a reversal for Joe.
MapReduce is many things, but it’s not a data model, and it’s also not something that Hadoop 1.0 was an alternative to. Saying each of those things was sloppy writing.
The guys characterize consistency/transaction isolation as a rather ghastly mess. That part was an eye-opener.
Mike is a big fan of arrays. I suspect he’s right in general, although I also suspect he’s overrating SciDB. I also think he’s somewhat overrating the market penetration of cube stores, aka MOLAP.
The point about Hadoop (in particular) and modern technologies in general showing the way to modularization of DBMS is an excellent one.
Joe and Mike disagreed about analytics; Joe’s approach rang truer for me. My own opinion is:
- Business intelligence has been important for quite a while, and won’t stop.
- Machine learning is becoming ever more important.
- It’s still early days for the integration of the two areas, but much more will come.
The challenge of whether anybody wants to do machine learning (or other advanced analytics) over a DBMS is sidestepped in part by the previously mentioned point about the modularization of a DBMS. Hadoop, for example, can be both an OK analytic DBMS (although not fully competitive with mature, dedicated products) and of course also an advanced analytics framework.
Similarly, except in the short term, I’m not worried about the limitations of Spark’s persistence mechanisms. Almost every commercial distribution of Spark I can think of is part of a package that also contains a more mature data store.
Versatile DBMS and analytic frameworks suffer strategic contention for memory, with different parts of the system wanting to use it in different ways. Raising that as a concern about the integration of analytic DBMS with advanced analytic frameworks is valid.
I used to overrate the importance of abstract datatypes, in large part due to Mike’s influence. I got over it. He should too. They’re useful, to the point of being a checklist item, but not a game-changer. A big part of the problem is what I mentioned in the previous point — different parts of a versatile DBMS would prefer to do different things with memory.
I used to overrate the importance of user-defined functions in an analytic RDBMS. Mike had nothing to do with my error. I got over it. He should too. They’re useful, to the point of being a checklist item, but not a game-changer. Looser coupling between analytics and data management seems more flexible.
Excellent points are made about the difficulties of “First we build the perfect schema” data warehouse projects and, similarly, MDM (Master Data Management).
There’s an interesting discussion that helps explain why optimizer progress is so slow (both for the industry in general and for each individual product).

Related links

I did a deep dive into MarkLogic’s indexing strategy in 2008, which informed my comment about XML/JSON stores above.
Again with MarkLogic as the focus, in 2010 I was skeptical about document stores not offering joins. MarkLogic has since capitulated.
I’m not current on SciDB, but I did write a bit about it in 2010.
I’m surprised that I can’t find a post to point to about modularization of DBMS. I’ll leave this here as a placeholder until I can.
Edit: As promised, I’ve now posted about the object-relational/abstract datatype boom of the 1990s.

One database to rule them all?

Curt Monash — Thu, 21 Feb 2013 05:52:05 +0000

Perhaps the single toughest question in all database technology is: Which different purposes can a single data store serve well? — or to phrase it more technically — Which different usage patterns can a single data store support efficiently? Ted Codd was on multiple sides of that issue, first suggesting that relational DBMS could do everything and then averring they could not. Mike Stonebraker too has been on multiple sides, first introducing universal DBMS attempts with Postgres and Illustra/Informix, then more recently suggesting the world needs 9 or so kinds of database technology. As for me — well, I agreed with Mike both times.

Since this is MUCH too big a subject for a single blog post, what I’ll do in this one is simply race through some background material. To a first approximation, this whole discussion is mainly about data layouts — but only if we interpret that concept broadly enough to comprise:

Every level of storage (disk, RAM, etc.).
Indexes, aggregates and raw data alike.

To date, nobody has ever discovered a data layout that is efficient for all usage patterns. As a general rule, simpler data layouts are often faster to write, while fancier ones can boost query performance. Specific tradeoffs include, but hardly are limited to:

Big blocks of data compress better, and can also be faster to retrieve than a number of smaller blocks holding the same amount of data. Small blocks of data can be less wasteful to write. And different kinds of storage have different minimum block sizes.
Operating on compressed data offers multiple significant efficiencies. But you have to spend cycles (de)compressing it, and it’s only practical for some compression schemes.
Fixed-length tabular records can let you compute addresses rather than looking them up in indexes. Yay! But they also waste space.
Tokenization can help with the fixed-/variable-length tradeoff.
Pointers are wonderfully efficient for some queries, at least if you’re not using spinning disk. But they can create considerable overhead to write and update.
Indexes, materialized views, etc. speed query performance, but can be costly to write and maintain.
Storing something as a BLOB (Binary Large OBject), key-value payload, etc. is super-fast — but if you want to look at it, you usually have to pay for retrieving the whole thing.
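Two of the tradeoffs above, computed addresses for fixed-length records and tokenization to tame variable-length fields, can be shown in a few lines of Python. This is a minimal sketch: the 16-byte record layout and the helper names (`tokenize`, `build_table`, `fetch`) are illustrative assumptions, not any product’s storage format.

```python
import struct

# Illustrative fixed-length layout (an assumption, not any product's format):
# 4-byte int row id, 4-byte int city token, 8-byte float amount; "<" means no padding.
RECORD = struct.Struct("<iid")  # RECORD.size == 16 bytes


def tokenize(values):
    """Dictionary-encode variable-length strings as fixed-width integer tokens."""
    codes, strings, tokens = {}, [], []
    for v in values:
        if v not in codes:
            codes[v] = len(strings)
            strings.append(v)
        tokens.append(codes[v])
    return strings, tokens


def build_table(rows):
    """Pack (id, city, amount) rows into one contiguous buffer of fixed-length records."""
    strings, tokens = tokenize([city for _, city, _ in rows])
    buf = bytearray(RECORD.size * len(rows))
    for i, ((row_id, _, amount), token) in enumerate(zip(rows, tokens)):
        RECORD.pack_into(buf, i * RECORD.size, row_id, token, amount)
    return buf, strings


def fetch(buf, strings, i):
    """Compute the i-th record's byte offset directly; no index lookup is needed."""
    row_id, token, amount = RECORD.unpack_from(buf, i * RECORD.size)
    return row_id, strings[token], amount


rows = [(1, "Boston", 9.5), (2, "Austin-Round Rock", 12.0), (3, "Boston", 3.25)]
buf, strings = build_table(rows)
print(len(buf))                # 48
print(fetch(buf, strings, 2))  # (3, 'Boston', 3.25)
```

The city names never sit inside the fixed-length records, so a long value costs the same 4 bytes per row as a short one; the price is a dictionary lookup on the way out, and a dictionary that must be maintained on every write.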

What’s more, different data layouts can have different implications for logging, locking, replication, backup and more.

So what would happen if somebody tried to bundle all conceivable functionality into a single DBMS, with a plan to optimize the layout of any particular part of the database as appropriate? I think the outcome would be tears — for the development effort would be huge, while the benefits would be scanty. The most optimistic cost estimates could run in the 100s of millions of dollars, with more realistic ones adding a further order of magnitude. But no matter what the investment, the architects would be on the horns of a nasty dilemma:

If there’s much commonality among the component DBMS, each one would be sub-optimal.
If there’s little commonality among them, then there’s also little benefit to the combination.

Adding insult to injury, all the generality would make it hard to select optimum hardware for this glorious DBMS — unless, of course, a whole other level of development effort made it work well across very heterogeneous clusters.

Less megalomaniacally, there have been many attempts to combine two or more alternate data layouts in a single DBMS, with varying degrees of success. In the relational-first world:

Analytic DBMS have combined row and column data models so fluidly that I’ve made fun of Oracle for not being able to pull it off. SAP HANA sort of does the same thing, but perhaps with a columnar bias, and not just for analytics.
Relational DBMS can also have a variety of index types, suitable for different relational use cases. This is especially true for analytic uses of general-purpose RDBMS.
Oracle, DB2, PostgreSQL, and Informix have had full extensibility architectures since the 1990s. That said:
- Almost all the extensions come from the DBMS vendors themselves.
- Extensions that resemble (or are) a tabular datatype — for example geospatial or financial-date — are often technically well-regarded.
- Others are usually not so strong technically, but in a few cases sell well anyway (e.g. Oracle Text).
- Microsoft never went to the trouble of offering full extensibility, but otherwise the SQL Server story is similar.
- Sybase’s extensibility projects went badly in the 1990s, and Sybase doesn’t seem to have tried hard in that area since.
IBM DB2, Microsoft SQL Server, and Oracle added XML capabilities around the middle of the last decade.
Analytic platforms can wind up with all sorts of temporary data structures.
Analytic DBMS have various ways to reach out and touch Hadoop.

Further:

Non-relational DBMS commonly have indexes that at least support relational-like SELECTs. JOINs can be more problematic, but MarkLogic finally has them. Tokutek even offers a 3rd-party indexing option for MongoDB.
Hadoop is growing into what is in effect a family of DBMS and other data stores: generic HDFS, HBase, generic Hive, Impala, and so on. At the moment, however, none of them is very mature. BDAS/Spark/Shark ups the ante further, but of course that’s even less mature.
Hadapt combines Hadoop and PostgreSQL.
DataStax combines Cassandra, Hadoop, and Solr.
Akiban fondly thinks its data layouts are well-suited for relational tables, JSON, and XML alike. (But business at Akiban may be in flux.)
GenieDB (Version 1 only) and NuoDB are both implemented over key-value stores. GenieDB Version 2 is implemented over Berkeley DB or MySQL.
Membase/Couchbase was first implemented over SQLite, then over (a forked version of) CouchDB.

Related links

A taxonomy of database use cases (July, 2012)
An early form of this discussion in the single domain of analytic RDBMS (February, 2009)

Soundbites: the Facebook/MySQL/NoSQL/VoltDB/Stonebraker flap, continued

Curt Monash — Fri, 15 Jul 2011 08:27:18 +0000

As a follow-up to the latest Stonebraker kerfuffle, Derrick Harris asked me a bunch of smart follow-up questions. My responses and afterthoughts include:

Facebook et al. are in effect Software as a Service (SaaS) vendors, not enterprise technology users. In particular:
- They have the technical chops to rewrite their code as needed.
- Unlike packaged software vendors, they’re not answerable to anybody for keeping legacy code alive after a rewrite. That makes migration a lot easier.
- If they want to write different parts of their system on different technical underpinnings, nobody can stop them. For example …
- … Facebook innovated Cassandra, and is now heavily committed to HBase.
It makes little sense to talk of Facebook’s use of “MySQL.” Better to talk of Facebook’s use of “MySQL + memcached + non-transparent sharding.” That said:
- It’s hard to see why somebody today would use MySQL + memcached + non-transparent sharding for a new project. At least one of Couchbase or transparently-sharded MySQL is very likely a superior alternative. Other alternatives might be better yet.
- As noted above in the example of Facebook, the many major web businesses that are using MySQL + memcached + non-transparent sharding for existing projects can be presumed able to migrate away from that stack as the need arises.

Continuing with that discussion of DBMS alternatives:

If you just want to write to the memcached API anyway, why not go with Couchbase?
If you want to go relational, why not go with MySQL? There are many alternatives for scaling or accelerating MySQL — dbShards, Schooner, Akiban, Tokutek, ScaleBase, ScaleDB, Clustrix, and Xeround come to mind quickly, so there’s a great chance that one or more will fit your use case. (And if you don’t get the choice of MySQL flavor right the first time, porting to another one shouldn’t be all THAT awful.)
If you really, really want to go in-memory, and don’t mind writing Java stored procedures, and don’t need to do the kinds of joins it isn’t good at, but do need to do the kinds of joins it is, VoltDB could indeed be a good alternative.

And while we’re at it — going schema-free often makes a whole lot of sense. I need to write much more about the point, but for now let’s just say that I look favorably on the Big Four schema-free/NoSQL options of MongoDB, Couchbase, HBase, and Cassandra.

An odd claim attributed to Mike Stonebraker

Curt Monash — Thu, 14 Jul 2011 11:10:34 +0000

This post has a sequel.

Last week, Mike Stonebraker insulted MySQL and Facebook’s use of it, by implication advocating VoltDB instead. Kerfuffle ensued. To the extent Mike was saying that non-transparently sharded MySQL isn’t an ideal way to do things, he’s surely right. That still leaves a lot of options for massive short-request databases, however, including transparently sharded RDBMS, scale-out in-memory DBMS (whether or not VoltDB*), and various NoSQL options. If nothing else, Couchbase would seem superior to memcached/non-transparent MySQL if you were starting a project today.

*The big problem with VoltDB, last I checked, was its reliance on Java stored procedures to get work done.

Pleasantries continued in The Register, which got an amazing-sounding quote from Mike. If The Reg is to be believed — something I wouldn’t necessarily take for granted — Mike claimed that he (i.e. VoltDB) knows how to solve the distributed join performance problem.

So, it’s Stonebraker against the web. And the difference of opinion is severe. In May, at a MongoDB developer conference in San Francisco, Mongo creator Dwight Merriman told his audience there was “no way” to do distributed joins in a way that really scales. “I’m not smart enough to do distributed joins that scale horizontally, widely, and are super fast. You have to choose something else. We have no choice but to not be relational,” he said.

“You can do distributed transactions, but if you do them with no loss of generality and you do them across a thousand machines, it’s not going to be that fast.”

Stonebraker says precisely the opposite, and in typical fashion, he goes right for the jugular. “I reject what Merriman says out of hand,” he tells The Register. Merriman and his company, 10gen, declined to comment for this story. But Stonebraker says words don’t matter. As much as he likes to wield his opinions, he insists the debate will be decided elsewhere. “Let the bake-off begin,” he crows.

But when last I checked, VoltDB made nowhere near that claim. And well it shouldn’t have. In the fully general case, there’s no way to ensure super distributed join performance other than by throwing lots and lots of gear at the problem. But if you do that, many alternatives are fast. More specialized cases may be a different matter — but there are many fast alternatives for those too.

I imagine there will be use cases for which VoltDB sustains a lead as the truly fastest alternative, similarly-architected competitors perhaps excepted.* But what Mike supposedly said seems quite forward-leaning when compared to technical reality.

*The canonical VoltDB use case is e-commerce in virtual goods, the point of “virtual” being that physical inventory might necessitate costlier kinds of joins.

Now we know why Vertica has been so weirdly evasive

Curt Monash — Mon, 14 Feb 2011 16:34:10 +0000

Communicating with Vertica has been tricky recently. But HP has now announced that it is buying Vertica, which pretty much forces me to comment about Vertica. So I’ll indulge in a little explanation of what I know about Vertica, whether for publication or under NDA. My analysis of the HP/Vertica combination, and expectations for same, will go into another post.

Vertica parted ways with marketing VP Dave Menninger in June. I started working with his successor, but despite seeming smart and energetic, she didn’t last long. Her successor didn’t even last long enough for me to meet him. And Vertica’s Colin Mahony, who was filling the gap, was a bit evasive.

I did have a recent NDA briefing with Vertica (Colin plus Shilpa Lawande). When I asked about announcements for this week (the TDWI conference is a common time for announcements), Colin told me there would be a few partnerships, and that one of them would go beyond Barney. I’ve got to give him credit for underselling on that score.

I asked Colin about Vertica’s stated figure of 328 customers by year-end 2010. He assured me that 250 or so were end-sale customers, with the rest being OEM sell-through. In all other ways I could think to ask about, Vertica’s stated customer count sounds clean — revenue recognized, not just for a paid POC, and so on.

By the way, Vertica has impressive market share among flashy internet companies, especially for an East Coast company — Twitter, Mozilla, a large fraction of the larger Facebook game vendors, and surely others that I’m forgetting as well.

Finally, let me point out that two other oddities go together, namely that:

Vertica has positioned itself as an analytic platform company despite not obviously having the technology to back that up.
Vertica went retro in its marketing with some Mike Stonebraker column-store architectural tub-thumping, and then removed the post a few days later when it came under fire.

Obviously — and I can also confirm both parts of this based on recent Vertica discussions — Vertica thinks it will soon have strong analytic platform technology, and doesn’t want to get mired in its “It’s Columnar!!!” marketing strategy of the past.

As for why that post ever went up in the first place — well, YOU try telling Mike Stonebraker not to say something that’s on his mind.

I do actually have quite a few details of product plans and customer success under NDA. I’ll think about what I can or can’t expose, and then perhaps write a more forward-looking HP/Vertica post.

Architectural options for analytic database management systems

Curt Monash — Tue, 18 Jan 2011 14:22:09 +0000

Mike Stonebraker recently kicked off some discussion about desirable architectural features of a columnar analytic DBMS. Let’s expand the conversation to cover desirable architectural characteristics of analytic DBMS in general. But first, a few housekeeping notes:

This is a very long post.
Even so, to keep it somewhat manageable, I’ve cut corners on completeness. Most notably, two important areas are entirely deferred to future posts — advanced-analytics-specific architecture, and in-memory processing (including CEP).
The subjects here are not strictly parallel. The distinction between major add-on modules and “turtles all the way down” core architectural choices is rarely crystal-clear — Mike Stonebraker’s recent post notwithstanding — and I’ve mixed subjects of varying degrees of “fundamentalness” pretty freely.
There’s a long list of links at the end, pointing at posts that help explain or give examples of specific features named in the body of the text, somewhat like unnumbered footnotes.

OK. In my opinion, the four drop-dead requirements for an analytic DBMS are:

Relational/SQL support. That’s how you get great flexibility in more or less easily constructing queries, as well as connectivity to a vast number of tools. In a few cases, I guess MDX might suffice as an alternative.
Sufficiently great query performance, on the queries you’re actually going to run, for however many concurrent users you actually will have.
Sufficiently high data loading throughput and sufficiently low data loading latency.
Sufficiently favorable TCO (Total Cost of Ownership), all things considered, where “all things” at a minimum includes software license, software maintenance, hardware, power, people costs for administration, and people costs for development.

Depending on your use case, you might have additional make-or-break requirements. Possible areas include:

Additional query functionality, of course with good performance. Specific examples include:
- ANSI-standard SQL features that are not universally supported (e.g. windowing).
- Geospatial datatype support.
Further high-performing integrated analytics, such as:
- Data mining/machine learning modeling and scoring.
- Other mathematical functions, such as linear algebra, optimization, or Monte Carlo simulation.
- Extensibility via MapReduce and/or sufficiently robust user-defined function (UDF) capabilities.
Platform support that matches your needs.
Security, auditability, and/or high-performance encryption.

Other possibly important features — but ones that would usually go on “nice to have” rather than “must have” lists — include:

Yet more query functionality, in areas such as:
- Non-standard SQL extensions (e.g. temporal ones).
- Specific prepackaged UDFs.
- Cross-column text search.
Nice administrative tools, in areas such as:
- Single-query performance/optimization.
- Authorization/permission.
- Workload management.
- Data mart spin-out.

So what kinds of architectural choices (or major features) should one look to in order to meet such requirements? On the performance side there are many candidates, including:

Specialized indexes, more commonly found in older DBMS. Leading examples include star and especially bitmap indices, both of which I was already writing about back in the 1990s. Ditto materialized views, which aren’t exactly indices, but are closely related.
Partition elimination. Single- or multi-level range partitioning can cause whole regions of the database never to be checked in a particular query’s evaluation. (That’s a good thing.) The functionality popularized by Netezza as zone maps does something similar, without requiring the partitions to be chosen in advance.
Scan-friendliness. If a query runs a long time, it may include a lot of (full or partial) table scanning. Assuming you rely on spinning disk — as opposed to solid-state memory — one way to improve your sequential-scan throughput far above your random-read throughput is to support large block sizes.
Parallelism. It’s possible to screw up even multi-core parallelism, but the big issue is multi-server. In particular:
- An analytic DBMS must avoid a “fat head” bottleneck, either because there is no head node at all directing things, or else because data redistribution algorithms are sufficiently mature as to not overload it. (In naive parallel DBMS implementations, intermediate query results get sent back to the head node to be, for example, JOINed together. This is not a good thing.)
- Multiple analytic DBMS vendors have chosen to develop custom data transfer protocols, for more reliable performance than they can get from TCP/IP. Examples include Teradata, Netezza, and ParAccel.
Predicate pushdown. Predicate pushdown takes several forms, in all cases having the goal of executing certain simpler database operations — predicate evaluations — close to the data, thus minimizing I/O or upstream processing.
- Netezza famously offloads the first part of predicate evaluation to FPGA (Field-Programmable Gate Array) chips.
- At least in theory, I like the Exadata form of node specialization, in which a tier of server nodes does the first part of the processing, with the results being sent to a second upstream database tier. But it’s not obvious that any RDBMS vendor has done a great job with it. Oracle is famously secretive about Exadata’s track record, and as of this writing apparently still resists on-site benchmarks. Calpont hasn’t accomplished much. And MarkLogic of course doesn’t sell an RDBMS.
- There’s reason to think predicate pushdown would help exploit flash memory, although I’m not sure vendors are moving in a direction that will let us find out.
Columnar data storage. Columnar storage is pretty much the ultimate in predicate pushdown, and advantageous in many analytic query scenarios. (Main exception: When you’re bringing back the majority of a row anyway, you might as well fetch the thing pre-assembled.) As Mike Stonebraker points out, columnar storage should not incur serious row-ID overhead, and ideally should be available for multiple sort orders on each column.
Compression. This, rightly, is another of Mike Stonebraker’s favorite features. Database compression is hugely important, for I/O and in silicon alike. (And it can also save money on storage.) There is a broad variety of compression techniques, suited for different kinds of data, different kinds of queries, or different points on the storage saving/decompression performance tradeoff spectrum.
Flexible storage. Not all data is best stored the same way, even if it’s in the same database. Some is destined for columnar-friendly use cases, some for whole-row ones. Some is compressed ideally by one technique, some by another. And so on. Some database managers do a good job of letting different parts of the database (even within the same table) be stored in different ways.
Query pipelining. There are a lot of steps to query execution, in both the fine-grained sense (a whole lot of rows) and the coarse-grained (all but the simplest execution plans feature a number of operations each). FPGA-based vendors XtremeData and Kickfire used the innate parallelism of an FPGA to pipeline query execution. Kickfire failed, and XtremeData hasn’t sold many systems, but that doesn’t mean it isn’t a good idea. Kickfire’s assets were sold to Teradata. Meanwhile, VectorWise’s very name speaks to its (Intel-based) vector processing architecture.
Result set reuse. Instead of mixing together different steps of the same query, how about mixing together the same step in different queries, so that you don’t have to repeat it? As a simple example, suppose two queries need to do the same table scan. Well then, it would be nice to only do the scan once. In most cases, query workloads are too diverse for result set reuse of that kind to be very important; still, it’s a cool feature, which Teradata calls synchronized scan.
Suitably optimized execution engine — column, row, whatever. (This is Mike Stonebraker’s “inner loop” point generalized.)
Well-factored query optimizer. No matter what, it’s good for a query optimizer to have been through a few rounds of Bottleneck Whack-A-Mole. Beyond that, an optimizer with sufficiently convenient hooks can have cool and occasionally valuable features such as:
- On-the-fly query re-planning. Do part of the query, rerun column statistics, and re-plan the query if appropriate.
- Not-so-black-box optimization. Work interactively with the DBA to find the best query plan.
- Query rewriting. Any decent optimizer will take a complex query and produce an execution plan that in some cases looks quite unlike the original query. Some optimizers go further in rewriting the query first, essentially to psych themselves into coming up with a better plan.
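Zone maps, mentioned under partition elimination above, are simple enough to sketch. In this minimal Python version, assuming a single integer column split into fixed-size blocks, each block records its minimum and maximum, and a range scan skips any block whose [min, max] interval cannot overlap the predicate. The block size and all names are hypothetical; this is not Netezza’s or anyone else’s actual implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    values: List[int]
    lo: int  # zone-map entry: smallest value in the block
    hi: int  # zone-map entry: largest value in the block


def build_zone_map(column: List[int], block_rows: int = 4) -> List[Block]:
    """Split a column into fixed-size blocks and record each block's min and max."""
    blocks = []
    for start in range(0, len(column), block_rows):
        chunk = column[start:start + block_rows]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def range_scan(blocks: List[Block], low: int, high: int) -> Tuple[List[int], int]:
    """Return values in [low, high] plus the number of blocks actually read."""
    hits, blocks_read = [], 0
    for block in blocks:
        if block.hi < low or block.lo > high:
            continue  # the zone map proves no row here can match, so skip the read
        blocks_read += 1
        hits.extend(v for v in block.values if low <= v <= high)
    return hits, blocks_read


order_days = list(range(100, 116))  # loaded in date order: 16 rows, 4 blocks
blocks = build_zone_map(order_days)
print(range_scan(blocks, 105, 109))  # ([105, 106, 107, 108, 109], 2)
```

Pruning pays off only when values are clustered by block, as with dates in a table loaded in time order. Nobody had to declare partitions in advance, but on randomly ordered data nearly every block’s range would overlap the predicate and little would be skipped.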

You can’t do much with an analytic database unless you get data into it in the first place. Thus, performance in writing and loading data is important, and there are a number of architectural decisions that can help in that regard.

Row-based architecture. Column stores have obvious advantages for query, but in a naive column store implementation you have tremendous overhead, pulling the rows apart and storing them in many different columns. This is particularly the case for small, frequent updates.
Batched writes. The classic way to deal with column stores’ data writing challenges is to batch data in memory, then bang it to disk only occasionally. Hopefully the data is available seamlessly for query in RAM before the disk-banging occurs. This technique is by no means restricted to analytic and/or columnar use cases, but the single best-known example may be Vertica’s Read-Optimized Store (disk)/Write-Optimized Store (RAM) pairing.
Lack of indices and materialized views. Indexes and materialized views can help query speed, albeit at the cost of disk space and administrative effort. But maintaining them multiplies the difficulty of loading data in the first place.
Lockless or optimistic-locking concurrency model. Locking models suitable for OLTP can be ridiculous for analytic databases, blocking queries for no good reason. Fortunately, there are alternatives.
Append-only updating. When I/O volumes are high, append-only updating can give an important performance improvement over update-in-place, assuming you have sufficiently good algorithms for garbage-collection/clean-up. If I/O volumes are so low that you don’t care about the performance benefits, maybe it would be nice to have the “time-travel” feature that’s a potential byproduct of MVCC (Multi-Version Concurrency Control). Neither part of this observation applies solely to analytic DBMS.
Parallel load (no fat head). It’s not just query execution that can get bottlenecked at a “head node;” the same can happen with loads, batch or otherwise. That’s not a good thing. Thus, various parallel analytic DBMS vendors have set up ways to load data directly to the nodes where it’s going to be stored.
Specialized load nodes. Aster Data nCluster features specialized data loading nodes, although Aster has introduced a more conventional kind of parallel load as well.
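The batched-writes idea can be sketched as a small in-memory buffer in front of immutable sorted segments. This is a toy under stated assumptions (one integer column, a tiny hypothetical `flush_threshold`, Python lists standing in for disk files), not Vertica’s actual WOS/ROS implementation. It shows the two properties the text cares about: writes are cheap appends that get flushed in batches, and queries see buffered rows before they reach disk.

```python
import bisect


class BatchedColumnStore:
    """Toy write buffer (RAM) in front of immutable sorted segments (stand-ins for disk)."""

    def __init__(self, flush_threshold: int = 3):
        self.flush_threshold = flush_threshold  # hypothetical; real systems batch far more
        self.write_buffer = []  # unsorted, cheap appends
        self.segments = []      # each one a sorted list, written once, never updated

    def insert(self, value: int) -> None:
        self.write_buffer.append(value)
        if len(self.write_buffer) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        """Sort the buffered rows and write them out as one batch."""
        if self.write_buffer:
            self.segments.append(sorted(self.write_buffer))
            self.write_buffer = []

    def count_between(self, low: int, high: int) -> int:
        """Queries see both flushed and not-yet-flushed rows."""
        total = sum(1 for v in self.write_buffer if low <= v <= high)
        for seg in self.segments:
            # Sorted segments allow binary search instead of a full scan.
            total += bisect.bisect_right(seg, high) - bisect.bisect_left(seg, low)
        return total


store = BatchedColumnStore(flush_threshold=3)
for v in [5, 1, 9, 3, 7]:
    store.insert(v)
print(store.segments, store.write_buffer)  # [[1, 5, 9]] [3, 7]
print(store.count_between(3, 7))           # 3
```

Because segments are append-only, a real system would also need the garbage-collection/clean-up step mentioned under append-only updating, for example periodically merging small segments into larger ones.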

And of course, all of the above need to be implemented in the context of well-configured combinations of hardware, networking, and software.

Topics I know I’ve left out include advanced-analytics functionality, and in-memory processing (CEP or otherwise). Also missing are specifics of compression algorithms — or indeed of anything else. I’m sure there’s much else missing besides, so please point out the most glaring omissions in the comment thread below.

Related links:

Why even in-database scoring can be important (May, 2010).
Three big myths about MapReduce (October, 2009).
Why you might ever want to integrate MapReduce into your DBMS (August, 2008).
The future of data marts, specifically data mart spin-out. (June, 2009).
Netezza offers both zone maps and clustered base tables (June, 2010).
Oracle Exadata Storage Indexes are like Netezza zone maps (January, 2010).
How Netezza uses the FPGA (August, 2010).
Oracle is reluctant to do on-site Exadata POCs (February, 2009). As of the end of 2010, that doesn’t seem to have changed.
The Netezza and IBM DB2 approaches to compression (June, 2010, which is before IBM acquired Netezza).
The secret sauce to Rainstor’s extreme compression (May, 2009, when Rainstor was still called Clearpace).
The row-based/columnar distinction gets blurred, e.g. by Vertica FlexStore (August, 2009).
And by Greenplum (October, 2009). Also contains the observation that even row-style compression works better when data is stored columnarly.
And by Aster Data (September, 2010).
Teradata is particularly aggressive about query rewrite (August, 2009).
Netezza’s logless, lockless architecture (September, 2006).

Mike Stonebraker on “real column stores”

Curt Monash — Wed, 12 Jan 2011 13:43:07 +0000

Mike Stonebraker has a post up on Vertica’s blog trying to differentiate “real” from “pretend” column stores. (Edit: That post seems to have come back down, but as of 1/19 it can be found in Google Cache.) In essence, Mike argues that the One Right Way to design a column store is Vertica’s, a position that Daniel Abadi used to share but since has retreated from.

There are some good things about that post, and some not-so-good. The worst paragraph is probably

Several row-store vendors (including Oracle, Greenplum and Aster Data) now claim to be selling a column store. Obviously, this would require a complete rewrite of a DBMS to move from Figure 1 to Figure 2. Hence, none of the “pretenders” have actually done this. Instead all have implemented some aspects of column stores, and then claim to be the real thing. This blog defines what the “real enchilada” looks like, and how to tell it from the pretenders.

which I question on two levels. First, the vendors cited don’t actually claim to be selling a column store; thus, the whole premise of Mike’s post is incorrect. Second, neither those vendors nor Mike are really correct. What Mike is really doing is differentiating, in his opinion,* good column stores from bad or mediocre ones.

*That Mike’s opinion in that regard is neither (wholly) unreasonable nor (wholly) unbiased should go pretty much without saying.

A lesser oopsie is Mike’s criterion “IO-1”, which is written so confusingly that it technically seems not to be met by any of the vendors cited — including Vertica, which introduced Vertica FlexStore in mid-2009. And while I’m at it — Aster Data nCluster definitely meets criterion IO-3; I confirmed that by asking Tasso Argyros. Mike’s “No” for Sybase IQ on his criterion CPU-5 is also pretty questionable, given that Sybase IQ operates on compressed data until “the last possible moment.”

With the minor stuff cleared away, let’s get to the heart of the matter. Mike in essence concedes that multiple competitors can get the I/O benefits of a column store, even “aggressive compression.” However, he asserts that a designed-from-the-ground-up column store also can and should have major CPU advantages over row stores or row/column hybrids, for three reasons (as I paraphrase them):

CPU-5 Good column stores operate on compressed data, while other DBMS decompress first.
CPU-6 Good column stores benefit from storing data in multiple sort orders on disk, while other DBMS don’t.
CPU-4 Good column stores have column-oriented inner loops, while other DBMS don’t.

Actually, I have my doubts about the competitive-comparison aspect of CPU-5. I think multiple DBMS that have dictionary/token compression, for example, operate on tokenized data in memory. I’ll confess to not having a current list memorized as to who does or doesn’t, but anyhow it’s a solvable technical problem. Also, as Tasso points out, if you use a bitmapped index you’re surely operating on compressed data.

On the other hand, the goodness of CPU-5 functionality is beyond reasonable dispute. For many queries (albeit by no means all), operating on compressed data is a major advantage.
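For readers who want to see what “operating on compressed data” means at the simplest level, here’s a sketch using order-preserving dictionary encoding, one of the compression styles mentioned above. It is an illustrative toy with invented function names, not how any particular vendor does it. The predicate constant is translated into a token once. The scan then compares small integers and never decompresses the column.

```python
import bisect


def dictionary_encode(values):
    # Assign each distinct value an integer token. Sorting the dictionary
    # makes the encoding order-preserving, so range predicates work on tokens too.
    dictionary = sorted(set(values))
    token_of = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [token_of[v] for v in values]


def count_equal_compressed(dictionary, tokens, value):
    # Look the constant up once; if it isn't in the dictionary, nothing matches.
    tok = bisect.bisect_left(dictionary, value)
    if tok == len(dictionary) or dictionary[tok] != value:
        return 0
    return sum(1 for t in tokens if t == tok)


def count_between_compressed(dictionary, tokens, low, high):
    # Value range [low, high] maps to the half-open token range [lo_tok, hi_tok).
    lo_tok = bisect.bisect_left(dictionary, low)
    hi_tok = bisect.bisect_right(dictionary, high)
    return sum(1 for t in tokens if lo_tok <= t < hi_tok)


states = ["MA", "NY", "MA", "CA", "NY", "MA"]
dictionary, tokens = dictionary_encode(states)
print(count_equal_compressed(dictionary, tokens, "MA"))
```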

For CPU-6, things are the other way around. Vertica is probably alone in the flexibility of how it orders columns on disk. Any other system I can think of is generally restricted to two storage orders at most — e.g., some kind of universal ID/row-ID, plus a sort on the actual values of the column. But is this a significant advantage at all?

Competitors like to argue that storing even in sort-by-value order is not advantageous at all, because of the overhead at data loading time and the questionable number of queries that benefit. That extreme seems overstated. Why would the overhead be higher than that of, for example, maintaining a B-tree index? And surely queries try to pick out specific values and/or value ranges, for a significant fraction of all columns.

On the other hand, total flexibility in storage sort order might require yet more overhead, and would also be of rarer benefit. And while Vertica claims to have fixed a prior drawback to the feature — administrative complexity — in Vertica 4.0, I don’t have hard facts as to how complete the fix really is.
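To illustrate the trade-off in the CPU-6 discussion, here’s a hypothetical Python sketch of one column kept in two storage orders: row-ID order (the plain list) plus a sort-by-value copy with a mapping back to row IDs. Range predicates become a pair of binary searches. The price is that every load also has to insert into the sorted copy, which is the loading overhead competitors complain about. Function names are invented for the example.

```python
import bisect


def build_sorted_projection(column):
    # Second storage order: row IDs sorted by the column's value.
    order = sorted(range(len(column)), key=column.__getitem__)
    values = [column[rid] for rid in order]
    return values, order


def append_row(column, projection, value):
    # Load-time cost: the row-ID-order column is a cheap append, but the
    # sorted copy needs an insert at the right position.
    values, row_ids = projection
    rid = len(column)
    column.append(value)
    pos = bisect.bisect_right(values, value)
    values.insert(pos, value)
    row_ids.insert(pos, rid)


def row_ids_in_range(projection, low, high):
    # Query-time benefit: two binary searches find the qualifying slice
    # without scanning the whole column.
    values, row_ids = projection
    start = bisect.bisect_left(values, low)
    end = bisect.bisect_right(values, high)
    return sorted(row_ids[start:end])


prices = [30, 10, 50, 20, 40]
projection = build_sorted_projection(prices)
append_row(prices, projection, 25)
print(row_ids_in_range(projection, 15, 35))
```

Each extra sort order adds another structure like `projection` to maintain on every load, which is why total flexibility in storage order costs more than one or two orders.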

As for the CPU-4 inner loop point — I must confess to not knowing much about it.

A few notes from XLDB 4

Curt Monash — Sun, 10 Oct 2010 17:49:03 +0000

As much as I believe in the XLDB conferences, I only found time to go to (a big) part of one day of XLDB 4 myself. In general:

XLDB 4 had a good crowd, including Phil Bernstein (quiet), Mike Stonebraker (not quiet), Martin Kersten (ditto), Luke Lonergan (ditto), Todd Walter (almost unrecognizable without his usual cowboy gear), Oliver Ratzesberger, and a bunch of actual science types.
XLDB 4 had one weakness — panels with lots of participants, but only a single microphone among them. That tends to make for serial declamations more than true interactive discussion, at least until the audience starts chiming in, which thankfully it tends to eventually do. (I had the same problem in spades while moderating the Boston Big Data Summit panel last year; at least at XLDB 4 nobody was TRYING to filibuster.)

My notes have unfortunately disappeared, but from memory:

Mike Stonebraker asserted that SciDB outperforms sharded MySQL by two orders of magnitude for some classes of scientific application. One of the big reasons was that SciDB lets you overlap partitions, so that for any feature you want to extract, you can be confident there’s at least one partition that actually contains it.
I chatted with Peter Breunig of Chevron about analytic issues in the oil & gas industry. I got the impression:
- Refineries are generally well-instrumented with sensors.
- Oil wells may not be, especially the less valuable/lower producing ones.
- He’d love to scatter passive sensors all around, waiting for natural tremors — as opposed to just geologist-set explosions — to provide more insight into what’s under the ground.
- 50-100 TB geological data sets are common. Processing them takes 2-3 weeks. As the technology gets better, so do the results (rather than the time being shortened).
- All this suggests that there’s a huge need for better technology in reservoir analysis.
- His other big unmet analytic desire is refinery simulation.
Kevin Winsen talked about the proposed radio astronomy project ASKAP, which will have raw data volumes that make the LSST’s look small. (More precisely, ASKAP is the name proposed by Australia, one of the two finalists for the location; South Africa presumably has a different name for it.) 8 petabytes/day were mentioned, although most of this will be rapidly discarded. That could be the largest unclassified data acquisition rate out there, although it’s known that there’s a classified one at >10 PB/day (image data).
Health care researchers repeatedly complained that privacy regulations get in the way of them using clinical data for medical research. Just more grist for my “HIPAA must die so that people can live” mill.
Mike Stonebraker is pushing the idea of a “science benchmark.” (A paper on same has been posted.) The idea is that the existence of said benchmark should provide a spur for DBMS vendors to make their products run faster for scientific purposes, in line with the supposed salutary effects of TPC-A, TPC-B, and TPC-C. Notwithstanding that attendees included Oracle, Microsoft, EMC/Greenplum, Teradata, and Aster Data — with Greenplum, IBM, and Aster Data also being sponsors — I am skeptical because:
- Leaders of the XLDB effort seem convinced that only open source DBMS can meet their needs.
- They further characterize scientific DBMS as a zero billion dollars/year market.
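SciDB’s overlapping partitions can be illustrated with a one-dimensional simplification. This hypothetical sketch assumes fixed-size chunks that each also store a few extra cells past their right edge. SciDB’s real arrays are multidimensional, and its actual chunking syntax isn’t shown here. With an overlap of k cells, any feature spanning at most k cells lies entirely inside at least one chunk. Without overlap, a feature that straddles a chunk boundary gets split.

```python
def overlapping_chunks(n, chunk_size, overlap):
    # Each chunk owns [start, start + chunk_size) but also stores `overlap`
    # extra cells past its right edge, clipped at the end of the array.
    return [(start, min(start + chunk_size + overlap, n))
            for start in range(0, n, chunk_size)]


def chunk_containing(chunks, feature_start, feature_end):
    # Return the index of the first chunk whose stored range fully contains
    # the half-open feature [feature_start, feature_end), or None.
    for i, (lo, hi) in enumerate(chunks):
        if lo <= feature_start and feature_end <= hi:
            return i
    return None


with_overlap = overlapping_chunks(100, 25, 5)
no_overlap = overlapping_chunks(100, 25, 0)
# A 6-cell feature straddling the boundary at cell 25.
print(chunk_containing(with_overlap, 22, 28), chunk_containing(no_overlap, 22, 28))
```

The cost is storing the overlap cells twice. The benefit is that feature extraction can run chunk by chunk with no cross-partition stitching.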

Details and analysis of the VoltDB argument

Curt Monash — Wed, 30 Jun 2010 14:37:37 +0000

Todd Hoff (High Scalability blog) posted a lengthy examination of the case and use cases for VoltDB. That excellent post, in turn, is based on a Mike Stonebraker* webinar for VoltDB, for which the slide deck is happily available. It’s all nicely consistent with what I wrote about VoltDB last month, in connection with its launch.

*Who, in Todd’s apt description, is “the sword wielding Johnny Appleseed of the database world”.

Todd wrote:

What matters to VoltDB is: speed at scale, speed at scale, speed at scale, SQL, and ACID. If that matches your priorities then you’ll probably be happy. Otherwise, as you’ll see, everything is sacrificed for speed at scale and what is sacrificed is often ease of use, generality, and error checking. It’s likely we’ll see ease of use improve over time, but for now it looks like rough going, unless of course, you are going for speed at scale.

Indeed.

Todd’s list of interesting VoltDB features is also pretty good, namely

Main-memory storage.

Run transactions to completion, single-threaded, in timestamp order.

Replicas.

Tables are partitioned across multiple servers.

Stored procedures, written in Java, are the unit of transaction.

A limited subset of SQL ’99 is supported.

Design a schema and workflow to use single-sited procedures.

Challenging operations model.

No WAN support.

OLAP is purposefully kept separate.

VoltDB finally launches

Curt Monash — Tue, 25 May 2010 07:15:04 +0000

VoltDB is finally launching today. As is common for companies in sectors I write about, VoltDB — or just “Volt” — has discovered the virtues of embargoes that end 12:01 am. Let’s go straight to the technical highlights:

VoltDB is based on the H-Store technology, which I wrote about in February, 2009. Most of what I said about H-Store then applies to VoltDB today.
VoltDB is a no-apologies ACID relational DBMS, which runs entirely in RAM.
VoltDB has rather limited SQL. (One example: VoltDB can’t do SUMs in SQL.) However, VoltDB guy Tim Callaghan (Mark Callaghan’s lesser-known but nonetheless smart brother) asserts that if you code up the missing functionality, it’s almost as fast as if it were present in the DBMS to begin with, because there’s no added I/O from the handoff between the DBMS and the procedural code. (The data’s in RAM one way or the other.)
VoltDB’s Big Conceptual Performance Story is that it does away with most locks, latches, logs, etc., and also most context switching.
In particular, you’re supposed to partition your data and architect your application so that most transactions execute on a single core. When you can do that, you get VoltDB’s performance benefits. To the extent you can’t, you’re in two-phase-commit performance land. (More precisely, you’re doing 2PC for multi-core writes, which is surely a major reason that multi-core reads are a lot faster in VoltDB than multi-core writes.)
VoltDB has a little less than one DBMS thread per core. When the data partitioning works as it should, you execute a complete transaction in that single thread. Poof. No context switching.
A transaction in VoltDB is a Java stored procedure. (The early idea of Ruby on Rails in lieu of the Java/SQL combo didn’t hold up performance-wise.)
Solid-state memory is not a viable alternative to RAM for VoltDB. Too slow.
Instead, VoltDB lets you snapshot data to disk at tunable intervals. “Continuous” is one of the options, wherein a new snapshot starts being made as soon as the last one completes.
In addition, VoltDB will also spool a kind of transaction log to the target of your choice. (Obvious choice: An analytic DBMS such as Vertica, but there’s no such connectivity partnership actually in place at this time.)
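The single-threaded, partitioned execution model described in the list above can be sketched as follows. This toy rests on stated assumptions: integer keys, modulo partitioning, Python functions standing in for Java stored procedures, and one loop standing in for one thread per partition. None of the names are VoltDB’s. The point is only that each partition runs its queued transactions one at a time, in timestamp order, each to completion, so there is nothing to lock.

```python
from collections import deque


class SerialPartitionedStore:
    """Toy model of single-threaded partitions (hypothetical; not VoltDB's API)."""

    def __init__(self, n_partitions):
        self.data = [dict() for _ in range(n_partitions)]     # one key/value map per partition
        self.queues = [deque() for _ in range(n_partitions)]  # pending work, in timestamp order
        self.next_ts = 0

    def partition_for(self, key):
        # Integer keys only, to keep the sketch deterministic.
        return key % len(self.data)

    def submit(self, key, procedure):
        # Stamp the transaction and queue it at the partition that owns `key`.
        self.next_ts += 1
        self.queues[self.partition_for(key)].append((self.next_ts, procedure))
        return self.next_ts

    def run(self):
        # Each partition drains its own queue serially. In a real system each
        # partition would have its own thread; here they simply take turns.
        # A partition's transactions never interleave, so no locks are taken.
        results = {}
        for p, queue in enumerate(self.queues):
            while queue:
                ts, procedure = queue.popleft()
                results[ts] = procedure(self.data[p])
        return results


def deposit(account, amount):
    # Stand-in for a stored procedure: a function over one partition's data.
    def procedure(data):
        data[account] = data.get(account, 0) + amount
        return data[account]
    return procedure


store = SerialPartitionedStore(4)
store.submit(7, deposit(7, 100))
store.submit(7, deposit(7, -30))
store.submit(2, deposit(2, 5))
results = store.run()
print(results)
```

A transaction whose keys hash to different partitions doesn’t fit this model at all. That is exactly the case that falls back to the slower two-phase-commit path.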

I should also note that when Tim Callaghan described architectural options to get around 2PC performance issues, they sounded a lot like eventual consistency. Maybe tunable RYW consistency isn’t in the cards, but at least there’s a NoSQL-like possibility with VoltDB.

VoltDB’s open source strategy is:

VoltDB will be open sourced.
Community VoltDB will be GPLed. Professional Edition VoltDB has a non-GPL license.
The VoltDB Professional Edition won’t start out with features beyond the Community Edition ones, but will gain such later on. I didn’t get the sense the plans for those features were completely baked yet, but ideas mentioned included:
- Management/monitoring tools.
- Integration with expensive closed-source enterprise software products, such as ones in the management/monitoring area.
- Yet more “extreme”/edge-case performance.
Before VoltDB decided for sure that it wasn’t selling licenses, it sold a license to Getco, which also seems to be an investor in the company.

VoltDB had a beta test with about 150 participants. None is in production yet, although at least a few are clearly headed there. Most VoltDB beta testers are in some kind of online business, with a particular concentration in everybody’s new favorite market, online gaming. Most of the rest are in investment/trading — a major target market for at least three different Mike Stonebraker companies — and a few are in telecom. VoltDB assures me that some of the beta users are companies one actually has heard of before, but VoltDB is not in a position to name any of those.

VoltDB is not ideally suited for a classic order management system, since you’d want to partition on both CustomerID and SKU, the latter because you’d constantly be updating inventory stock levels. However, this argument doesn’t apply in the case of virtual goods. Virtual goods that are sold for real money — and hence need ACID levels of transaction integrity — are thus a clear target market for VoltDB. (The example that came up was in, you guessed it, online gaming.) The other interesting use case that Tim highlighted was low-latency analytics/ELT. For reasons I didn’t totally grasp, Tim likes to call this “Stateful ELT.” (Given that the data goes into the VoltDB database before much else happens to it, I’m pretty sure I heard “ELT” correctly. But I guess I might have been mishearing “ETL”.)
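The order-management point can be made concrete with a small hypothetical check of how many partitions an order touches. It assumes customers are partitioned by customer ID and inventory by SKU. Following the virtual-goods argument, it also assumes virtual goods need no inventory update. Modulo partitioning and the order dictionary layout are illustrative assumptions.

```python
def partition_of(key, n_partitions):
    # Hypothetical partitioning function: integer keys, simple modulo.
    return key % n_partitions


def partitions_touched(order, n_partitions):
    # Every order updates the customer's row. A physical-goods order also
    # decrements an inventory row per SKU, and inventory is partitioned by SKU.
    touched = {partition_of(order["customer_id"], n_partitions)}
    if not order["virtual"]:
        for sku in order["skus"]:
            touched.add(partition_of(sku, n_partitions))
    return touched


def is_single_partition(order, n_partitions):
    # True: the order can run as one single-threaded transaction.
    # False: it needs a multi-partition (2PC-style) transaction.
    return len(partitions_touched(order, n_partitions)) == 1


physical = {"customer_id": 42, "skus": [1003, 2005], "virtual": False}
virtual = {"customer_id": 42, "skus": [1003, 2005], "virtual": True}
print(is_single_partition(physical, 8), is_single_partition(virtual, 8))
```

Occasionally a physical order will happen to land on a single partition. With many partitions and multi-item orders, though, most won’t, which is why the classic order management system is a poor fit.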

VoltDB company highlights include:

VoltDB has about a dozen employees, all but two of whom are technical. (However, I’m not sure they’re counting Andy Ellicott against the two. But then, last I heard he wasn’t full time at VoltDB.)
VoltDB’s venture funding status is, if I may paraphrase, “Mumble mumble.”
Although long separate from Vertica, VoltDB is still located in Vertica’s offices.