Mike Stonebraker recently kicked off some discussion about desirable architectural features of a columnar analytic DBMS. Let’s expand the conversation to cover desirable architectural characteristics of analytic DBMS in general. But first, a few housekeeping notes:
- This is a very long post.
- Even so, to keep it somewhat manageable, I’ve cut corners on completeness. Most notably, two important areas are entirely deferred to future posts — advanced-analytics-specific architecture, and in-memory processing (including CEP).
- The subjects here are not strictly parallel. The distinction between major add-on modules and “turtles all the way down” core architectural choices is rarely crystal-clear — Mike Stonebraker’s recent post notwithstanding — and I’ve mixed subjects of varying degrees of “fundamentalness” pretty freely.
- There’s a long list of links at the end, pointing at posts that help explain or give examples of specific features named in the body of the text, somewhat like unnumbered footnotes.
OK. In my opinion, the four drop-dead requirements for an analytic DBMS are:
- Relational/SQL support. That’s what gives you great flexibility in constructing queries, as well as connectivity to a vast number of tools. In a few cases, I guess MDX might suffice as an alternative.
- Sufficiently great query performance, on the queries you’re actually going to run, for however many concurrent users you actually will have.
- Sufficiently high data loading throughput and sufficiently low data loading latency.
- Sufficiently favorable TCO (Total Cost of Ownership), all things considered, where “all things” at a minimum includes software license, software maintenance, hardware, power, people costs for administration, and people costs for development.
Depending on your use case, you might have additional make-or-break requirements. Possible areas include:
- Additional query functionality, of course with good performance. Specific examples include:
- ANSI-standard SQL features that are not universally supported (e.g. windowing).
- Geospatial datatype support.
- Further high-performing integrated analytics, such as:
- Data mining/machine learning modeling and scoring.
- Other mathematical functions, such as linear algebra, optimization, or Monte Carlo simulation.
- Extensibility via MapReduce and/or sufficiently robust user-defined function (UDF) capabilities.
- Platform support that matches your needs.
- Security, auditability, and/or high-performance encryption.
Other possibly important features — but ones that would usually go on “nice to have” rather than “must have” lists — include:
- Yet more query functionality, in areas such as:
- Non-standard SQL extensions (e.g. temporal ones).
- Specific prepackaged UDFs.
- Cross-column text search.
- Nice administrative tools, in areas such as:
- Single-query performance/optimization.
- Workload management.
- Data mart spin-out.
So what kinds of architectural choices (or major features) should one look to in order to support such features? On the performance side there are many candidates, including:
- Specialized indexes, more commonly found in older DBMS. Leading examples include star and especially bitmap indexes, both of which I was already writing about back in the 1990s. Ditto materialized views, which aren’t exactly indexes, but are closely related.
- Partition elimination. Single- or multi-level range partitioning can cause whole regions of the database never to be checked in a particular query’s evaluation. (That’s a good thing.) The functionality popularized by Netezza as zone maps does something similar, without requiring the partitions to be chosen in advance.
- Scan-friendliness. If a query runs a long time, it may include a lot of (full or partial) table scanning. Assuming you rely on spinning disk — as opposed to solid-state memory — one way to improve your sequential-scan throughput far above your random-read throughput is to support large block sizes.
- Parallelism. It’s possible to screw up even multi-core parallelism, but the big issue is multi-server. In particular:
- An analytic DBMS must avoid a “fat head” bottleneck, either because there is no head node at all directing things, or else because data redistribution algorithms are sufficiently mature that they don’t overload it. (In naive parallel DBMS implementations, intermediate query results get sent back to the head node to be, for example, JOINed together. This is not a good thing.)
- Multiple analytic DBMS vendors have chosen to develop custom data transfer protocols, for more reliable performance than they can get from TCP/IP. Examples include Teradata, Netezza, and ParAccel.
- Predicate pushdown. Predicate pushdown takes several forms, in all cases having the goal of executing certain simpler database operations — predicate evaluations — close to the data, thus minimizing I/O or upstream processing.
- Netezza famously offloads the first part of predicate evaluation to FPGA (Field-Programmable Gate Array) chips.
- At least in theory, I like the Exadata form of node specialization, in which a tier of server nodes does the first part of the processing, with the results being sent to a second upstream database tier. But it’s not obvious that any RDBMS vendor has done a great job with it. Oracle is famously secretive about Exadata’s track record, and as of this writing apparently still resists on-site benchmarks. Calpont hasn’t accomplished much. And MarkLogic of course doesn’t sell an RDBMS.
- There’s reason to think predicate pushdown would help exploit flash memory, although I’m not sure vendors are moving in a direction that will let us find out.
- Columnar data storage. Columnar storage is pretty much the ultimate in predicate pushdown, and advantageous in many analytic query scenarios. (Main exception: When you’re bringing back the majority of a row anyway, you might as well fetch the thing pre-assembled.) As Mike Stonebraker points out, columnar storage should not incur serious row-ID overhead, and ideally should be available for multiple sort orders on each column.
- Compression. This, rightly, is another of Mike Stonebraker’s favorite features. Database compression is hugely important, for I/O and in silicon alike. (And it can also save money on storage.) There are a broad variety of compression techniques, suited for different kinds of data, different kinds of queries, or different points on the storage saving/decompression performance tradeoff spectrum.
- Flexible storage. Not all data is best stored the same way, even if it’s in the same database. Some is destined for columnar-friendly use cases, some for whole-row access. Some is compressed ideally by one technique, some by another. And so on. Some database managers do a good job of letting different parts of the database (even within the same table) be stored in different ways.
- Query pipelining. There are a lot of steps to query execution, in both the fine-grained sense (a whole lot of rows to process) and the coarse-grained one (all but the simplest execution plans feature a number of distinct operations). FPGA-based vendors XtremeData and Kickfire used the innate parallelism of an FPGA to pipeline query execution. Kickfire failed, and XtremeData hasn’t sold many systems, but that doesn’t mean it isn’t a good idea. Kickfire’s assets were sold to Teradata. Meanwhile, VectorWise’s very name speaks to its (Intel-based) vector processing architecture.
- Result set reuse. Instead of mixing together different steps of the same query, how about mixing together the same step in different queries, so that you don’t have to repeat it? As a simple example, suppose two queries need to do the same table scan. Well then, it would be nice to only do the scan once. In most cases, query workloads are too diverse for result set reuse of that kind to be very important; still, it’s a cool feature, which Teradata calls synchronized scan.
- Suitably optimized execution engine – column, row, whatever. (This is Mike Stonebraker’s “inner loop” point generalized.)
- Well-factored query optimizer. No matter what, it’s good for a query optimizer to have been through a few rounds of Bottleneck Whack-A-Mole. Beyond that, an optimizer with sufficiently convenient hooks can have cool and occasionally valuable features such as:
- On-the-fly query re-planning. Do part of the query, rerun column statistics, and re-plan the query if appropriate.
- Not-so-black-box optimization. Work interactively with the DBA to find the best query plan.
- Query rewriting. Any decent optimizer will take a complex query and produce an execution plan that in some cases looks quite unlike the original query. Some optimizers go further in rewriting the query first, essentially to psych themselves into coming up with a better plan.
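To make the zone-map idea above concrete, here’s a toy sketch in Python. It is hypothetical, not any vendor’s actual implementation: each storage block records the min and max of a column, and a range predicate can then skip every block whose min/max interval rules out a match.

```python
# Toy zone-map-style block skipping (hypothetical sketch, not
# Netezza's or anyone else's real format). Per-block min/max
# metadata lets a scan skip blocks that cannot contain matches.

def build_zone_map(values, block_size):
    """Split values into fixed-size blocks, recording per-block min/max."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [(min(b), max(b), b) for b in blocks]

def scan_greater_than(zone_map, threshold):
    """Return matching values, reading only blocks that might qualify."""
    blocks_read = 0
    out = []
    for lo, hi, block in zone_map:
        if hi <= threshold:          # whole block excluded: skip the I/O
            continue
        blocks_read += 1
        out.extend(v for v in block if v > threshold)
    return out, blocks_read

# Roughly time-ordered data correlates well with range predicates,
# so most blocks get skipped.
zm = build_zone_map(list(range(1000)), block_size=100)
rows, reads = scan_greater_than(zm, 900)
```

Note that, unlike range partitioning, nothing here had to be declared in advance; the min/max metadata is simply computed as blocks are written.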
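As one concrete illustration of the compression point: run-length encoding (RLE) suits sorted or low-cardinality columns, and some predicates can be evaluated on the compressed form without decompressing at all. A toy Python sketch, not any particular vendor’s format:

```python
# Toy run-length encoding (RLE): store (value, count) pairs instead
# of repeating values. Works best on sorted or low-cardinality
# columns -- exactly what columnar storage tends to produce.

def rle_encode(column):
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1         # extend the current run
        else:
            runs.append([v, 1])      # start a new run
    return runs

def rle_decode(runs):
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out

def rle_count_equal(runs, target):
    """Count rows equal to target directly on the compressed runs."""
    return sum(n for v, n in runs if v == target)

col = ["CA"] * 5 + ["NY"] * 3 + ["TX"] * 4
runs = rle_encode(col)
```

The `rle_count_equal` helper is the interesting part: it touches three run headers instead of twelve rows, which is the “in silicon” benefit of compression, not just the storage saving.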
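The result-set-reuse idea can also be sketched simply. The following Python toy is in the spirit of a synchronized scan, though it is a hypothetical simplification rather than Teradata’s actual mechanism: queries needing the same scan register their predicates, and the table is read once for all of them.

```python
# Sketch of result set reuse via a shared table scan (hypothetical,
# in the spirit of synchronized scan). Several queries that need the
# same scan register predicates; one pass over the data serves all.

def shared_scan(table, predicates):
    """One pass over the table serves every registered query."""
    results = [[] for _ in predicates]
    rows_scanned = 0
    for row in table:                # the single, shared scan
        rows_scanned += 1
        for i, pred in enumerate(predicates):
            if pred(row):
                results[i].append(row)
    return results, rows_scanned

table = list(range(10))
(evens, big), scanned = shared_scan(
    table,
    [lambda r: r % 2 == 0,           # query 1: even rows
     lambda r: r > 6],               # query 2: rows above 6
)
```

Running both queries independently would scan twenty rows; the shared scan reads each row once.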
You can’t do much with an analytic database unless you get data into it in the first place. Thus, performance in writing and loading data is important, and there are a number of architectural decisions that can help in those regards.
- Row-based architecture. Column stores have obvious advantages for query, but in a naive column store implementation you have tremendous overhead, pulling the rows apart and storing them in many different columns. This is particularly the case for small, frequent updates.
- Batched writes. The classic way to deal with column stores’ data writing challenges is to batch data in memory, then bang it to disk only occasionally. Hopefully the data is available seamlessly for query in RAM before the disk-banging occurs. This technique is by no means restricted to analytic and/or columnar use cases, but the single best-known example may be Vertica’s Read-Optimized Store (disk)/Write-Optimized Store (RAM) pairing.
- Lack of indexes and materialized views. Indexes and materialized views can help query speed, albeit at the cost of disk space and administrative effort. But maintaining them multiplies the difficulty of loading data in the first place.
- Lockless or optimistic-locking concurrency model. Locking models suitable for OLTP can be ridiculous for analytic databases, blocking queries for no good reason. Fortunately, there are alternatives.
- Append-only updating. When I/O volumes are high, append-only updating can give an important performance improvement over update-in-place, assuming you have sufficiently good algorithms for garbage-collection/clean-up. If I/O volumes are so low that you don’t care about the performance benefits, maybe it would be nice to have the “time-travel” feature that’s a potential byproduct of MVCC (Multi-Version Concurrency Control). Neither part of this observation applies solely to analytic DBMS.
- Parallel load (no fat head). It’s not just query execution that can get bottlenecked at a “head node;” the same can happen with loads, batch or otherwise. That’s not a good thing. Thus, various parallel analytic DBMS vendors have set up ways to load data directly to the nodes where it’s going to be stored.
- Specialized load nodes. Aster Data nCluster features specialized data loading nodes, although Aster has introduced a more conventional kind of parallel load as well.
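The batched-writes pattern above can be sketched in a few lines of Python. This is a hypothetical simplification in the spirit of Vertica’s WOS/ROS pairing, not its actual implementation: inserts land in an in-memory write-optimized store, queries see the union of memory and “disk,” and a flush periodically turns many small writes into one large sequential one.

```python
# Toy batched-write store (hypothetical sketch in the WOS/ROS spirit).
# Inserts go to an in-memory write-optimized store (wos); a flush
# merges batches into a sorted read-optimized store (ros); queries
# seamlessly see data still sitting in RAM.

class BatchedStore:
    def __init__(self, flush_threshold):
        self.wos = []                # in-memory, write-optimized
        self.ros = []                # "on disk", read-optimized (kept sorted)
        self.flush_threshold = flush_threshold
        self.disk_writes = 0         # count of large sequential writes

    def insert(self, row):
        self.wos.append(row)         # cheap memory append, no disk I/O
        if len(self.wos) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # One large sequential write instead of many small ones.
        self.ros = sorted(self.ros + self.wos)
        self.wos = []
        self.disk_writes += 1

    def query_all(self):
        # Reads see not-yet-flushed rows alongside flushed ones.
        return sorted(self.ros + self.wos)

store = BatchedStore(flush_threshold=100)
for i in range(250):
    store.insert(i)
```

After 250 inserts the store has done only two “disk” writes, with 50 rows still queryable from memory, which is the whole point of the technique.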
And of course, all of the above need to be implemented in the context of well-configured combinations of hardware, networking, and software.
Topics I know I’ve left out include advanced-analytics functionality, and in-memory processing (CEP or otherwise). Also missing are specifics of compression algorithms — or indeed of anything else. I’m sure there’s much else missing besides, so please point out the most glaring omissions in the comment thread below.
- Why even in-database scoring can be important (May, 2010).
- Three big myths about MapReduce (October, 2009).
- Why you might ever want to integrate MapReduce into your DBMS (August, 2008).
- The future of data marts, specifically data mart spin-out (June, 2009).
- Netezza offers both zone maps and clustered base tables (June, 2010).
- Oracle Exadata Storage Indexes are like Netezza zone maps (January, 2010).
- How Netezza uses the FPGA (August, 2010).
- Oracle is reluctant to do on-site Exadata POCs (February, 2009). As of the end of 2010, that doesn’t seem to have changed.
- The Netezza and IBM DB2 approaches to compression (June, 2010, which is before IBM acquired Netezza).
- The secret sauce to Rainstor’s extreme compression (May, 2009, when Rainstor was still called Clearpace).
- The row-based/columnar distinction gets blurred, e.g. by Vertica FlexStore (August, 2009).
- And by Greenplum (October, 2009). Also contains the observation that even row-style compression works better when data is stored columnarly.
- And by Aster Data (September, 2010).
- Teradata is particularly aggressive about query rewrite (August, 2009).
- Netezza’s logless, lockless architecture (September, 2006).