A few months ago, Aster Data CEO Mayank Bawa commented to me on his surprise at how “profound” the relationships are between design choices in one part of a data warehouse DBMS and choices in other parts. The word choice was all Mayank’s, but the underlying thought is one I’ve long shared, and one that I’m certain the architects of many analytic DBMS share as well.
For that matter, the observation is no doubt true in many other product categories. But in the analytic database management arena, where there are literally 10-20+ competitors with different, non-stupid approaches, it seems particularly apt. Here are some examples of what I mean.
Hash distribution. In shared-nothing or shared-not-very-much database architectures, multiple processors pull data off disk in parallel. Ideally, each long-running query retrieves almost exactly the same amount of data at each node. That way, every node finishes at the same time, with no wasteful waiting.
Consequently, data should be distributed more or less randomly across the nodes. That can be done through “round-robin” allocation — each node takes a turn in strict order receiving new records or blocks. Or it can be done by hashing on a particular key — in essence, by assigning data to different disks depending on the value in some particular field or combination of fields.
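Here’s a minimal Python sketch of the two placement strategies (the function names and the 4-node cluster size are illustrative, not any vendor’s API):

```python
import hashlib
from itertools import cycle

NUM_NODES = 4  # illustrative cluster size

def round_robin_assign(records):
    """Deal records out to nodes in strict rotation: 0, 1, 2, 3, 0, 1, ..."""
    nodes = cycle(range(NUM_NODES))
    return [(next(nodes), rec) for rec in records]

def hash_assign(records, key):
    """Place each record on the node determined by hashing its distribution
    key. Placement is a pure function of the key, so equal keys co-locate."""
    def node_for(value):
        return int(hashlib.md5(str(value).encode()).hexdigest(), 16) % NUM_NODES
    return [(node_for(rec[key]), rec) for rec in records]

orders = [{"order_id": i, "customer_id": i % 10} for i in range(12)]
print(round_robin_assign(orders)[:4])          # nodes 0, 1, 2, 3 in turn
print(hash_assign(orders, "customer_id")[:4])  # node depends only on the key
```

Both approaches spread data evenly in the aggregate; the difference is that hash placement makes a record’s location predictable from its key, which is exactly what the next point exploits.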
Hash distribution is a wonderful optimization. For most large tables, there’s an obvious join key that will be relevant to a significant fraction of all long-running queries. Pre-hashing on that key saves a huge step in the execution of hash joins involving it, and hence can provide a significant reduction in the total query processing workload. Nor is this benefit confined to single-fact-table or single-primary-key schemas. When different kinds of data are stored in the same warehouse, each large fact table can be hash-distributed on its own key.
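To see the saved step concretely, here’s a hypothetical continuation of the sketch above: distribute two tables on their shared join key, and every node can hash-join its own slices locally, because matching rows are guaranteed to land on the same node.

```python
import hashlib
from collections import defaultdict

NUM_NODES = 4  # same illustrative cluster as above

def node_for(value):
    """Hash placement again: the target node is a pure function of the key."""
    return int(hashlib.md5(str(value).encode()).hexdigest(), 16) % NUM_NODES

customers = [{"customer_id": c, "region": f"R{c % 3}"} for c in range(10)]
orders = [{"order_id": i, "customer_id": i % 10} for i in range(12)]

# Distribute both tables on the shared join key.
cust_slices, ord_slices = defaultdict(list), defaultdict(list)
for c in customers:
    cust_slices[node_for(c["customer_id"])].append(c)
for o in orders:
    ord_slices[node_for(o["customer_id"])].append(o)

# Each node hash-joins only its own slices; rows with equal keys are
# guaranteed co-located, so nothing crosses the interconnect at query time.
result = []
for node in range(NUM_NODES):
    lookup = {c["customer_id"]: c["region"] for c in cust_slices[node]}
    result += [(o["order_id"], lookup[o["customer_id"]])
               for o in ord_slices[node] if o["customer_id"] in lookup]

assert len(result) == len(orders)  # every order joined without any data motion
```

Without the matching distribution, one table (or both) would have to be redistributed across the interconnect before the join could run at all.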
For almost all databases on almost all shared-nothing vendors’ systems, hash distribution is the way to go. Even so, a couple of products don’t bother supporting it. Oracle Exadata isn’t going to perform joins of that kind until data has moved from the storage tier to the database tier anyway, so hash distribution offers no benefit in Exadata’s multi-tier architecture. Kognitio, while lacking such a clean proof that hash distribution is beside the point, believes that in its silicon-centric approach the costs of violating strict randomness outweigh the benefits.
Indexing alternatives. More generally, analytic DBMS differ from OLTP DBMS in that they’re optimized to run more table scans, and fewer updates and pinpoint queries. I’ve written about that many times, even coining the phrase index-light to encapsulate the story. The general idea is that if you’re retrieving a lot of rows per query, it becomes inefficient to keep repositioning the disk head just to fetch the rows you want. You get a lot more bytes/second doing sequential reads than random ones, so if a sufficiently large fraction of the rows are ones you actually want, it’s better to just scan them all.
If you’re going to follow an extreme form of that approach (e.g., Netezza, DATAllegro), you might as well have huge block sizes for your data (1 megabyte or more). If you think indexes of various kinds will actually be useful a reasonable fraction of the time, you might go with smaller blocks, such as the 128K that Teradata and HP (Neoview) favor.
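Some back-of-envelope arithmetic shows where the crossover lies. With made-up but plausible disk numbers (100 MB/s sequential throughput, 5 ms per random seek), a quick Python sketch compares a full scan of a 100 GB table against index-driven random reads at various selectivities and block sizes:

```python
SEQ_MBPS = 100.0      # assumed sequential throughput, MB/s (illustrative)
SEEK_MS = 5.0         # assumed seek + rotational latency per random read, ms
TABLE_MB = 100_000    # a 100 GB table
ROW_KB = 1.0          # ~100 million 1 KB rows

def scan_seconds():
    """Full sequential scan: table size over streaming throughput."""
    return TABLE_MB / SEQ_MBPS

def random_read_seconds(selectivity, block_kb):
    """Index-driven access: roughly one seek per block holding a wanted row
    (pessimistically assuming wanted rows scatter across distinct blocks)."""
    wanted_rows = (TABLE_MB * 1024 / ROW_KB) * selectivity
    blocks = min(wanted_rows, TABLE_MB * 1024 / block_kb)
    transfer = blocks * block_kb / 1024 / SEQ_MBPS
    return blocks * SEEK_MS / 1000 + transfer

print(f"full scan: {scan_seconds():,.0f} s")
for sel in (0.0001, 0.001, 0.01):
    for blk in (128, 1024):
        print(f"selectivity {sel:.2%}, {blk}K blocks: "
              f"{random_read_seconds(sel, blk):,.0f} s")
```

With these illustrative numbers, random access beats the 1,000-second scan only at selectivities of a fraction of a percent, and 1 MB blocks push the crossover lower still, which is the logic behind the extreme scan-oriented designs.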
Meanwhile, columnar vendor Vertica recreates some of the benefits of indexes by storing the same column in multiple sort orders. And that leads me to the next point.
High availability/failover alternatives. Most analytic DBMS mirror the data on the fly. But strategies differ. Some just rely on a storage vendor’s technology; others build in their own forms of redundancy.
Particularly interesting is Vertica’s approach. Not only does Vertica allow multiple copies of the data to each be used for querying; it encourages storing the same columns in different sort orders, with the optimizer naturally choosing to query the copy whose sort order is most useful for a specific query’s execution plan.
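A toy sketch of that idea, hypothetical rather than Vertica’s actual projection machinery: keep two copies of the same rows, each sorted on a different column, and route each range query to the copy whose sort order matches its predicate, so binary search replaces a scan.

```python
import bisect
import random

random.seed(0)
rows = [(random.randint(0, 10**6), random.randint(0, 10**4))
        for _ in range(100_000)]          # (event_time, customer_id)

# Two redundant copies of the same rows, each in a different sort order.
# Either copy also serves as a full replica for failover purposes.
by_time = sorted(rows, key=lambda r: r[0])
by_cust = sorted(rows, key=lambda r: r[1])
copies = {
    "event_time":  ([r[0] for r in by_time], by_time),
    "customer_id": ([r[1] for r in by_cust], by_cust),
}

def range_query(column, lo, hi):
    """Route the query to the copy sorted on the predicate column; binary
    search then locates the matching run of rows instead of scanning all."""
    keys, data = copies[column]
    start = bisect.bisect_left(keys, lo)
    end = bisect.bisect_right(keys, hi)
    return data[start:end]

hits = range_query("customer_id", 100, 110)  # touches ~0.1% of the rows
```

The appeal is that the redundancy you were going to pay for anyway, for availability reasons, does double duty as a performance feature.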
Redundancy and failover strategies are tightly tied to other administration issues too. For example, Aster Data and other vendors brag, with varying degrees of emphasis, that a new node can be added to a system and the whole thing reconfigures itself automagically, with zero downtime. Similarly, different systems respond differently to node failure, as measured by metrics such as: time to reestablish normal operation; performance hit (if any) after normal operation resumes; performance hit before normal operation resumes; and the time window (if any) during which redundancy is lost, so that a second failure would take down the whole system.
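One common technique behind that kind of graceful reconfiguration, offered purely as an illustration rather than as any of these vendors’ actual mechanism, is consistent hashing: when a node joins, only about 1/Nth of the records change homes, instead of everything reshuffling. A minimal sketch:

```python
import bisect
import hashlib

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**32)

class Ring:
    """Consistent-hash ring: each key belongs to the first node point at or
    after its hash, wrapping around at the top of the ring."""
    def __init__(self, nodes, vnodes=64):
        self.points = sorted((h(f"{n}#{i}"), n)
                             for n in nodes for i in range(vnodes))
        self.hashes = [p for p, _ in self.points]

    def node_for(self, key):
        i = bisect.bisect(self.hashes, h(str(key))) % len(self.points)
        return self.points[i][1]

before = Ring([f"node{i}" for i in range(4)])
after = Ring([f"node{i}" for i in range(5)])   # one node added

moved = sum(before.node_for(k) != after.node_for(k) for k in range(100_000))
print(f"{moved / 100_000:.1%} of records move")  # roughly 1/5, not everything
```

Strict hash-modulo placement, by contrast, would relocate most of the database when the node count changes, which is one reason rebalancing behavior is worth probing when vendors make zero-downtime claims.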
Bottom line: There never will be an analytic DBMS that simultaneously possesses all highly desirable architectural attributes for the product category.