This is the third of a five-part series on database management system choices. For the first post in the series, please click here.
High-end OLTP relational database management system vendors try to offer one-stop shopping for almost all data management needs. But as I noted in my prior post, their product category is facing two major competitive threats. One comes from specialty data warehouse database management system products. I’ve covered those extensively in this blog, with key takeaways including:
- Specialty data warehouse products offer huge cost advantages versus less targeted DBMS. This applies to purchase/maintenance and administrative costs alike. And it’s true even when the general-purposed DBMS boast data warehousing features such as star indexes, bitmap indexes, or sophisticated optimizers.
- The larger the database, the bigger the difference. It’s almost inconceivable to use Oracle for a 100+ terabyte data warehouse. But if you only have 5 terabytes, Oracle is a perfectly viable – albeit annoying and costly – alternative.
- Most specialty data warehouse products have a shared-nothing architecture. Smaller parts are cheaper per unit of capacity. Hence shared nothing/grid architectures are inherently cheaper, at least in theory. In data warehousing, that theoretical possibility has long been made practical.
- Specialty data warehouse products with row-based architectures are commonly sold in appliance formats. In particular, this is true of Teradata, Netezza, DATAllegro, and Greenplum. One reason is that they’re optimized to stream data off of disk fairly sequentially, as opposed to relying on random seeks.
- Specialty data warehouse products with columnar architectures are commonly available in software-only formats. Even so, Vertica and ParAccel also boast appliance deals, with HP and Sun respectively.
- There is tremendous technical diversity and differentiation in the specialty data warehouse system market.
Let me expand on that last point. Different features may or may not be important to you, depending on whether your precise application needs include:
- Absolute scalability. Teradata, DATAllegro, and SAS have customers with multi-hundred terabyte data warehouses (user data). Netezza is headed for that range soon. Many other vendors effectively top out in the tens of terabytes right now.
- Pre-projection. Columnar data warehouse products only retrieve the columns needed for a particular query. Depending on how wide your rows are, that can be a huge advantage. Through materialized views – and in Netezza’s case a hardware assist – row-based vendors have varying degrees of effectiveness at accomplishing similar things.
- Pinpoint queries. For some data warehouse applications (e.g., in call centers), it is important to look up specific records at high speed. Conventional-but-much-faster row-based systems like Teradata, Greenplum, or DATAllegro can usually meet these needs. Not all columnar vendors would do as well.
- Data scoring. Similarly, data scoring – e.g., in applications such on-the-fly call center offer optimization — tends to involve entire rows of data. Row-based systems may outperform columnar ones for that kind of use.
- Partitioning. A major virtue of most specialty data warehouse products is that they give pretty good performance without a lot of pre-tuning. Even so, range partitioning can ensure that data with particular field values is concentrated together, meaning that less data total needs to come off of disk to satisfy certain queries. A major use of this capability is in applications where data with recent dates is retrieved or analyzed more intensely than older information – and there are a lot of those.
- Load speed. Row-based data warehouse managers usually have fast load speeds, for bulk loads and trickle feeds alike. Columnar vendors differ in how fast they are at which kinds of loads.
- Concurrency. If you need your data warehouse to support a large and diverse set of concurrent reports and queries, Teradata will definitely do the job for you. Netezza and DATAllegro have also cleaned up bottlenecks through a few revisions each. Newer products might or might not do as well.
- Schema support. Some data warehouse products are optimized for pure star schemas with single fact tables, or schemas only a little more complex than that. Others are more broadly applicable. For many data warehouses, the simple schemas suffice.
- Compressibility. Every serious data warehouse product offers at least 2-5X compression. (Columnar systems are somewhat ahead of row-based ones.) Some spike up into double digits, especially for certain kinds of columns. Depending on the nature of your data, compression may or may not be a huge differentiating factor.
- Transparency/compatibility. If you need your data warehouse to run exactly the SQL that SQL Server does, you might want to talk to ParAccel. If you want to be compatible with MySQL, Infobright Brighthouse is the ticket.
- MDX or “Model by” instead of standard SQL.
One more complication: There also are a number of products that accelerate data analysis outside the DBMS, usually in a memory-centric way. Ones I’ve written about in the past include QlikView, SAP BI Accelerator, Applix, and the whole category of complex event/stream processing.
The complete series
- Part 1: Database management system choices – overview
- Part 2: Database management system choices – 4 categories of relational
- Part 3: Database management system choices – relational data warehouse
- Part 4: Database management system choices – mid-range-relational
- Part 5: Database management system choices – beyond relational