If I had to name one company with the broadest possible overview of the data warehouse engine market, it would have to be IBM. IBM offers software and hardware, services-heavy deals and quasi-appliances, OLTP and ROLAP, shared-everything and shared-nothing, integrated-(almost)-everything and best-of-breed. So their ROLAP recommendations, while still rather self-serving (just as any other vendor’s would be), are at least somewhat more than just a case of “Where you stand depends upon where you sit.”
At its core, the current IBM ROLAP story is:
- Shared-nothing MPP.
- Flexible indexing, lightly applied.
- Normalized data models.
- Thoroughly mixed workloads.
- Preconfigured hardware.
Here’s some more detail, about IBM and other vendors alike.
For years, IBM could reasonably be seen as straddling the shared-everything and shared-nothing worlds. On the mainframe, DB2 is a shared-everything system. But the portable version of DB2 is primarily shared-nothing, not least because of the difficulty of even matching Oracle’s degree of platform-independent shared-everything scalability.
And shared-nothing, or primarily shared-nothing, is now IBM’s clear choice for data warehousing. E.g., while a few years ago mainframe DB2 was actually a serious contender for high-end data warehouse projects, now it’s positioned as something you should run only if you’re a legacy mainframe user or if you need absurdly high degrees of uptime for your warehouse system. In fact, when DB2 is deployed on a big SMP box, IBM favors turning off or disregarding some of the sharing and effectively treating that box like more of a shared-nothing system.
When I posted a few days ago about Expansion Ratios, I passed along Teradata’s estimate that there was a significant gap between Teradata and IBM. IBM’s own estimate, however, is that a typical Expansion Ratio for DB2 is a Teradata-like 2-3X in DB2 Version 8, and that it’s falling further in Version 9 due to compression. (DB2’s compression sounds like it’s the most aggressive in the business, specialty columnar or MOLAP products perhaps excepted.) Incidentally, this suggests that the indexing features DB2 has that Teradata doesn’t – e.g., alternate datatypes like geospatial – aren’t heavily used by a large fraction of the customer base.
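To make the arithmetic concrete, here is a minimal sketch. It assumes an Expansion Ratio simply means total disk consumed divided by raw user data, and that compression applies only to the base tables. Both are simplifying assumptions of mine, the function and parameter names are hypothetical, and the figures are made-up illustrations rather than vendor numbers.

```python
def expansion_ratio(raw_gb, index_gb=0.0, aggregate_gb=0.0,
                    overhead_gb=0.0, compression_factor=1.0):
    """Total disk used divided by raw user data.

    Assumption: compression shrinks only the base data; indices,
    aggregates and other overhead are counted at full size.
    """
    stored_base_gb = raw_gb / compression_factor
    total_gb = stored_base_gb + index_gb + aggregate_gb + overhead_gb
    return total_gb / raw_gb


# Hypothetical warehouse: 1000 GB raw, lightly indexed, no compression.
uncompressed = expansion_ratio(1000, index_gb=500, aggregate_gb=700,
                               overhead_gb=300)

# Same warehouse with base tables compressed 2:1.
compressed = expansion_ratio(1000, index_gb=500, aggregate_gb=700,
                             overhead_gb=300, compression_factor=2.0)

print(uncompressed, compressed)
```

The point of the sketch is only that the ratio is driven by how much indexing and aggregation is layered on top of the base data, and that compression pulls it back down.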
In MOLAP systems, explosion/expansion ratios are primarily based on technical features. But in ROLAP the greatest deciding factor is often the degree of indexing and aggregation users choose to use in order to get performance – including prejoins, materialized views, denormalization, etc. Increasingly, IBM’s favored and recommended data architecture strategy is similar to Teradata’s, which basically boils down to:
- Normalize everything.
- To make MPP work, you have to distribute the data across the nodes. So you might as well hash partition it on a useful key or set of keys, which in effect gives a little indexing and join acceleration for free. (More than a little, actually, if the data structure is a fairly simple one with a single fact table.)
- Depending on workloads, put other column indices in place.
- Selectively add materialized views/join indices.
- Multilevel range partitioning is extremely useful, and any vendor who doesn’t have it should be working on it actively.
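The second point above, about hash distribution giving some join acceleration for free, can be sketched in a few lines. This is a toy illustration under my own assumptions, not how DB2 or Teradata actually implements it: the node count, the use of CRC32 as the hash, and every function name here are hypothetical.

```python
import zlib
from collections import defaultdict

NUM_NODES = 4  # hypothetical cluster size


def node_for(key, num_nodes=NUM_NODES):
    # CRC32 is stable across runs, unlike Python's built-in hash() on strings.
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def distribute(rows, key_column, num_nodes=NUM_NODES):
    """Assign each row to a node by hashing its distribution key."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[node_for(row[key_column], num_nodes)].append(row)
    return nodes


def colocated_join(fact_nodes, dim_nodes, key_column):
    """Join two tables distributed on the same key, one node at a time.

    Matching rows always hash to the same node, so no node ever needs
    to look at another node's data.
    """
    joined = []
    for node_id, fact_rows in fact_nodes.items():
        local_dim = {d[key_column]: d for d in dim_nodes.get(node_id, [])}
        for fact in fact_rows:
            dim = local_dim.get(fact[key_column])
            if dim is not None:
                joined.append({**dim, **fact})
    return joined


customers = [{"cust_id": i, "name": f"customer-{i}"} for i in range(10)]
orders = [{"cust_id": i % 10, "order_id": i, "amount": 10 * i}
          for i in range(50)]

cust_nodes = distribute(customers, "cust_id")
order_nodes = distribute(orders, "cust_id")
result = colocated_join(order_nodes, cust_nodes, "cust_id")
print(len(result))
```

Because both tables are distributed on `cust_id`, every order finds its customer on the same node. With a single fact table joined on its distribution key, that is most of the join work done without any explicit index.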
By way of contrast, DATallegro would endorse the first, second, and fifth points in that list, but argue that table scans via sequential reads (I’ve happily given up the “coarse-grained” terminology, since almost nobody cares) obviate most or all of the need for the third and fourth. And Netezza – well, I guess I shouldn’t comment on their views, because of their strict NDA policy.
That difference in approach underscores a key IBM/Teradata claim – if you want to do small queries with high performance, you need indices. And if you want to do truly mixed workloads — including “operational” or “low-latency” BI, or even high-volume scheduled reporting — you need to be able to execute efficiently on those kinds of queries. And the thing is – Netezza’s customer Ross Stores notwithstanding, I don’t know of a lot of counterexamples in the DATallegro or Netezza customer bases disproving that claim.
Finally, there is the matter of custom or preconfigured hardware. Historically, distributed DB2 had been sold as a software-only product. But when the appliance upstarts disrupted the market, IBM’s response was basically “Twist my arm – please!” Balanced Configuration Units (BCU) are either a consulting template (if you don’t want to use IBM-branded hardware) or an actual hardware configuration (if you do). Either way, IBM sells more than a software license. There also is some discussion of special database acceleration chips, but given how IBM folks seem not to hear me when I ask about them, I presume they’re not yet a major issue worth talking about.
As for Teradata – well, they’re obviously a hardware/software bundler through and through.