In a comment thread on Seth Grimes’ blog, Rita Sallam of Oracle engaged in a passionate defense of her company’s data warehousing software. I’d like to take it upon myself to respond to a few of her points here.
If a shared disk architecture is not scalable, then how is it that Oracle consistently has more customers in the Winter Group top ten data warehouses than any other vendor?
Unfortunately, the Winter Corporation list is a joke, which may be why it hasn’t been updated since 2005. (I mean it — Dick Winter seems like too good a guy to keep publishing something so misleading indefinitely.) It counts not just user data, but indices, aggregates, and everything else. Based on that, I’d guess the largest Oracle site listed there to be at 10-20 terabytes of user data, and all the others to be at 5 TB or less. Even assuming 3-5X database growth since the list was compiled, that puts Oracle behind Teradata, DATAllegro, Netezza, Dataupia, and probably SAS — just counting ones I can think of quickly — none of whom are actually represented on the list. Teradata in particular blows Oracle away on warehouse size.
And by the way — the largest Oracle warehouse by far on that list is at Yahoo. But Oracle isn’t Yahoo’s major data warehouse software provider.
If a shared disk architecture is not scalable, then how is it that Oracle is the leader in Data Warehouse Performance? It is the TPC-H leader in the 300GB, 1TB, 3TB, and 30TB categories.
TPCs are a joke too. Oracle’s third-longest-serving exec (or maybe second-longest — I always forget whether he or Ken Jacobs has been there longer) e-mailed me a few years ago, asking for my help in making them go away. Be that as it may:
- Oracle probably won the 30 TB TPC-H because it’s the only vendor to submit a result.
- Oracle is the “leader” on the 10 TB TPC-H by 10% in price-performance, using a system that hasn’t shipped yet, over a system that has already shipped. Five months is worth more than 10% in this day and age. Anyhow, the other contenders are Microsoft and IBM, which may be why Oracle finds it reasonably easy to keep up.
- Oracle trails Exasol by a factor of 27 — that is not a typo — on the 3 TB TPC-H.
- Oracle trails Exasol by a factor of 20 — also not a typo — on the 1 TB TPC-H.
- Oracle trails Kickfire by a factor of 14 — also not a typo — on the 300 GB TPC-H.
A shared disk architecture (Oracle) is more flexible. Since all processing units can see all data, the system can decide at runtime what the degree of parallelism should be. In addition, some queries may be more efficiently run in serial (simple index lookups), in which case parallelism isn’t even used.
Data warehouse appliances (at least the row-based ones) excel at fast table scans.
Also, if individual servers in a cluster contain many CPUs (or cores), the parallelism can be co-located on the node. Hence, statements may run in parallel but do not require the interconnect to ship data.
Appliance makers use multicore systems too. Everybody does, these days.
Oracle’s Shared Everything architecture provides the ability to dynamically optimize each query requirement. The current workload is examined and the degree of parallelism is adjusted rather than blindly starting with the same degree of parallelism every time. Therefore, the degree of parallelism is optimized for every query and there is no requirement for a minimal degree of parallelism across all nodes. Operations can run in parallel using one, some or all nodes of a Real Application cluster, depending on the current workload and the characteristics and importance of the query.
That’s all irrelevant to the chief benefit of parallelism. Parallelism isn’t about optimizing the use of CPUs. Parallelism is about optimizing the system where it’s actually bottlenecked, which is getting data off of disk.
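A back-of-the-envelope calculation makes the point. The figures below are illustrative assumptions — roughly 100 MB/s of sequential read per drive, a 10 TB table — not numbers from any vendor:

```python
# Back-of-the-envelope: why warehouse parallelism is about disk, not CPU.
# Assumed numbers (hypothetical, order-of-magnitude only): 100 MB/s
# sequential read per drive, a 10 TB table to scan.

TABLE_BYTES = 10 * 10**12   # 10 TB of user data
PER_DISK_MBPS = 100         # assumed sequential read throughput per drive

def scan_seconds(num_disks: int) -> float:
    """Time to table-scan TABLE_BYTES with num_disks reading in parallel."""
    bytes_per_sec = num_disks * PER_DISK_MBPS * 10**6
    return TABLE_BYTES / bytes_per_sec

for disks in (1, 10, 100):
    print(f"{disks:>3} disks -> {scan_seconds(disks) / 3600:.2f} hours")
```

One drive takes over a day to scan the table; a hundred drives reading in parallel take under twenty minutes. No amount of extra CPU changes that arithmetic.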
Parallelism is not related to the partitioning strategy of the data as in a shared-nothing environment.
Parallelism isn’t particularly related to partitioning strategy in a shared-nothing environment either. Kognitio offers a competitive shared-nothing system with no partitioning whatsoever. And many queries on most vendors’ systems relate to partitioning only in that the data is distributed so that approximately equal amounts of data may be found on each node.
True, since that distribution is done by hash partitioning, you try to pick hash keys so that, as often as possible, joins on the hash key can be performed locally on each node. And further partitioning can be added as an optimization. But that’s hardly a disadvantage for shared-nothing systems vs. Oracle.
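For illustration, here is a minimal sketch of that distribution scheme, assuming a hypothetical four-node cluster and made-up table names. Rows hash on the distribution key, so data volume balances out across nodes; and because both tables hash on the same key, an equi-join on it can run locally on each node, with nothing shipped over the interconnect:

```python
# Minimal sketch (hypothetical 4-node cluster) of shared-nothing hash
# distribution. Table names and sizes are invented for illustration.
import hashlib

NUM_NODES = 4

def node_for(key) -> int:
    """Deterministically map a distribution-key value to a node."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_NODES

# Two tables distributed on the same key (customer_id).
orders    = [(cid, f"order-{cid}") for cid in range(1000)]
customers = [(cid, f"cust-{cid}")  for cid in range(1000)]

nodes = [[] for _ in range(NUM_NODES)]
for cid, payload in orders + customers:
    nodes[node_for(cid)].append((cid, payload))

sizes = [len(n) for n in nodes]
print("rows per node:", sizes)  # roughly equal with a decent hash

# Matching rows from both tables always land on the same node,
# so a join on customer_id needs no inter-node data movement.
placement = {}
for node_id, rows in enumerate(nodes):
    for cid, _ in rows:
        placement.setdefault(cid, set()).add(node_id)
assert all(len(node_set) == 1 for node_set in placement.values())
```

Join on anything other than the distribution key, of course, and rows do have to be reshuffled — which is exactly why key choice matters.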
With Oracle’s shared disk approach, failover is built in and the configuration remains balanced.
If your disks are failing often enough for that to be more than a tiny benefit, you might want to consider changing your storage supplier.
Oracle does not require re-organization of data. Oracle’s hash partitioning is also automatic and does not require re-distribution of data. The Oracle Optimizer automatically tunes queries. In addition the Oracle Database 10g ADDM (Automatic Database Diagnostic Monitor) runs automatically to make performance recommendations. Index management is very simple in Oracle. The ADDM tool recommends indexes, generates scripts to create indexes and will run them with the DBA’s approval. Oracle also supports all types of data including stars, normalized and de-normalized data. Oracle supports Join Indexes and Aggregate Join Indexes.
Somebody please remind me to start an international Scrabulous tournament for Oracle DBAs, since they have nothing else to occupy their time.
In addition, Oracle supports superior concurrency and parallelism; Oracle can execute several queries at the same time (in parallel) without performance degradation. With Oracle’s model, there are several checkout counters that customers can use, which parallelizes the process and provides higher throughput. Even if a customer at one counter takes a long time to check out, other customers are not affected. Once all checkout counters are full, Oracle queues the remaining customers (queries) until the next checkout counter opens up and sends the next customer in line to the open counter. If this starts occurring consistently, and the processing of customers (queries) slows down, Oracle allows more checkout counters to be added dynamically using RAC or by simply adding more CPUs to the environment.
That one’s probably real.
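And it’s also garden-variety admission control — any bounded worker pool behaves this way. A generic sketch (the pool size, timings, and function names here are arbitrary illustrations, not anything Oracle-specific):

```python
# The "checkout counter" model: a fixed pool of workers (counters) runs
# queries in parallel; once all are busy, further queries wait in a
# queue until a worker frees up. Pool size is an arbitrary choice.
from concurrent.futures import ThreadPoolExecutor
import time

MAX_CONCURRENT_QUERIES = 3  # number of open "checkout counters"

def run_query(qid: int) -> str:
    time.sleep(0.05)  # stand-in for actual query execution
    return f"query {qid} done"

with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_QUERIES) as pool:
    # Ten customers, three counters: the first three start at once,
    # the rest wait in the executor's internal queue.
    results = list(pool.map(run_query, range(10)))

print(results[-1])  # -> "query 9 done"
```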