A lot of evidence is pointing to a major paradigm shift in data warehouse RDBMS, along the lines of:
Old way: Assume I/O is random; reduce total execution time by improving selectivity and thus doing less I/O.
New way: Drive the amount of random I/O to near zero, and do as much sequential I/O as necessary to achieve that goal. (The back-of-envelope sketch after the list below illustrates the tradeoff.)
Evidence of this shift includes:
- Data warehouse appliances (see especially this discussion of DATAllegro’s architecture)
- Columnar systems (see Nathan Myer’s first comment in this discussion of the much-hyped Required Technologies prototype)
- Memory-centric systems, notably SAP’s BI Accelerator
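To make the tradeoff concrete, here’s a hypothetical back-of-envelope comparison. The per-read and throughput figures are my own illustrative assumptions, roughly typical of 15,000 RPM enterprise disks, not anyone’s benchmark:

```python
# Hypothetical plan comparison: index-driven random reads vs. a full sequential scan.
RANDOM_READ_MS = 5.0      # assumed cost per random page read (seek + rotation)
SEQ_MB_PER_SEC = 100.0    # assumed sustained sequential transfer rate

def index_plan_seconds(rows_touched, rows_per_page=100):
    """Old way: one random page read per page holding a qualifying row."""
    return (rows_touched / rows_per_page) * RANDOM_READ_MS / 1000

def scan_plan_seconds(table_mb):
    """New way: stream the whole table off disk at sequential speed."""
    return table_mb / SEQ_MB_PER_SEC

# A 10 GB table of 100 million rows:
print(f"full scan:          {scan_plan_seconds(10_000):6.0f} s")      # ~100 s, regardless of selectivity
print(f"index plan, 1% hit: {index_plan_seconds(1_000_000):6.0f} s")  # ~50 s, still wins
print(f"index plan, 5% hit: {index_plan_seconds(5_000_000):6.0f} s")  # ~250 s, loses badly
```

Even at single-digit selectivity the random-I/O plan loses, and the crossover keeps moving in the scan’s favor as sequential transfer rates grow while random access times stay flat.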
The hardware logic is compelling, as long as we rely on hard disks rather than, say, flash memory. Rotation speed has gone up only 12.5-fold in the entire 50-year history of the hard drive, and currently maxes out at 15,000 RPM. At that speed a platter completes a revolution every 4 ms, so the average rotational latency (half a revolution) is 2 ms, which puts a floor under average random access time. But streaming data on and off disk gets exponentially faster, in line with increases in disk density and semiconductor performance. Hence sequential data access gets ever faster, while random access does not.
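To put rough numbers on that asymmetry (the historical figures below are approximations I’m supplying, not from any vendor):

```python
# Rotation speed vs. sequential transfer rate, 1956 (IBM RAMAC) to today, approximately.
rpm_then, rpm_now = 1_200, 15_000     # 12.5x in 50 years
mb_s_then, mb_s_now = 0.0088, 100.0   # RAMAC streamed ~8.8 KB/s; modern drives ~100 MB/s

print(f"rotation speed up {rpm_now / rpm_then:.1f}x")    # 12.5x
print(f"transfer rate up  {mb_s_now / mb_s_then:,.0f}x") # ~11,000x

# Random access is gated by rotation and seeks, so it has improved about one order
# of magnitude; sequential access has improved about four. The gap keeps widening.
```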
What I don’t 100% understand yet, however, is the full array of techniques used by the traditional leaders to co-opt or combat this trend. I’m looking into that; in particular, I have a call scheduled with Oracle.
I hope to write about this issue in my October Computerworld column. (My columns are typically submitted on the first Monday or Tuesday morning of the month, to appear in the following week’s edition.) Or if it slips from October, then soon thereafter. Any thoughts in the interim would be most welcome.