Shortly before Tuesday’s launch of DB2 10, IBM’s Conor O’Mahony checked in for a relatively non-technical briefing.* More precisely, this is about DB2 for “distributed” systems, aka LUW (Linux/Unix/Windows); some of the features have already been in the mainframe version of DB2 for a while. IBM is graciously permitting me to post the associated DB2 10 announcement slide deck.
*I hope any errors in interpretation are minor.
Major aspects of DB2 10 include new or improved capabilities in the areas of:
- Analytic query performance.
- Data ingest.
- Multi-temperature data management.
- Workload management.
- Graph management/relationship analytics.
- Time-travel, bitemporal features, and bitemporal time-travel.
Of course, there are various other enhancements too, including to security (fine-grained access control), Oracle compatibility, and DB2 pureScale. Everything except the pureScale part is also reflected in IBM InfoSphere Warehouse, which is a near-superset of DB2.*
*Also, the data ingest part isn’t in base DB2.
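To give a flavor of the fine-grained access control, here's a minimal sketch of DB2 10's row-permission DDL as I understand it; the table, role, and rule are all invented for illustration:

```sql
-- Hypothetical rule: sales reps may only see rows for their own region.
-- CREATE PERMISSION and ACTIVATE ROW ACCESS CONTROL are the DB2 10
-- row-access-control constructs; the table and role names are made up.
CREATE PERMISSION east_only ON sales
   FOR ROWS WHERE VERIFY_ROLE_FOR_USER(SESSION_USER, 'EAST_REP') = 1
             AND region = 'EAST'
   ENFORCED FOR ALL ACCESS
   ENABLE;

-- Nothing is filtered until row access control is switched on:
ALTER TABLE sales ACTIVATE ROW ACCESS CONTROL;
```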
The most remarkable claims Conor made were in the area of compression. Previously, IBM claimed 2.2-3X compression as typical, with 7X as best case. But as is (approximately) illustrated by Slide 12, IBM now says 7X is typical, with 4-10X being a realistic range and 45X having been the best case to date. Apparently, the DB2 compression strategy is now:
- Keep the old DB2 compression scheme, which is dictionary compression across the top 4096 values in a table or range partition. Notably, that compression extends to indexes, temp space, and so on, as well as the data itself.
- Add a similar page-level compression scheme. Other than saying it too was dictionary-based, Conor didn’t give details.
- Have some automation determine which values are compressed table-wide and which are compressed at the page level. (A sketch of how this surfaces in DDL follows this list.)
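If I have it right, this combination surfaces in DDL as a new "adaptive" flavor of the existing compression option. A minimal sketch, with an invented table:

```sql
-- COMPRESS YES ADAPTIVE asks for both the table-wide dictionary and
-- the new page-level dictionaries; COMPRESS YES STATIC keeps the old
-- table-level-only behavior. The table definition is invented.
CREATE TABLE customer_address (
   customer_id INTEGER NOT NULL,
   street      VARCHAR(100),
   city        VARCHAR(50),
   state       CHAR(2),
   zip         CHAR(5)
) COMPRESS YES ADAPTIVE;

-- Existing tables can be switched over and then rebuilt:
ALTER TABLE customer_address COMPRESS YES ADAPTIVE;
-- (A REORG, a CLP command rather than SQL proper, then applies the
-- change to existing pages.)
```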
Those numbers are pretty bold claims for dictionary compression, especially in a row-based system.* The two special features I can think of in IBM’s compression that might allow it to outdo other dictionary schemes are:
- You can compress multiple columns at once. (The canonical example is different fields in an address.)
- (If I understood Conor correctly) You can also compress substrings within a column, or across columns.
*Row-based vs. columnar doesn't matter for table-wide dictionary compression, but it does for page-level; the more comparable values you have per page, the better your chances to compress, and columnar systems naturally pack many values of the same column onto each page.
IBM claims consistent 3X query performance improvements on a variety of (non-published) benchmarks, with occasional examples of much higher figures. If the compression claims are really true, they could explain much of the query speed-up right there. Beyond that, the associated feature list is on Slides 7 and 8. The feature Conor called out was index pre-fetching, which makes good index organization less important (Slide 9), and hence means DBAs have to worry less about index maintenance.
Prior to DB2 10, it appears that data ingest ran through a single core, and that core had to be dedicated to the task. Now data ingest is just one more task that can be parallelized, workload-managed, and so on. The biggest relevance of this feature would seem to be when data is streamed in from a transactional system, which is of course what you want to do whenever practical, versus the batch-load alternative.*
*My first clue for that was the feature name “real-time data warehousing.”
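In concrete terms, the new piece is the INGEST utility. A minimal sketch of an invocation, with the file and table names invented:

```sql
-- Hypothetical INGEST call: continuously pull delimited rows from a
-- file (or a pipe fed by a transactional source) into a table, in
-- parallel and under workload management like any other SQL work.
INGEST FROM FILE orders_feed.del
   FORMAT DELIMITED
   INSERT INTO orders;
```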
IBM DB2 10 introduces the beginnings of multi-temperature data management. That is, you can place different ranges of the same range-partitioned table on different classes of storage: solid-state, faster disks, slower disks, whatever.
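The mechanism, as I understand it, pairs new "storage group" DDL with the existing range partitioning; all the paths and names below are invented:

```sql
-- Hypothetical setup: a storage group on SSD and one on slow disk,
-- each backing a tablespace; hot and cold ranges of one table then
-- land on different classes of storage.
CREATE STOGROUP hot_sg  ON '/ssd/db2';
CREATE STOGROUP cold_sg ON '/sata/db2';

CREATE TABLESPACE ts_hot  USING STOGROUP hot_sg;
CREATE TABLESPACE ts_cold USING STOGROUP cold_sg;

CREATE TABLE sales (
   sale_id   BIGINT        NOT NULL,
   sale_date DATE          NOT NULL,
   amount    DECIMAL(10,2)
)
PARTITION BY RANGE (sale_date) (
   PARTITION p_2011 STARTING '2011-01-01' ENDING '2011-12-31' IN ts_cold,
   PARTITION p_2012 STARTING '2012-01-01' ENDING '2012-12-31' IN ts_hot
);
```

If I have the syntax right, a tablespace whose data has cooled can later be reassigned to a cheaper storage group via ALTER TABLESPACE, rather than by moving the data by hand.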
DB2’s workload management as described by Conor sounds more primitive than what Tim Vincent told me about a year and a half ago. Probably it’s just a difference in emphasis or something. Anyhow, DB2 workload management:
- Newly sets limits on CPU consumed by certain workloads, rather than just divvying up CPU resources. (See the sketch after this list.)
- Doesn’t manage I/O or RAM.
- Newly works on its own, rather than relying on underlying operating systems.
- Takes the “temperature” of data (or type of storage it’s on?) into account as part of workload prioritization.
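As a rough sketch of what the new CPU limits might look like in DDL (my reading of the DB2 10 syntax; the service class, workload, and numbers are all invented):

```sql
-- Hypothetical: cap ad-hoc reporting at 25% of CPU through DB2's own
-- dispatcher rather than operating-system controls. (Enabling the
-- dispatcher itself is an instance-level configuration step.)
CREATE SERVICE CLASS reporting CPU LIMIT 25;

-- Route connections into the service class; the application name is
-- made up.
CREATE WORKLOAD rpt_wl APPLNAME('adhoc_reports') SERVICE CLASS reporting;
```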
IBM is introducing both time-travel and bitemporal capabilities, but we didn't spend much, um, time on them. “Time-travel” means you can query the state of the database as of some previous date. “Bitemporal” means data can also carry effective dates, i.e., dates on which the fact recorded (e.g., insurance coverage) starts or stops being true.
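For flavor, here's roughly what the time-travel side looks like in SQL, as I understand the DB2 10 syntax; the table and column names are invented:

```sql
-- Hypothetical system-period temporal table: DB2 maintains the
-- SYSTEM_TIME period columns and the history table automatically.
CREATE TABLE policy (
   policy_id INTEGER NOT NULL PRIMARY KEY,
   coverage  INTEGER,
   sys_start TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW BEGIN,
   sys_end   TIMESTAMP(12) NOT NULL GENERATED ALWAYS AS ROW END,
   trans_id  TIMESTAMP(12) GENERATED ALWAYS AS TRANSACTION START ID,
   PERIOD SYSTEM_TIME (sys_start, sys_end)
);
CREATE TABLE policy_history LIKE policy;
ALTER TABLE policy ADD VERSIONING USE HISTORY TABLE policy_history;

-- Time travel: the table as it stood at some past moment.
SELECT *
  FROM policy
  FOR SYSTEM_TIME AS OF '2012-01-01-00.00.00';

-- The bitemporal half adds an application-maintained BUSINESS_TIME
-- period for effective dates, queried analogously:
--   SELECT * FROM policy FOR BUSINESS_TIME AS OF '2012-06-01'
```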
IBM is also introducing some graph data features, and is showing the good taste to use my term relationship analytics. Mainly, this is SPARQL 1.0 support, implemented via a variety of relational tables. We're planning a follow-up briefing for me to learn more. An internal benchmark, showing a 3.5X speed-up, is memorialized on Slide 17.