IBM’s “Viper” version of DB2 is in open beta test, whatever that means, and Microsoft’s SQL Server 2005, née Yukon, is in general release. Both have native XML capabilities surpassing Oracle’s – which is interesting in its own right, because it’s rare for either of those vendors to pull ahead of Oracle in an OLTP feature, and almost unprecedented for both to do so at once.
So let’s talk about native XML support, what it is, and who might or should care about it. (Well, the apps part is actually in a separate Part 2 post.) Most of this is based on research that’s several months old, but except for a scarcity of actual user interviews, that shouldn’t matter much.
There are two main non-native ways to put XML into a SQL database such as Oracle – shredding and LOBs (BLOBs or CLOBs – i.e., Binary or Character Large OBjects). Both can perform poorly, for different reasons. Shredding takes XML documents and distributes them among a bunch of tables. So one update in XML can become many updates when shredded, and one lookup in XML can become a complex join from shredded storage. LOB storage obviates those problems, but creates another – even when you’re only looking for part of a document, you have to retrieve and handle the whole thing, and the same goes for updates.
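To make the tradeoff concrete, here is a minimal sketch in Python, using the standard library's sqlite3 and ElementTree modules as stand-ins for a real DBMS. The order document, the table names (orders, order_items, order_docs), and the helper functions are all hypothetical, invented for illustration; no vendor's actual schema or API is implied.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration.
ORDER_XML = """<order id="42">
  <customer>Acme</customer>
  <item sku="A1" qty="2"/>
  <item sku="B7" qty="1"/>
</order>"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE order_items (order_id INTEGER, sku TEXT, qty INTEGER);
    CREATE TABLE order_docs (order_id INTEGER PRIMARY KEY, doc TEXT);
""")


def store_shredded(conn, xml_text):
    """Shred one document across two tables; return the number of INSERTs."""
    root = ET.fromstring(xml_text)
    order_id = int(root.get("id"))
    conn.execute("INSERT INTO orders VALUES (?, ?)",
                 (order_id, root.findtext("customer")))
    inserts = 1
    for item in root.findall("item"):
        conn.execute("INSERT INTO order_items VALUES (?, ?, ?)",
                     (order_id, item.get("sku"), int(item.get("qty"))))
        inserts += 1
    return inserts


def store_as_clob(conn, xml_text):
    """Store the whole document as one character LOB; always one INSERT."""
    root = ET.fromstring(xml_text)
    conn.execute("INSERT INTO order_docs VALUES (?, ?)",
                 (int(root.get("id")), xml_text))
    return 1


def skus_shredded(conn, customer):
    """A lookup that is one step in XML terms becomes a join here."""
    rows = conn.execute(
        "SELECT i.sku FROM orders o "
        "JOIN order_items i ON i.order_id = o.order_id "
        "WHERE o.customer = ? ORDER BY i.sku", (customer,))
    return [row[0] for row in rows]


def skus_clob(conn, order_id):
    """Even for a small fragment, the whole document is fetched and parsed."""
    (doc,) = conn.execute(
        "SELECT doc FROM order_docs WHERE order_id = ?", (order_id,)).fetchone()
    root = ET.fromstring(doc)
    return sorted(item.get("sku") for item in root.findall("item"))


shredded_inserts = store_shredded(conn, ORDER_XML)
clob_inserts = store_as_clob(conn, ORDER_XML)
```

In this toy case the shredded store issues one INSERT per element that maps to a row, and reading the SKUs back needs a join, while the LOB store writes once but has to pull back and parse the entire document to answer the same question.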
So native storage can be a good thing when you can afford the performance hit of neither shredding, nor LOB storage, nor any available hybrid. It could also help if getting good performance from non-native storage, while possible, would create undue burdens on application development, or if there’s some other reason one or both of the shredding and LOB approaches isn’t viable.
One nice feature is that native-XML storage has almost no downside, at least if you get it from the high-end DBMS vendors. IBM, Oracle, and Microsoft have all worked out ways to have integrated query parsing and query optimization, while letting storage be more or less separate. More precisely, Oracle actually still sticks everything into one data store (hence the lack of native XML support), but allows near-infinite flexibility in how it is accessed. Microsoft has already had separate servers for tabular data, text, and MOLAP, although, like Sybase, it doesn’t have general datatype extensibility that it can expose to customers, or exploit itself to provide a great variety of datatypes. IBM has had Oracle-like extensibility all along, although it hasn’t been quite as aggressive at exploiting it; now it’s introduced a separate-server option for XML. Both Microsoft and IBM claim that their administrative tools are slick enough that the DBA has little more work with their offerings than with a true single-server solution.
So how does the storage actually work? The basic idea is exactly what you’d think. Data is stored in name-value pairs, with pointers connecting parents to children. The secret sauce (and here I have less detail than I’d like) is the extra information that’s stored, either at the nodes directly, or in an overarching index. Obviously, there’s a tradeoff between update and retrieval speed. And equally obviously, I need to learn more of the particulars.
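Since I don't have the vendors' particulars, the following is only a toy sketch of the general idea in Python, not a description of how DB2 or SQL Server actually does it. Nodes are name-value pairs with parent and child pointers, and a path-to-node dictionary plays the role of the "overarching index". The names Node, NodeStore, path_index, and index_writes are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One name-value pair, linked to its parent and children by id."""
    node_id: int
    name: str
    value: Optional[str]
    parent: Optional[int]
    children: List[int] = field(default_factory=list)


class NodeStore:
    """Toy node store with an assumed path index kept alongside the nodes."""

    def __init__(self):
        self.nodes = {}
        self.path_index = defaultdict(list)  # "/order/item" -> [node ids]
        self.index_writes = 0                # extra work paid on every update

    def _path(self, node_id):
        """Follow parent pointers up to the root to build the node's path."""
        parts = []
        while node_id is not None:
            node = self.nodes[node_id]
            parts.append(node.name)
            node_id = node.parent
        return "/" + "/".join(reversed(parts))

    def add(self, name, value=None, parent=None):
        node_id = len(self.nodes)
        self.nodes[node_id] = Node(node_id, name, value, parent)
        if parent is not None:
            self.nodes[parent].children.append(node_id)
        # The update cost: each new node also means an index write.
        self.path_index[self._path(node_id)].append(node_id)
        self.index_writes += 1
        return node_id

    def find_by_walk(self, path):
        """Retrieval without the index: visit every node and follow pointers."""
        return [nid for nid in self.nodes if self._path(nid) == path]

    def find_by_index(self, path):
        """Retrieval with the index: a single dictionary lookup."""
        return list(self.path_index.get(path, []))


store = NodeStore()
order = store.add("order")
customer = store.add("customer", "Acme", parent=order)
item_a = store.add("item", "A1", parent=order)
item_b = store.add("item", "B7", parent=order)
```

Both lookups return the same nodes for a path such as "/order/item"; the difference is that the indexed one skips the walk, at the price of one extra index write per node added. That is the update-versus-retrieval tradeoff in miniature.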
And on that somewhat lame note, let me point you at Part 2 of this post, which discusses whether and how this stuff will actually be used. (Preview: It will, big time – I think.)