On August 29, I had a great call with IBM about DB2 pureXML (most of the IBM side of the talking was done by Conor O’Mahony and Qi Jin). I’m finally getting around to writing it up now. (The world of tabular data warehousing has kept me just a wee bit busy …)
As I write it, I see there are a considerable number of holes, but that’s the way it seems to go when researching XML storage. I’m also writing up a September call from which I finally figured out (I think) the essence of how MarkLogic Server works – but only after five months of trying. It turns out that MarkLogic works rather differently from DB2 pureXML. Not coincidentally, IBM and Mark Logic focus on rather different use cases for native XML storage.
What I understand so far about the basic DB2 pureXML architecture goes like this:
- DB2 pureXML stores XML in “true hierarchical format.” Based on all the discussion of indexing, I’m guessing that the way it does this is somewhat similar to that in MarkLogic.
- Unlike MarkLogic, DB2 pureXML gives you the choice of what tags to index on.
- In a big difference from MarkLogic, text search on DB2 pureXML involves two separate indexes – XML and text (the latter being of the usual inverted-list variety). You can text-search both contents and tags, with the usual CONTAINS semantics.
- PureXML has a data store separate from the rest of DB2’s, notwithstanding IBM’s references to XML “columns.” DB2’s general datatype extensibility framework is not used; I don’t wholly understand why.
- I neglected to ask how well DB2 backup, management tools, and so on extend to DB2 pureXML.
- You can talk to DB2 pureXML via two data manipulation languages: SQL/XML and XQuery. Both are compiled down to the same run-time instructions. IBM said there’s an abstraction layer sitting over both the relational store and the XML store that allows for this. I don’t totally understand what that means, since presumably the SQL/XML starts out by being sent to DB2’s parser.
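To make the two-index point above a little more concrete, here is a toy Python sketch of the general idea: a path index built only for the paths somebody chose to index, alongside a separate inverted list over the words in element text. This is emphatically not IBM's implementation. The sample documents, the `INDEXED_PATHS` set, and the function names are all hypothetical, and for brevity the text index covers element contents only, not tags.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample documents, keyed by document id.
DOCS = {
    1: "<order><customer>Acme</customer><note>rush delivery please</note></order>",
    2: "<order><customer>Globex</customer><note>standard delivery</note></order>",
}

# Hypothetical: the only path an administrator chose to index.
INDEXED_PATHS = {"/order/customer"}


def walk(elem, prefix=""):
    """Yield (path, text) for an element and all of its descendants."""
    path = f"{prefix}/{elem.tag}"
    yield path, (elem.text or "").strip()
    for child in elem:
        yield from walk(child, path)


def build_indexes(docs, indexed_paths):
    """Build a selective path index and a separate inverted text index."""
    path_index = defaultdict(set)  # (path, value) -> document ids
    text_index = defaultdict(set)  # lowercased word -> document ids
    for doc_id, xml in docs.items():
        for path, text in walk(ET.fromstring(xml)):
            # Only the chosen paths go into the XML (path) index.
            if path in indexed_paths and text:
                path_index[(path, text)].add(doc_id)
            # Every word of element text goes into the inverted list.
            for word in text.lower().split():
                text_index[word].add(doc_id)
    return path_index, text_index


path_index, text_index = build_indexes(DOCS, INDEXED_PATHS)
```

With these structures, a lookup such as `path_index.get(("/order/customer", "Acme"))` hits the path index, while a keyword lookup such as `text_index.get("delivery")` goes through the inverted list. Nothing under `/order/note` appears in the path index, because that path was never selected.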
A big part of IBM’s XML business strategy is to support various (typically vertical market) XML standards. IBM has implemented support for these standards and made it freely downloadable. What does “support” mean? It surely starts with a DTD (Document Type Definition), and apparently also includes mappings to generic web services interfaces. It turns out that there are a lot of these standards, so I’m listing some in a separate post.
More generally, it seems that the sales and uses for IBM pureXML are concentrated in two main (overlapping) cases:
- When XML was going to be used anyway. One big example of this is the case of the standards-based industry data interchanges. Another example is when pureXML, albeit disk-based, acts as a kind of quasi-cache or mini-MDM hub (Master Data Management) for WebSphere-based enterprise application integration (EAI). IBM reports that DB2 pureXML has been sold as an intermediate EAI data store at least once each in banking, retailing, health care, and insurance.
- When schema flexibility is of great importance.
Experience teaches me that schema flexibility is a subject that can attract considerable flames, in the general vein of “Omigod! The relational model is perfect because it’s mathematically proven to ensure referential integrity!!” So I’ll split out the main discussion of that into yet another separate post, and keep going.
IBM actually breaks out the pureXML use cases into four main groups:
- Transactional. This comprises the transactional logging of information that just happens to be XML, such as in financial services.
- Forms-oriented. This comprises, for example, the tax authority use case.
- Service bus acceleration. That’s a fancy phrase to cover both the standards-based interchanges and the other EAI-related uses.
- Event-driven data warehousing. This one was kind of blurry to me. What I think it means is that if you have transactional data in XML, and you want to use it in near-real-time business intelligence, DB2 pureXML can help you with that.
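The first, third, and fourth of these groups (transactional, service bus acceleration, and event-driven data warehousing) seem to fit into my “When XML was going to be used anyway” category. Part of “Schema flexibility” matches the second (forms-oriented); I’m not clear on where in IBM’s four buckets the rest of schema flexibility goes.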
Finally, I asked directly in what areas there were significant numbers of DB2 pureXML customers. IBM offered two examples. One was financial services in general — especially in North America, notwithstanding the importance of the UNIFI standard in Europe. The other was health care data interchange outside the United States — especially in China, where regional and national centers are being established to more closely oversee local hospitals.
- IBM kindly gave me permission to make available the slide presentation from our August 29 briefing. The last page has a large number of links to further IBM pureXML resources.
- Conor O’Mahony has a good blog.
- As noted above, I am putting up separate posts on standards-based data interchange and schema flexibility.