December 14, 2005

Reasons to use native XML

From a DevX article on Microsoft’s SQL Server 2005

Depending on your situation, XML can also be the best choice for storing even highly structured data. Here are a few practical reasons to consider storing data in a field of type XML:

* Repeated shredding or publishing—On-demand transformations carry a performance penalty. If you have to shred or publish the same document over and over again, consider storing it natively as XML. You can always expose it to relational consumers with an XML view.
* Rapidly changing data structures—When modeled correctly, XML lives up to its name: It’s extensible. Developers can add new pieces of data—even new hierarchies—to a schema without compromising existing software. Extensibility is an extra advantage when prototyping, or when working with rapidly changing problem domains such as bioinformatics.
* Atomic data—Sometimes, you’ll have XML data that’s never consumed except as a whole. Think of this as logical atomicity—if you never access the parts individually, you might as well store it in one big chunk.
* Debugging—Especially for new releases, it can be a good idea to tuck away a copy of your XML imports. The data may be redundant, but keeping the original makes tracking down problems a whole lot easier.

Nothing there to disagree with too heavily, although I can think of some other reasons that might rank higher yet.
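
For concreteness, here's roughly what a native XML column looks like in SQL Server 2005. This is a minimal sketch; the table and element names are hypothetical, but the XML datatype and its value()/exist() methods are real features of the release.

```sql
-- Hypothetical table: relational keys alongside a native XML column
CREATE TABLE Orders (
    OrderID  INT PRIMARY KEY,
    OrderDoc XML            -- stored natively, not shredded into side tables
);

-- Pull one value out of the document, and filter on its contents,
-- without retrieving or reassembling the whole thing
SELECT OrderID,
       OrderDoc.value('(/order/total)[1]', 'decimal(10,2)') AS Total
FROM   Orders
WHERE  OrderDoc.exist('/order[@status="open"]') = 1;
```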

December 12, 2005

Two kinds of DBMS extensibility

Microsoft took slight exception to my claim that they lack fully general DBMS extensibility. The claim is actually correct, but perhaps it could lead to confusion. And anyhow there’s a distinction here worth drawing, namely:

There are two different kinds of DBMS extensibility.

The first one, which Microsoft has introduced in SQL Server 2005 (but which other vendors have had for many years), is UDTs (User-Defined Types), sometimes called user-defined functions in other systems. These are in essence datatypes that are calculated functions of existing datatypes. You could use a UDT, for example, to make the NULLs in SQL go away, if you hate them. Or you can calculate bond interest according to the industry-standard “360 day year.” Columns of these datatypes can be treated just like other columns — one can use them in joins, one can index on them, the optimizer can be aware of them, etc.
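
To make the bond example concrete, here's a minimal sketch as a scalar user-defined function; the name and signature are invented for illustration, and a full UDT would package this logic inside a type rather than a standalone function.

```sql
-- Hypothetical function: bond interest under the "360 day year" convention
CREATE FUNCTION dbo.Interest360 (
    @principal DECIMAL(18,2),
    @rate      DECIMAL(9,6),   -- annual rate, e.g. 0.05 for 5%
    @days      INT             -- days accrued under the 30/360 day count
)
RETURNS DECIMAL(18,2)
AS
BEGIN
    -- every year is treated as exactly 360 days
    RETURN @principal * @rate * @days / 360.0;
END;

-- Usable like any other column expression, e.g.:
-- SELECT bond_id, dbo.Interest360(principal, rate, days_held) FROM Bonds;
```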

The second one, commonly known by the horrible name of abstract datatypes (ADTs), is found mainly in Oracle, DB2, and previously the Informix/Illustra products. Also, if my memory is accurate, Ingres has a very partial capability along those lines, and PostgreSQL is said to be implementing them too. ADTs offer a way to add totally new datatypes into a relational system, with their own data access methods (e.g., index structures). That’s how a DBMS can incorporate a full-text index, or a geospatial datatype. It can also be a way to more efficiently implement something that would also work as a UDT.

In theory, Oracle et al. expose the capability to create ADTs to users. In practice, you need to be a professional DBMS developer to write them, and they are written either by the DBMS vendors themselves or by specialist DBMS companies. E.g., much geospatial data today is stored in ESRI add-ons to Oracle; ESRI of course offered a specialty geospatial DBMS before ADTs were on the market.

Basically, implementing a general ADT capability is a form of modularity that lets new datatypes be added more easily than if you don’t have it. But it’s not a total requirement for new datatypes. E.g., I was wrong about Microsoft’s native XML implementation; XML is actually managed in the relational system. (More on that in a subsequent post.)

December 9, 2005

SAP’s version of DBMS2

I just spent a couple of days at SAP’s analyst meeting, and realized something I’d somewhat forgotten – much of the DBMS2 concept was inspired by SAP’s technical strategy. That’s not to say that SAP’s techies necessarily agree with me on every last point. But I do think it is interesting to review SAP’s version of DBMS2, to the extent I understand it.

1. SAP’s Enterprise Services Architecture (ESA) is meant to be, among other things, an abstraction layer over relational DBMS. The mantra is that they’re moving to a “message-based architecture” as opposed to a “database architecture.” These messages are in the context of a standards-based SOA, with a strong commitment to remaining open and standards-based, at least on the data and messaging levels. (The main limitation on openness that I’ve detected is that they don’t think much of standards such as BPEL in the business process definition area, which aren’t powerful enough for them.)

2. One big benefit they see to this strategy is that it reduces the need to have grand integrated databases. If one application manages data for an entity that is also important to another application, the two applications can exchange messages about the entity. Anyhow, many of their comments make it clear that, between partner company databases (a bit of a future scenario) and legacy app databases (a very big factor in the present day), SAP is constantly aware of situations in which a single integrated database is infeasible.

3. SAP is still deeply suspicious of redundant transactional data. They feel that with redundant data you can’t have a really clean model – unless, of course, you code up really rigorous synchronization. However, if for some reason redundancy is preferred – e.g., for performance reasons – the synchronization can be hidden from users and most developers.

4. One area where SAP definitely favors redundancy and synchronization is data warehousing. Indeed, they have an ever more elaborate staging system to move data from operational to analytic systems.

5. In general, they are far from being relational purists. For example, Shai Agassi referred to doing things that you can’t do in a pure relational approach. And Peter Zencke reminded me that this attitude is nothing new. SAP has long had complex business objects, and even done some of its own memory management to make them performant, when they were structured in a manner that RDBMS weren’t well suited for. (I presume he was referring largely to BAPI.)

6. That said, they’re of course using relational data stores today for most things. One exception is text/content, which they prefer to store in their own text indexing/management system TREX. Another example is their historical support for MOLAP, although they seem to be edging as far away from that as they can without offending the MOLAP-loving part of their customer base.

Incidentally, the whole TREX strategy is subject to considerable doubt too. It’s not a state-of-the-art product, and they currently don’t plan to make it into one. In particular, they have a prejudice against semi-automated ontology creation, and that has clearly become a requirement for top-tier text technologies.

7. One thing Peter said that confused me a bit came up when we were talking about nonrelational data retrieval. The example he used was retrieving information on all of a specific sales rep’s customers, or perhaps on several sales reps’ customers. I got the feeling he was talking about the ability to text search on multiple columns and/or multiple tables/objects/whatever at once, but I can’t honestly claim that I connected all the dots.

And of course, the memory-centric ROLAP tool BI Accelerator — technology that’s based on TREX — is just another example of how SAP is willing to go beyond passively connecting to a single RDBMS. And while their sponsorship of MaxDB isn’t really an example of that, it is another example of how SAP’s strategy is not one to gladden the hearts of the top-tier DBMS vendors.

December 9, 2005

Relational DBMS versus text data

There seems to be tremendous confusion about “search,” “meaning,” “semantics,” the suitability of relational DBMS to manage text data, and similar subjects. Here are some observations that may help sort some of that out.

1. Relational database theorists like to talk about the “meaning” or “semantics” of data as being in the database (specifically its metadata, and more specifically its constraints). This is at best a very limited use of the words “meaning” or “semantics,” and has little to do with understanding the meaning of plain English (or other language) phrases, sentences, paragraphs, etc. that may be stored in the database. Hugh Darwen is right and his fellow relational theorists are confused.

2. The standard way to manage text is via a full-text index, designed like this: for hundreds of thousands of words, the index maintains a list of which documents each word appears in, and at what positions in those documents it appears. This is a columnar, memory-centric approach that doesn’t work well with the architecture of mainstream relational products. Oracle pulled off a decent single-server integration nonetheless, although performance concerns linger to this day. Others, like Sybase, which attempted a Verity integration, couldn’t make it work reasonably at all. Microsoft, which started from the Sybase architecture, didn’t even try, or if it tried it wasn’t for long; Microsoft’s text search strategy has been multi-server more or less from the get-go.
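
As a toy illustration of the design (no particular vendor's implementation), here's the posting-list idea rendered as a relational table, which also hints at why it sits so awkwardly in a row-oriented engine:

```sql
-- A posting list, naively rendered as rows: one row per word occurrence
CREATE TABLE Posting (
    word   VARCHAR(100) NOT NULL,
    doc_id INT          NOT NULL,
    pos    INT          NOT NULL   -- word offset within the document
);

-- Which documents contain "database"?
SELECT DISTINCT doc_id FROM Posting WHERE word = 'database';

-- A crude phrase search for "database systems", via a self-join on positions
SELECT a.doc_id
FROM   Posting a
JOIN   Posting b ON b.doc_id = a.doc_id AND b.pos = a.pos + 1
WHERE  a.word = 'database' AND b.word = 'systems';
```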

3. Notwithstanding point #2, Oracle, IBM, Microsoft, and others have extended their SQL DBMS to handle text via the SQL3 (or SQL/MM) standard. (Truth be told, I get the names and sequencing of the SQL standard versions mixed up.) From this standpoint, the full text of a document is in a single column, and one can write WHERE clauses on that column using a rich set of text search operators.

But while such SQL statements formally fit into the relational predicate logic model, the fit is pretty awkward. Text search functions aren’t two-valued binary yes/no types of things; rather, they give scores, e.g., with 101 possible values (the integers from 0 to 100). Compounding them into a two-valued function typically throws away information, especially since that compounding isn’t well understood (which is why it’s so hard to usefully federate text searches across different corpuses).
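
SQL Server's full-text functions illustrate the point: CONTAINSTABLE returns a relevance score per matching row, while the CONTAINS predicate collapses the match to yes/no. A rough sketch, with hypothetical table and column names:

```sql
-- Ranked retrieval keeps the scores (RANK is a relevance score, not a truth value)
SELECT d.doc_id, k.RANK
FROM   CONTAINSTABLE(Docs, body, 'merger') AS k
JOIN   Docs d ON d.doc_id = k.[KEY]
ORDER BY k.RANK DESC;

-- The two-valued version throws the score, and the ordering it supports, away
SELECT doc_id FROM Docs WHERE CONTAINS(body, 'merger');
```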

4. Something even trickier is going on. Text search can be carried out against many different kinds of things. One increasingly useful target is the tables of a relational database. Where a standard SQL query might have trouble finding all the references in a whole database to a particular customer organization or product line or whatever, a text search can do a better job. This kind of use is becoming increasingly frequent. And while it works OK against relational products, it doesn’t fit into the formal relational model at all (at least not without a tremendous amount of contortion).

5. Relational DBMS typically manage the data they index. Text search systems often don’t. But that difference is a comparatively small one next to some of the others mentioned above, especially since it’s a checkmark item for leading RDBMS to have some sort of formal federation capability.

December 9, 2005

More flame war stupidity

Robert Seiner (publisher of TDAN) and Fabian Pascal are now claiming that Computerworld approached Bob and asked him to do something about the false charge that I personally engaged in censorship. To the best of my knowledge, they’re both lying. It was just me, and me alone, who approached Bob, which is exactly what one would think, if for some odd reason one cared about the matter at all. I don’t have the faintest idea why they fabricated this story, or what they think it demonstrates — but they did.

Seiner also picked a title for an article of mine he published, then published one by Fabian attacking me for the title. Classy.

Bob also made two promises in the matter which he didn’t keep. Nor did he have the courtesy to inform me that he’d changed his mind, nor did he really address it when I called him on it.

I wondered why Seiner kept on publishing Pascal’s stuff, even for free, when most of Fabian’s other publishers have dropped him. Now I have a better idea. They’re soulmates.

A pity. Partway through our discussions, Bob sounded eminently reasonable. That’s why I jumped at his suggestion I write an article for him. Oh well; live and learn.

And for the record — no, I won’t respond to Pascal’s critiques point by point. He typically attacks straw men, rather than restricting his barbs to my actual opinions. In those areas where we do actually disagree, I haven’t hesitated to publish follow-on arguments, repeatedly and at length, here and elsewhere. I’ve given that relative nonentity much more attention than he deserves.

Also for the record — even though I don’t respond to every nasty shot Pascal and his associates take at me, I’m of course not conceding that his other libels and opinions are actually correct. I just think that by and large he’s a waste of bandwidth, because even his coherent ideas are quickly sidetracked by highly illogical fulminations. Even in articles where he’s otherwise making enough sense to respond to, he usually goes off on some extremist semantics-related kick that doesn’t mesh well with his own imperfect command of the English language.

(I really want to respond to his film contracts example from a three-year-old anti-XML diatribe. But the article gets bogged down with various “definitions” that are not easily reconciled to normal usage of the words, and it’s too much trouble to sort through them all. Maybe I’ll respond to the idea without linking to the article itself, when I get around to it.)

Exception to the above slam at Pascal — he recently posted a good interchange he had with Hugh Darwen, which I’m referencing in another post in this blog. His side was wrong, but both sides were well-presented.

December 2, 2005

Some Moore’s Law data points

I’m not a hardware guy, but here are some data points around the subject of Moore’s Law, quasi-Moore laws, and their bearing on random access times to disk and RAM. This line of inquiry is central to my argument favoring memory-centric data management.

Human-Computer Interaction cites 10 nanoseconds for RAM access time and 7 milliseconds for disk, and those figures are a couple of years old — that’s pretty supportive of my figure, namely the 1,000,000:1 ratio (7 milliseconds / 10 nanoseconds = 700,000:1).

Don Burleson asserts that RAM speed has been 50 nanoseconds for decades. Hmm. I’m not sure what that means, since I’d think that RAM access speeds are bounded by clock speed. Of course, 20 megahertz is 50 nanoseconds per cycle, so some multiple of 20 megahertz would suffice to allow true 50 nanosecond access.

EDIT: I looked again, and he says that the 50 nanosecond limit is based on “speed of light and Proximity to CPU.” I’m even more confused than before, since light travels about 30 centimeters per nanosecond (at least in a vacuum).

A summary of a February, 2000 Jim Gray article makes some interesting points and claims in the same general subject area. One that stands out:

Storage capacity improves 100x / decade, while storage device throughput increases 10x / decade. At the same time the ratio between disk capacity and disk accesses/second is increasing more than 10x / decade. Consequently, disk accesses become more precious and disk data becomes colder with time, at a rate of 10x / decade.

Also, the disk capacity analog to Moore’s Law is sometimes named after Mark Kryder, and the network capacity version is attributed to George Gilder.

And finally — the fastest disks now made seem to spin at 15000 RPM. Those would take 2 milliseconds to spin halfway around, for the most naive estimate of their average random access time. And the naive estimate seems not to be too bad — depending on the exact model, they’re actually advertised with 3.3-3.9 millisecond seek times.

November 21, 2005

Is Oracle losing its edge?

Over in the Monash Report, I posed the question: Is Oracle losing its edge in DBMS? Here are some of the data points that make me suspect it has. A number of these points also apply to the other large mainstream DBMS vendors; a number, however, do not.

That’s a lot of evidence, even without mentioning threats from the open sourcers and the data warehouse appliance guys. So why am I not wholly convinced yet? Well, reasons include a variety of scalability features, extensibility features that are rivaled only by IBM’s, market share dominance on Linux, and Andy Mendelsohn. That’s a pretty compelling list too. Still, the Oracle colossus is teetering a little bit, and it’s not beyond imagination that some future earthquake could bring it crashing down.

November 18, 2005

Two purely theoretical problems with TransRelational(TM)

There’s a vigorous discussion of TransRelational over on Alf Pedersen’s blog (Edit: Link died), although it’s completely polluted by some usual-suspects flame war BS.

Alf did poke through the dreck, however, to make a reasonable challenge, which can be paraphrased as:

OK.  Suppose you’re right that no implementation has ever provided evidence of TransRelational’s usefulness for building a True Relational DBMS. It’s still theoretically fascinating.

My response was as follows:

Here are two big problems with TransRelational that are purely theoretical.

First, it assumes that values can be concisely stated, presumably as numbers or character strings. That isn’t a good match to complex datatypes such as, say, documents that should be full-text indexed.

Second, it assumes that there’s a natural sort order. That could be a bit of a problem even for, say, geospatial. One would think there’s a workaround in the geospatial case, e.g., Oracle’s old hhencode. But hhencode was a fiasco, I think because it didn’t actually measure proximity very effectively.

Admittedly, both of my objections also apply to good old b-trees. Still, they speak against the potential of a TransRelational implementation to achieve the kind of generality I think modern applications do and will increasingly demand.

Basically, I think a “True Relational” DBMS that was only useful for columns with natural sort orders wouldn’t be particularly interesting. And “The Third Manifesto” notwithstanding, that’s the only kind anybody seems to have even hinted at trying to bring to market.

November 17, 2005

Native XML Storage, Part 2 (apps)

The introduction and technical-implementation part of this discussion was in Part 1.

It seems likely that widespread adoption of native XML storage is, at best, several years off, if for no other reason than that the DML (Data Manipulation Language) situation is still rather primitive. But looking beyond that nontrivial problem, it does seem as if there are broad classes of application that might go better in native XML. Here’s a survey.

First of all, there’s what might be called custom document composition – technical publishing, customized technical manuals, etc. If you make complex products, or sell information, this is obviously an important specialty application for you. Otherwise, it probably is rather peripheral, at least for now. If you do have an interest in this area, by the way, you shouldn’t only look at the big guys’ XML offerings; you should also talk to specialists like Mark Logic. (Mark Logic sells an XML-only DBMS with a strong text-search orientation.)

Second, there are complex documents with low update rates. Medical records are a prime example – and, by the way, many of those are stored in InterSystems’ OODBMS Cache rather than in a relational system. Other examples might include insurance claims, media assets, etc. – basically, the areas that have been thought of as the purview of document management systems. In many cases, these apps ain’t broke and shouldn’t be fixed, such as when they exist mainly to satisfy slow-changing regulatory requirements. Besides, it’s not obvious that native XML is particularly useful for these apps anyway. Often, the information is in a DBMS for three main reasons: general manageability (e.g., backup), ad-hoc searchability, and management of metadata. If the metadata is simple enough to fit comfortably into a tabular structure, extended-relational DBMS may be satisfactory as underpinnings for these apps indefinitely.

Third, and here’s where it really begins to get interesting, is complex transactional documents. One of the flagship apps in Viper’s alpha test was financial derivatives trading, with complex, number-laden, term-laden contracts being processed very quickly, and it’s easy to envision that kind of functionality spreading across the trading sector. Governments – wisely or not – may want to require new complex forms to be filled out, or to make older ones easier to process. (E.g., tax returns, or applications for various kinds of permits.) If privacy concerns allow, medical information might be collected and processed centrally by governments or large insurance providers. Complex service-level agreements could be negotiated for a broad variety of product and service categories. Customers might demand radically faster processing of insurance claims than has historically been necessary. Indeed, it’s hard to think of an industry sector where complex transactional documents might not gain a foothold. And if you’re looking for high performance access to portions of documents, native XML may well be the best storage choice.

Finally, there’s a fourth category, which I’ll give the trendy-looking name Profiles 2.0, in imitation of Web 2.0, Identity 2.0, and so on. Here’s what I mean by it. A number of the hottest buzzconcepts in computing focus on collecting, organizing, and using information about individual people – presence, identity, personalization/customization, narrowcasting/market-of-one, data mining/predictive analytics, weblog analysis, social software, and so on. Put all those together, and you have a humongous hairball of a user profile that no current systems come close to handling properly.

Let’s think about some characteristics of this data. Some of it is transient. Some of it is unreliable. Some of it indeed is guesswork – albeit educated guesswork – rather than fact (e.g., the results of data mining analyses). Much of it exists for some profilees but not others. Much of it is naturally tree- or graph-shaped (e.g., information about website traversal, product category interests, relationship networks, role-based authorizations, etc.) There are many kinds of it; pulling it all together relationally can lead to Joins From Hell.
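
To picture the Joins From Hell, consider what assembling even one profile might look like relationally. Every table name below is hypothetical, and a real deployment would have many more:

```sql
-- One person's profile, scattered across many sparsely populated tables
SELECT p.person_id, p.name,
       c.page_visited, i.category, r.related_person_id, g.role_name
FROM   Person p
LEFT JOIN Clickstream  c ON c.person_id = p.person_id
LEFT JOIN Interest     i ON i.person_id = p.person_id
LEFT JOIN Relationship r ON r.person_id = p.person_id
LEFT JOIN RoleGrant    g ON g.person_id = p.person_id
WHERE  p.person_id = 12345;
-- ...and that's before the tree- and graph-shaped parts (traversal paths,
-- relationship networks) get flattened into still more join tables.
```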

And this isn’t just for individuals; similar kinds of stories can be told for information about organizations, battleships, and so on. Those are objects with rich internal structures. True, those can usually be modeled hierarchically – but at each node, some of the complications mentioned in the prior paragraph occur. Profiling an enterprise is even messier than profiling a single individual who shops or works there.

Applications using this kind of information are typically extremely primitive, even though the beginnings of the personalization hype are now 7-8 years in the past. I don’t think we’re going to get these kinds of systems right until we take a true, holistic view of individuals and their profiles – and until we learn how to think about apps whose fundamental objects keep changing in shape. But as hard as the problem is, it has to be worked on immediately, because what I’m talking about here are some of the major classes of competitive-advantage app.

So Profiles 2.0 isn’t something we can just ignore. And when we do pay attention to it, I don’t think we’ll find that it looks very natural dressed in rows and columns.

November 17, 2005

Native XML storage, Part 1 (technology)

IBM’s “Viper” version of DB2 is in open beta test, whatever that means, and Microsoft’s SQL Server 2005, nee Yukon, is in general release. Both have native XML capabilities surpassing Oracle’s – which is interesting in its own right, because it’s rare for either of those vendors to pull ahead of Oracle in an OLTP feature, and almost unprecedented for both to do so at once.

So let’s talk about native XML support, what it is, and who might or should care about it. (Well, the apps part is actually in a separate Part 2 post.) Most of this is based on research that’s several months old, but except for a scarcity of actual user interviews, that shouldn’t matter much.

There are two main non-native ways to put XML into a SQL database such as Oracle – shredding and LOBs (BLOBs or CLOBs – i.e., Binary or Character Large OBjects). Both can perform poorly, for different reasons. Shredding takes XML documents and distributes them among a bunch of tables. So one update in XML can become many updates when shredded, and one lookup in XML can become a complex join from shredded storage. LOB storage obviates those problems, but creates another – even when you’re only looking for part of a document, you have to retrieve and handle the whole thing, and the same goes for updates.
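
Here's a minimal sketch of the two non-native approaches for a hypothetical order document; all names are invented for illustration:

```sql
-- Shredding: one XML order becomes rows in several tables...
CREATE TABLE OrderHeader (order_id INT PRIMARY KEY, customer VARCHAR(100));
CREATE TABLE OrderLine   (order_id INT, line_no INT, sku VARCHAR(20), qty INT);

-- ...so reassembling the document takes a join, with more tables as nesting deepens
SELECT h.order_id, h.customer, l.line_no, l.sku, l.qty
FROM   OrderHeader h
JOIN   OrderLine   l ON l.order_id = h.order_id
WHERE  h.order_id = 42;

-- LOB storage: the whole document in one column, fetched and updated all-or-nothing
CREATE TABLE OrderDoc (order_id INT PRIMARY KEY, doc CLOB);
```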

So native storage can be a good thing when you can afford neither the performance hit of shredding, nor of LOB storage, nor of any available hybrid. It also could be good if getting good performance from non-native storage, while possible, would create undue burdens on application development, or if there’s some other reason one or both of the shredding and LOB approaches isn’t viable.

One nice feature is that native-XML storage has almost no downside, at least if you get it from the high-end DBMS vendors. IBM, Oracle, and Microsoft have all worked out ways to have integrated query parsing and query optimization, while letting storage be more or less separate. More precisely, Oracle actually still sticks everything into one data store (hence the lack of native XML support), but allows near-infinite flexibility in how it is accessed. Microsoft has long had separate servers for tabular data, text, and MOLAP, although like Sybase, it doesn’t have general datatype extensibility that it can expose to customers, or exploit itself to provide a great variety of datatypes. IBM has had Oracle-like extensibility all along, although it hasn’t been quite as aggressive at exploiting it; now it’s introduced a separate-server option for XML. Both Microsoft and IBM claim that their administrative tools are slick enough that the DBA has little more work with their offerings than would be present in a true single-server solution.

So how does the storage actually work? The basic idea is exactly what you’d think. Data is stored in name-value pairs, with pointers connecting parents to children. The secret sauce (and here I have less detail than I’d like) is the extra information that’s stored, either at the nodes directly, or in an overarching index. Obviously, there’s a tradeoff between update and retrieval speed. And equally obviously, I need to learn more of the particulars.
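
In the meantime, here's a naive rendering of the idea as a table, purely to fix the mental model; the real engines use far more compact storage, plus the extra indexing just mentioned:

```sql
-- Nodes as name-value pairs, with parent pointers preserving the hierarchy
CREATE TABLE XmlNode (
    node_id    INT PRIMARY KEY,
    parent_id  INT,              -- NULL for the document root
    node_name  VARCHAR(100),     -- element or attribute name
    node_value VARCHAR(4000),    -- text content, if any
    ord        INT               -- position among siblings, preserving document order
);
-- Retrieval walks the parent/child pointers; the "secret sauce" is whatever
-- extra per-node information or overarching index makes that walk fast.
```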

And on that somewhat lame note, let me point you at Part 2 of this post, which discusses whether and how this stuff will actually be used. (Preview: It will, big time – I think.)
