I’ve been interested in the Mark Logic story from the first time CEO Dave Kellogg told me about it. Basically, Mark Logic sells an XML-based DBMS optimized for text search, called MarkLogic Server. For obvious reasons, they don’t want to position it as a DBMS; hence they call it an “XML content server” instead. I posted about their marketing and application focus over on Text Technologies. In this post, I’ll dive a little deeper into the core technology.
In essence, the MarkLogic technology is search-engine-plus. So let’s review how a conventional search engine works. Its core index amounts logically to a sparse matrix, whose “rows” refer to documents, and whose columns refer to words (or phrases, or n-grams). The matrix cell entries aren’t single values; rather, they’re lists of positions for the word’s possibly multiple appearances in the document. Otherwise, however, the whole thing looks a lot like an inverted-list or bitmapped index from the conventional DBMS world.
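To make the sparse-matrix picture concrete, here is a minimal sketch of a positional inverted index. It is a toy under my own assumptions (whitespace tokenization, made-up function names such as `build_positional_index`), not a description of how MarkLogic or any particular search engine actually stores its index.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each word to {doc_id: [positions]}; empty matrix cells are simply not stored."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive tokenization (an assumption): lowercase and split on whitespace.
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def phrase_hits(index, first, second):
    """Return ids of documents where `second` appears immediately after `first`."""
    hits = []
    for doc_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(p + 1 in following for p in positions):
            hits.append(doc_id)
    return sorted(hits)

docs = {
    1: "xml content server for xml search",
    2: "search engine with a content index",
}
index = build_positional_index(docs)
```

Here `index["xml"]` is `{1: [0, 4]}`, which is one "cell" of the matrix holding a list of positions, and `phrase_hits(index, "content", "server")` uses those positions to match a two-word phrase. Keeping positions rather than a single bit per cell is what separates this from a plain bitmapped index.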
The “plus” is where XML comes in. In principle but not in practice, I am told, the rows are not just documents, but any document subset (chapter, heading, paragraph, whatever) marked up in XML. I didn’t ask exactly what that meant, but I strongly guess that the index is on documents, and then when a document is hit its segments are drilled down into via a different but well-integrated data access method. I think there’s also something to do with XML in the columns, but I didn’t really grasp that part at all.
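For what it's worth, my guess could look something like the sketch below: a word index kept at the document level, followed by a drill-down into the marked-up segments of each document that was hit. Everything here is hypothetical (the `para` tag, the function names, the use of Python's standard `xml.etree.ElementTree`), and it illustrates only the guess, not MarkLogic's actual access method.

```python
import xml.etree.ElementTree as ET

def words_of(element):
    """Lowercased words from all text inside an element."""
    return " ".join(element.itertext()).lower().split()

def build_doc_index(docs):
    """Stage 1: index words against whole documents only."""
    index = {}
    for doc_id, xml_text in docs.items():
        for word in words_of(ET.fromstring(xml_text)):
            index.setdefault(word, set()).add(doc_id)
    return index

def search_segments(docs, index, word, tag="para"):
    """Stage 2: for each document hit, find which tagged segments contain the word."""
    results = []
    for doc_id in sorted(index.get(word, ())):
        root = ET.fromstring(docs[doc_id])
        for n, segment in enumerate(root.iter(tag), start=1):
            if word in words_of(segment):
                results.append((doc_id, n))
    return results

docs = {
    "a": "<doc><para>XML content server</para><para>text search</para></doc>",
    "b": "<doc><para>relational tables</para></doc>",
}
index = build_doc_index(docs)
hits = search_segments(docs, index, "search")
```

With these two toy documents, `hits` identifies the second paragraph of document `a`, and document `b` is never parsed in the second stage because the document-level index already ruled it out.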
So how different are MarkLogic’s capabilities from those of conventional RDBMS or search engines? Well, MarkLogic does a lot less than the RDBMS do, obviously, so a better question would be: What capabilities does MarkLogic offer that conventional competitors don’t?
Full-spectrum DBMS, like those from Oracle, Microsoft, and IBM, offer some sort of integrated SQL, XML, and text-search-DML. Thus, it’s hard for Mark Logic to beat the big guys in search functionality, although the company plausibly claims that somehow things work out more slickly with their technology than with the generalists’. And that’s even before considering performance, although since Mark Logic doesn’t talk about performance much, I imagine it’s not a particularly strong suit of theirs.
Where MarkLogic is more differentiated, at least vis-à-vis Oracle, is in retrieval. Oracle can index and retrieve entire XML documents in CLOBs, no problemo. But suppose you only want to retrieve a single paragraph of a document. Uh, that can be quite problematic …
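A rough way to see why this is awkward: if the document lives in storage as one opaque blob of text, then getting a single paragraph means pulling back and parsing the whole thing first. The sketch below is a toy using Python's standard `xml.etree.ElementTree`; the `STORED_DOCUMENT` string stands in for a whole-document value, and `fetch_fragment` is a name I made up. It is not Oracle's API or MarkLogic's.

```python
import xml.etree.ElementTree as ET

# Stand-in for a document stored whole, as one large text value.
STORED_DOCUMENT = (
    "<book>"
    "<chapter><para>Intro text.</para></chapter>"
    "<chapter><para>First point.</para><para>Second point.</para></chapter>"
    "</book>"
)

def fetch_fragment(xml_text, path):
    """Parse the entire document, then return only the element at `path`, serialized."""
    root = ET.fromstring(xml_text)   # the full document is parsed regardless of the request
    element = root.find(path)        # e.g. "chapter[2]/para[2]" (positions are 1-based)
    if element is None:
        return None
    return ET.tostring(element, encoding="unicode")

fragment = fetch_fragment(STORED_DOCUMENT, "chapter[2]/para[2]")
```

The caller gets back just the one `para` element, but the work done is proportional to the whole document. A store that indexes and addresses fragments directly can, at least in principle, skip that step.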
As for MarkLogic vs. search engines – well, search engines generally aren’t too smart about XML. On the other hand, they have lots of performance and fine-grained relevancy-enhancing features that MarkLogic may not match. At least, FAST does.
And by the way, I wish they’d decide once and for all whether or not there’s a space between “Mark” and “Logic.”