While I previously posted in great detail about how MarkLogic Server is an ACID-compliant XML-oriented DBMS with integrated text search that indexes everything in real time and executes range queries fairly quickly, I didn’t have a good feel for how all those apparently contradictory characteristics fit into a single product. But I finally had a call with Mark Logic Director of Engineering Ron Avnur, and think I have a better grasp of the MarkLogic architecture and story.
Ron described MarkLogic Server as a DBMS for trees. That is, MarkLogic is designed for an XML data model that’s all about nodes and relationships, but not necessarily for XML data per se. The fundamental paradigm is thus searches for nodes and/or trees, for example:
- Find all the trees or nodes anywhere in the system that have certain text in them.
- Find all the nodes anywhere in the system that have certain text in their children or immediate descendants.
- Find all trees that have a pair of nodes, of a certain kind, in a certain relationship to each other.
Also important are aggregates and perhaps also — although Mark Logic rarely mentions them unless prompted — range queries.
So let’s start with some basics about Mark Logic Server’s indexing and storage strategy:
- As in a conventional text indexing system, most MarkLogic indexes are just term lists. A term is a token, which usually is just a word.
- MarkLogic can actually have two term lists for each token – one for the XML tags (i.e., attribute values), and one for the plain text content. (Most tokens, of course, never actually occur in a tag.)
- MarkLogic also indexes node names – i.e., attribute name, or type of node – and node relationships.
- “Document” has its usual meaning in MarkLogic Server. (This is worth saying because XQuery books sometimes assume that everything in an XML database is concatenated into a single cosmic document.)
- Documents can be broken up into fragments, with metadata as to how they fit together. The smallest unit MarkLogic retrieves is such a fragment (or of course an entire document).
- Largely because of how tag information is tokenized and stored, MarkLogic doesn’t store documents as serialized XML. Mark Logic says this eliminates a great deal of XML’s traditional bloat.
- If I understood correctly, Mark Logic said that a single document – perhaps one that’s 100K long as serialized XML – could be referenced by 1000s or 10s of 1000s of term lists. Frankly, that looks a little high to me, so it’s possible Mark Logic was talking about using that number of term lists for a medium- or large-sized collection. (Mark Logic says that MarkLogic has been used for databases up to the 100s of terabytes, and for civilian databases in the 10s of terabytes range.)
- MarkLogic’s term list indexes, I’m pretty sure, store not only where a term occurs but also what position in a document (section) it appears in.
- MarkLogic indexes also record the position of each node in a tree.
- Term lists are all ordered “the same way.” Therefore, Mark Logic claims, search time scales linearly with the number of documents.
In addition to those indexes, which comprise what Mark Logic calls the “Universal Index,” there are scalar indexes. These are columnar, where a column can cover either a single element name (i.e., attribute name) or a set of (presumably related) element names. Two copies of each column are kept, one sorted by TreeID and one by value. Mark Logic believes these suffice to give good performance on aggregations and range lookups.
With all those columns and term lists, the question naturally arises: What about the MarkLogic update strategy? Highlights include:
- MarkLogic uses MVCC (MultiVersion Concurrency Control). Updates are never done in place; rather, a new version of a record is appended.
- Documents are divided into subsets, each with its own set of indexes. Each subset that is being actively updated is small. This helps with update performance. It’s actually more important for the scalar columns than for the term lists. You update a term list by adding values to the end, so length hardly matters. But in a column index you insert new values into the middle, so length is indeed relevant.
- When document subsets get to be a certain size, they’re merged into yet larger subsets, which no longer have new documents added to them. This is when MVCC cruft is cleaned up as well. However, you can specify a minimum time that old versions are guaranteed to survive.
- This architecture gives an obvious form of partitioning, letting you parallelize in an obvious way. This partitioning is random by default, although there’s an API that lets you programmatically make it nonrandom.
Why not some kind of intelligent horizontal partitioning, whether range-based or otherwise? Well, we’ve finally gotten to a MarkLogic weakness. Join performance is not a MarkLogic long suit or Mark Logic priority. Indeed, Mark Logic insists that XML data is inherently denormalized, with joins (complex or otherwise) therefore rarely arising.
For many of today’s use cases, that’s probably true. For example, when I heard this I quickly started thinking “What if a book publisher changes name? That information is in a lot of individual book records.” But the fact is, people want author/publisher information that’s accurate as of the time of release, not updated for subsequent publishing company mergers and the like.
For other use cases, however, joins may be more important. For example:
- Derivatives contracts need to survive changes in brokerage firm corporate control. (IBM reports that derivatives are an important XML use case today.)
- Medical records need to be connected with a patient’s most current contact information. (IBM reports that XML is making inroads into health care applications.)
- Consumer profiling done right would benefit greatly from XML’s schema flexibility, but would also require significant use of joins.