This post has been long in the writing for several reasons, the biggest being that I stopped working for almost a month due to family issues. Please forgive its particularly choppy writing style; having waited this long already, I now lack the patience to further clean it up.

I think it's time to do a catch-up post about MarkLogic, which:

- Is an ACID-compliant, document-oriented, non-SQL, XML-based scale-out DBMS vendor of non-trivial size and momentum.
- Still has the same technical approach I previously described.
- Recently posted an internally-written white paper with a lot of technical detail.
- Recently had a point release — MarkLogic 4.2 — a lot of which seems to be “Oh, you didn’t have that before?” kinds of stuff.
- Has given me permission to post most of the slides from same, the first few of which give a nice overview of the MarkLogic story.
- Claims 200+ each of customers and employees (that’s from a slide MarkLogic did ask me to remove from the deck).
- Is a client again.
- Not coincidentally, is interested in branching out past the vertical markets of media and government/intelligence, in particular to the financial services market.
- Has finally rationalized its company and product names so that both are now “MarkLogic.”
- Has finally grasped that if it is proud of its ACID-compliance it probably shouldn’t be trying to market itself as “NoSQL”.
As a practical matter, most MarkLogic users fall into the overlapping areas of “content publishing” and “search.” That said, from a search standpoint MarkLogic is both less and more than a standard state-of-the-art text search engine. What I mean is:
- MarkLogic offers a simplistic search engine out-of-the-box.
- MarkLogic also offers a long list of sophisticated search features (see the white paper linked above), but you have to integrate those in yourself. Some of these are new in Release 4.2.
- MarkLogic also lets you integrate in search on structured fields in a way classic text search engines do not.
However, MarkLogic does not do relational JOINs.
A couple of specific notes on MarkLogic’s search capabilities:
- MarkLogic’s tokenization/stemming are OEMed from SAP subsidiary Inxight.
- As Verity did in the late 1990s, MarkLogic lets you run individual documents against long lists of queries, to drive alerting-oriented applications. I gather MarkLogic calls this “reverse query,” although overall that seems to be a somewhat ambiguous/overloaded phrase.
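The reverse-query idea can be sketched as follows — a toy Python illustration of the concept only, not MarkLogic's actual XQuery API, and the term-set query representation is my simplification:

```python
# Toy sketch of "reverse query" alerting: instead of running one query over
# many documents, run each incoming document against many stored queries.

def matches(query_terms, doc_terms):
    """A stored query 'matches' if all of its terms appear in the document."""
    return query_terms <= doc_terms

def route_document(doc_text, subscriptions):
    """Return the subscribers whose stored query matches the new document."""
    doc_terms = set(doc_text.lower().split())
    return [user for user, terms in subscriptions.items()
            if matches(terms, doc_terms)]

# Hypothetical subscriptions, each a set of required (lowercased) terms.
subscriptions = {
    "alice": {"marklogic", "replication"},
    "bob": {"derivatives"},
}
hits = route_document("MarkLogic 4.2 adds document-level replication",
                      subscriptions)
# hits -> ["alice"]
```

A real alerting engine would index the stored queries themselves so that a new document can be matched against millions of them efficiently, rather than looping as above.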
One exception to MarkLogic’s historical search/publishing focus is OEM OpenConnect, a then-client to whom I recommended MarkLogic, and who found MarkLogic to have great performance sorting through what amounts to graphs or trees of log events. Other exceptions may be found in the financial services market, where MarkLogic speaks of one customer that stores the complex information defining derivatives contracts as XML documents.
There are multiple aspects to the conjunction of MarkLogic and ETL/ELT/ELTL (Extract/Transform/Load, in its various orderings):
- XML was invented for ETL. Other XML DBMS vendors focus more on ETL-oriented XML use than MarkLogic does. Perhaps in the future MarkLogic’s target markets will include more in that area.
- MarkLogic 4.2 adds two major pieces of ETL-oriented functionality:
- An ETL development GUI, thrillingly named “Information Studio.”
- Support for XSLT as a first-class language along with XQuery, which in particular is being used to support Information Studio.
- MarkLogic thinks it’s a great idea if you enhance data before or after putting it into a MarkLogic database.
- MarkLogic’s new replication functionality (more on that below) can be adapted to produce a transformed version rather than (or in addition to) a faithful copy. One use: A partial copy (specifically, metadata only) of high-security information might be suitable for people with lower clearances (who might be allowed to know all intelligence conclusions, but only to see the identities of the specific sources on a need-to-know basis).
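That last use case — replicating a transformed, metadata-only copy for lower-clearance readers — can be pictured with a small sketch. This is purely illustrative; the field names and the transform function are my invention, not MarkLogic's replication API:

```python
def metadata_only(doc):
    """Transform step for a lower-clearance replica: keep the document's
    metadata and conclusions, but drop source identities."""
    replicated_fields = {"id", "title", "conclusions", "classification"}
    return {k: v for k, v in doc.items() if k in replicated_fields}

def replicate(docs, transform=None):
    """Faithful copy by default; apply a transform when one is supplied."""
    return [transform(d) if transform else dict(d) for d in docs]

# Hypothetical source documents, represented here as plain dicts.
source = [{"id": 1, "title": "Report", "conclusions": "summary text",
           "classification": "secret", "source_identity": "AGENT-7"}]
replica = replicate(source, transform=metadata_only)
# replica[0] retains id/title/conclusions/classification but not source_identity
```

The point is simply that a replication hook which accepts a transform can produce either a faithful copy or a deliberately reduced one from the same machinery.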
MarkLogic scales out via a form of node heterogeneity, in that there are two types of nodes — evaluator and data. All MarkLogic evaluator nodes talk to all MarkLogic data nodes, and hence vice-versa. There is no third kind of “head” node, so I presume any evaluator node can act as a head for any particular query. The whole thing sounds fairly Exadata-like, if we ignore the fact that MarkLogic probably got there before Oracle Exadata shipped. Documents are distributed among data nodes via a hashing mechanism.
Note: The choice of hash key presumably doesn’t matter as much as it does in a relational DBMS, since MarkLogic has no concept of join, hash join, or having a join be accelerated by the fact that data is pre-hashed on the join key.
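Hash-based placement of documents can be sketched in a few lines. The hash function, key choice, and node count below are illustrative assumptions, not MarkLogic's documented algorithm:

```python
import hashlib

def data_node_for(doc_uri, num_data_nodes):
    """Map a document URI to one data node via a stable hash.

    md5 of the URI is used here only because it is deterministic across
    processes; the real system's hash function is an assumption.
    """
    digest = hashlib.md5(doc_uri.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_data_nodes

# Every document lands on exactly one data node, and any evaluator node
# can compute the placement independently — consistent with there being
# no separate "head" node.
node = data_node_for("/contracts/derivative-123.xml", 4)
```

Because there are no joins to co-locate, the only real job of the hash is to spread documents evenly, which is why the choice of hash key matters less here than in a distributed relational DBMS.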
One focus area for MarkLogic 4.2 was revamping the MarkLogic availability story. In particular:
- MarkLogic has offered failover for a while. But in previous releases MarkLogic’s failover was based on an underlying clustered file system, whereas MarkLogic 4.2 lets you use cheaper local disk.
- Prior to MarkLogic 4.2, replication (e.g. for disaster recovery) relied on the storage system, unless you wanted to pony up for a professional services engagement. Now MarkLogic offers built-in document-level replication.