Analysis of Mark Logic and its Marklogic Server search-friendly XML DBMS product. Related subjects include:
This is Part 1 of a three post series. The posts cover:
- Confusion about text data management.
- Choices for text data management (general and short-request).
- Choices for text data management (analytic).
There’s much confusion about the management of text data, among technology users, vendors, and investors alike. Reasons seems to include:
- The terminology around text data is inaccurate.
- Data volume estimates for text are misleading.
- Multiple different technologies are in the mix, including:
- Enterprise text search.
- Text analytics — text mining, sentiment analysis, etc.
- Document stores — e.g. document-oriented NoSQL, or MarkLogic.
- Log management and parsing — e.g. Splunk.
- Text archiving — e.g., various specialty email archiving products I couldn’t even name.
- Public web search — Google et al.
- Text search vendors have disappointed, especially technically.
- Text analytics vendors have disappointed, especially financially.
- Other analytic technology vendors ignore what the text analytic vendors actually have accomplished, and reinvent inferior wheels rather than OEM the state of the art.
Above all: The use cases for text data vary greatly, just as the use cases for simply-structured databases do.
There are probably fewer people now than there were six years ago who need to be told that text and relational database management are very different things. Other misconceptions, however, appear to be on the rise. Specific points that are commonly overlooked include: Read more
|Categories: Analytic technologies, Archiving and information preservation, Google, Log analysis, MarkLogic, NoSQL, Oracle, Splunk, Text||2 Comments|
A reporter tweeted: “Is there a simple plain English definition for NoSQL?” After reminding him of my cynical yet accurate Third Law of Commercial Semantics, I gave it a serious try, and came up with the following. More precisely, I tweeted the bolded parts of what’s below; the rest is commentary added for this post.
NoSQL is most easily defined by what it excludes: SQL, joins, strong analytic alternatives to those, and some forms of database integrity. If you leave all four out, and you have a strong scale-out story, you’re in the NoSQL mainstream. Read more
|Categories: Cassandra, dbShards and CodeFutures, MarkLogic, MySQL, Object, Open source, Petabyte-scale data management, Schooner Information Technology||7 Comments|
The emphasis I’m putting on derived data is leading to a variety of questions, especially about how to tease apart several related concepts:
- Derived data.
- Many-step processes to produce derived data.
- Schema evolution.
- Temporary data constructs.
So let’s dive in. Read more
|Categories: Data models and architecture, Data warehousing, Derived data, MarkLogic, Text||Leave a Comment|
Six months ago, I argued the importance of derived analytic data, saying
… there’s no escaping the importance of derived/augmented/enhanced/cooked/adjusted data for analytic data processing. The five areas I have in mind are, loosely speaking:
- Aggregates, when they are maintained, generally for reasons of performance or response time.
- Calculated scores, commonly based on data mining/predictive analytics.
- Text analytics.
- The kinds of ETL (Extract/Transform/Load) Hadoop and other forms of MapReduce are commonly used for.
- Adjusted data, especially in scientific contexts.
Probably there are yet more examples that I am at the moment overlooking.
Well, I did overlook at least one category.
A surprisingly important kind of derived data is metadata, especially for large, poly-structured data sets. For example, CERN has vastly quantities of experiment sensor data, stored as files; just the metadata alone fills over 10 terabytes in an Oracle database. MarkLogic is big on storing derived metadata, both on the publishing/media and intelligence sides of the business.
We hear much these days about unstructured or semi-structured (as opposed to) structured data. Those are misnomers, however, for at least two reasons. First, it’s not really the data that people think is un-, semi-, or fully structured; it’s databases.* Relational databases are highly structured, but the data within them is unstructured — just lists of numbers or character strings, whose only significance derives from the structure that the database imposes.
*Here I’m using the term “database” literally, rather than as a concise synonym for “database management system”. But see below.
Second, a more accurate distinction is not whether a database has one structure or none – it’s whether a database has one structure or many. The easiest way to see this is for databases that have clearly-defined schemas. A relational database has one schema (even if it is just the union of various unrelated sub-schemas); an XML database, however, can have as many schemas as it contains documents.
One small terminological problem is easily handled, namely that people don’t talk about true databases very often, at least when they’re discussing generalities; rather, they talk about data and DBMS.* So let’s talk of DBMS being “structured” singly or multiply or whatever, just as the databases they’re designed to manage are.
*And they refer to the DBMS as “databases,” because they don’t have much other use for the word.
All that said — I think that single vs. multiple database structures isn’t a bright-line binary distinction; rather, it’s a spectrum. For example: Read more
|Categories: Cassandra, Couchbase, Data models and architecture, HBase, IBM and DB2, MarkLogic, MongoDB, NoSQL, Splunk, Theory and architecture||21 Comments|
My clients at MarkLogic have a new CEO, Ken Bado, even though former CEO Dave Kellogg was quite successful. If you cut through all the happy talk and side issues, the reason for the change is surely that the board wants to see MarkLogic grow faster, and specifically to move beyond its traditional niches of publishing (especially technical publishing) and national intelligence.
So what other markets could MarkLogic pursue? Before Ken even started work, I sent over some thoughts. They included (but were not limited to): Read more
Edit: This disclosure has been superseded by a March, 2012 version.
From time to time, I disclose our vendor client lists. Another iteration is below. To be clear:
- This is a list of Monash Advantage members.
- All our vendor clients are Monash Advantage members, unless …
- … we work with them primarily in their capacity as technology users. (A large fraction of our user clients happen to be SaaS vendors.)
- We do not usually disclose our user clients.
- We do not usually disclose our venture capital clients, nor those who invest in publicly-traded securities.
- Included in the list below are two expired Monash Advantage members who haven’t said they will renew, as mentioned in my recent post on analyst bias. (You can probably imagine a couple of reasons for that obfuscation.)
With that said, our vendor client disclosures at this time are:
- Aster Data
- SAND Technology
- Schooner Information Technology
When people talk about document-oriented NoSQL or some similar term, they usually mean something like:
Or, if they really mean,
The essence of whatever it is that CouchDB and MongoDB have in common.
well, that’s pretty much the same thing as what I said in the first place.
Of the various questions that might arise, three of the more definitional ones are:
- Why JSON rather than XML?
- What’s with this fluidity between the terms “document” and “object”?
- Are you serious about the lack of joins?
Let me take a crack at each. Read more
When I talked with MarkLogic’s Ken Chestnut about MarkLogic 4.2, I was surprised to learn that MarkLogic really, truly doesn’t do anything like a join. Unlike some other non-SQL DBMS, MarkLogic has no SQL interface, no ODBC or JDBC. Nothing, nada. (MarkLogic has a Java interface for Xquery, but not for anything like SQL.)
|Categories: CouchDB, MarkLogic, NoSQL, Structured documents, Text, Theory and architecture||7 Comments|
This post has been long in the writing for several reasons, the biggest being that I stopped working for almost a month due to family issues. Please forgive its particularly choppy writing style; having waited this long already, I now lack the patience to further clean it up.
- Is an ACID-compliant, document-oriented, non-SQL, XML-based scale-out DBMS vendor of non-trivial size and momentum.
- Still has the same technical approach I previously described.
- Recently posted an internally-written white paper with a lot of technical detail.
- Recently had a point release — MarkLogic 4.2 — a lot of which seems to be “Oh, you didn’t have that before?” kinds of stuff.
- Has given me permission to post most of the slides from same, the first few of which give a nice overview of the MarkLogic story.
- Claims 200+ each of customers and employees (that’s from a slide MarkLogic did ask me to remove from the deck).
- Is a client again.
- Not coincidentally, is interested in branching out past the vertical markets of media and government/intelligence, in particular to the financial services market.
- Has finally rationalized its company and product names so that both are now “MarkLogic.”
- Has finally grasped that if it is proud of its ACID-compliance it probably shouldn’t be trying to market itself as “NoSQL”.