MarkLogic
Analysis of Mark Logic and its Marklogic Server search-friendly XML DBMS product. Related subjects include:
- Native XML database management
- Text data management
- (in Text Technologies) Mark Logic viewed from a text search perspective
Big data terminology and positioning
Recently, I observed that Big Data terminology is seriously broken. It is reasonable to reduce the subject to two quasi-dimensions:
- Bigness — Volume, Velocity, size
- Structure — Variety, Variability, Complexity
given that
- High-velocity “big data” problems are usually high-volume as well.*
- Variety, variability, and complexity all relate to the simply-structured/poly-structured distinction.
But the conflation should stop there.
*Low-volume/high-velocity problems are commonly referred to as “event processing” and/or “streaming”.
When people claim that bigness and structure are the same issue, they oversimplify into mush. So I think we need four pieces of terminology, reflective of a 2×2 matrix of possibilities. For want of better alternatives, my suggestions are:
- Relational big data is data of high volume that fits well into a relational DBMS.
- Multi-structured big data is data of high volume that doesn’t fit well into a relational DBMS. Alternative: Poly-structured big data.
- Conventional relational data is data of not-so-high volume that fits well into a relational DBMS. Alternatives: Ordinary/normal/smaller relational data.
- Smaller poly-structured data is data for which dynamic schema capabilities are important, but which doesn’t rise to “big data” volume.
MarkLogic’s Hadoop connector
It’s time to circle back to a subject I skipped when I otherwise wrote about MarkLogic 5: MarkLogic’s new Hadoop connector.
Most of what’s confusing about the MarkLogic Hadoop Connector lies in two pairs of options it presents you:
- Hadoop can talk XQuery to MarkLogic. But alternatively, Hadoop can use a long-established simple(r) Java API for streaming documents into or out of a MarkLogic database.
- Hadoop can make requests to MarkLogic in MarkLogic’s normal mode of operation, namely to address any node in the MarkLogic cluster, which then serves as a “head” node for the duration of that particular request. But alternatively, Hadoop can use a long-standing MarkLogic option to circumvent the whole DBMS cluster and only talk to one specific MarkLogic node.
Otherwise, the whole thing is just what you would think:
- Hadoop can read from and write to MarkLogic, in parallel at both ends.
- If Hadoop is just writing to MarkLogic, there’s a good chance the process is properly called “ETL.”
- If Hadoop is reading a lot from MarkLogic, there’s a good chance the process is properly called “batch analytics.”
MarkLogic said that it wrote this Hadoop connector itself.
| Categories: Clustering, EAI, EII, ETL, ELT, ETLT, Hadoop, MapReduce, MarkLogic, Parallelization, Workload management | 2 Comments |
MarkLogic 5, and why you might care
MarkLogic is releasing MarkLogic 5. Key elements of the announcement are:
- More-of-the-same in line with MarkLogic’s core positioning.
- A new bi-directional Hadoop connector.
- A free MarkLogic Express edition, limited in license terms more than in actual features, as per Slide 27 of the deck MarkLogic graciously supplied for me to post.
Also, MarkLogic is early with a feature that most serious DBMS vendors will soon have – support for tiered storage, with writes going first to solid-state storage, then being flushed to disk via a caching-style algorithm.* And as befits a sometime search-engine-substitute, MarkLogic has finally licensed a large set of document filters, from an Australian company called Isys. Apparently, the special virtue of the Isys filters is that they’re good at extracting not only text, but metadata as well.
*If there’s a caching algorithm that doesn’t contain a major element of LRU (Least Recently Used), I don’t recall ever hearing about it.
MarkLogic seems to have settled on a positioning that, although distressingly buzzword-heavy, is at least partly based upon reality. The real part includes:
- MarkLogic is a serious, enterprise-class DBMS (see for example Slide 12 of the MarkLogic deck) …
- … which has been optimized from the getgo for poly-structured data.
- MarkLogic can and does scale out to handle large amounts of data.
- MarkLogic is a general-purpose DBMS, suitable for both short-request and analytic tasks.
- MarkLogic is particularly well suited for analyses with long chains of “progressive enhancement” (MarkLogic’s favorite term when talking about derived data).
- MarkLogic often plays the role of a content assembler and/or search engine, and the people who use MarkLogic in those ways are commonly doing things that can be described as research and analysis.
Based on that reality, MarkLogic talks a lot about Volume, Velocity, Variety, Big Data, unstructured data, semi-structured data, and big data analytics.
| Categories: Hadoop, MarkLogic, Market share and customer counts, Scientific research, Solid-state memory, Structured documents, Text | 1 Comment |
Text data management, Part 2: General and short-request
This is Part 2 of a three post series. The posts cover:
- Confusion about text data management.
- Choices for text data management (general and short-request).
- Choices for text data management (analytic).
I’ve recently given widely varied advice about managing text (and similar files — images and so on), ranging from
Sure, just keep going with your old strategy of keeping .PDFs in the file system and pointing to them from the relational database. That’s an easy performance optimization vs. having the RDBMS manage them as BLOBs.
to
I suspect MongoDB isn’t heavyweight enough for your document management needs, let alone just dumping everything into Hadoop. Why don’t you take a look at MarkLogic?
Here are some reasons why.
There are three basic kinds of text management use case:
- Text as payload.
- Text as search parameter.
- Text as analytic input.
| Categories: MarkLogic, NoSQL, Text | 5 Comments |
Text data management, Part 1: Confusion
This is Part 1 of a three post series. The posts cover:
- Confusion about text data management.
- Choices for text data management (general and short-request).
- Choices for text data management (analytic).
There’s much confusion about the management of text data, among technology users, vendors, and investors alike. Reasons seems to include:
- The terminology around text data is inaccurate.
- Data volume estimates for text are misleading.
- Multiple different technologies are in the mix, including:
- Enterprise text search.
- Text analytics — text mining, sentiment analysis, etc.
- Document stores — e.g. document-oriented NoSQL, or MarkLogic.
- Log management and parsing — e.g. Splunk.
- Text archiving — e.g., various specialty email archiving products I couldn’t even name.
- Public web search — Google et al.
- Text search vendors have disappointed, especially technically.
- Text analytics vendors have disappointed, especially financially.
- Other analytic technology vendors ignore what the text analytic vendors actually have accomplished, and reinvent inferior wheels rather than OEM the state of the art.
Above all: The use cases for text data vary greatly, just as the use cases for simply-structured databases do.
There are probably fewer people now than there were six years ago who need to be told that text and relational database management are very different things. Other misconceptions, however, appear to be on the rise. Specific points that are commonly overlooked include: Read more
| Categories: Analytic technologies, Archiving and information preservation, Google, Log analysis, MarkLogic, NoSQL, Oracle, Splunk, Text | 2 Comments |
Defining NoSQL
A reporter tweeted: ”Is there a simple plain English definition for NoSQL?” After reminding him of my cynical yet accurate Third Law of Commercial Semantics, I gave it a serious try, and came up with the following. More precisely, I tweeted the bolded parts of what’s below; the rest is commentary added for this post.
NoSQL is most easily defined by what it excludes: SQL, joins, strong analytic alternatives to those, and some forms of database integrity. If you leave all four out, and you have a strong scale-out story, you’re in the NoSQL mainstream. Read more
| Categories: Cassandra, MarkLogic, MySQL, Object, Open source, Petabyte-scale data management, Schooner Information Technology, dbShards and CodeFutures | 6 Comments |
Derived data, progressive enhancement, and schema evolution
The emphasis I’m putting on derived data is leading to a variety of questions, especially about how to tease apart several related concepts:
- Derived data.
- Many-step processes to produce derived data.
- Schema evolution.
- Temporary data constructs.
So let’s dive in. Read more
| Categories: Data models and architecture, Data warehousing, MarkLogic, Text | Leave a Comment |
Another category of derived data
Six months ago, I argued the importance of derived analytic data, saying
… there’s no escaping the importance of derived/augmented/enhanced/cooked/adjusted data for analytic data processing. The five areas I have in mind are, loosely speaking:
- Aggregates, when they are maintained, generally for reasons of performance or response time.
- Calculated scores, commonly based on data mining/predictive analytics.
- Text analytics.
- The kinds of ETL (Extract/Transform/Load) Hadoop and other forms of MapReduce are commonly used for.
- Adjusted data, especially in scientific contexts.
Probably there are yet more examples that I am at the moment overlooking.
Well, I did overlook at least one category.
A surprisingly important kind of derived data is metadata, especially for large, poly-structured data sets. For example, CERN has vastly quantities of experiment sensor data, stored as files; just the metadata alone fills over 10 terabytes in an Oracle database. MarkLogic is big on storing derived metadata, both on the publishing/media and intelligence sides of the business.
| Categories: Data models and architecture, Hadoop, MarkLogic | 1 Comment |
What to do about “unstructured data”
We hear much these days about unstructured or semi-structured (as opposed to) structured data. Those are misnomers, however, for at least two reasons. First, it’s not really the data that people think is un-, semi-, or fully structured; it’s databases.* Relational databases are highly structured, but the data within them is unstructured — just lists of numbers or character strings, whose only significance derives from the structure that the database imposes.
*Here I’m using the term “database” literally, rather than as a concise synonym for “database management system”. But see below.
Second, a more accurate distinction is not whether a database has one structure or none – it’s whether a database has one structure or many. The easiest way to see this is for databases that have clearly-defined schemas. A relational database has one schema (even if it is just the union of various unrelated sub-schemas); an XML database, however, can have as many schemas as it contains documents.
One small terminological problem is easily handled, namely that people don’t talk about true databases very often, at least when they’re discussing generalities; rather, they talk about data and DBMS.* So let’s talk of DBMS being “structured” singly or multiply or whatever, just as the databases they’re designed to manage are.
*And they refer to the DBMS as “databases,” because they don’t have much other use for the word.
All that said — I think that single vs. multiple database structures isn’t a bright-line binary distinction; rather, it’s a spectrum. For example: Read more
| Categories: Cassandra, Couchbase, Data models and architecture, HBase, IBM and DB2, MarkLogic, MongoDB and 10gen, NoSQL, Splunk, Theory and architecture | 18 Comments |
Whither MarkLogic?
My clients at MarkLogic have a new CEO, Ken Bado, even though former CEO Dave Kellogg was quite successful. If you cut through all the happy talk and side issues, the reason for the change is surely that the board wants to see MarkLogic grow faster, and specifically to move beyond its traditional niches of publishing (especially technical publishing) and national intelligence.
So what other markets could MarkLogic pursue? Before Ken even started work, I sent over some thoughts. They included (but were not limited to): Read more
| Categories: MarkLogic, Object, RDF and graphs, Structured documents | 4 Comments |
