Text data management, Part 2: General and short-request
This is Part 2 of a three post series. The posts cover:
- Confusion about text data management.
- Choices for text data management (general and short-request).
- Choices for text data management (analytic).
I’ve recently given widely varied advice about managing text (and similar files — images and so on), ranging from
Sure, just keep going with your old strategy of keeping .PDFs in the file system and pointing to them from the relational database. That’s an easy performance optimization vs. having the RDBMS manage them as BLOBs.
to
I suspect MongoDB isn’t heavyweight enough for your document management needs, let alone just dumping everything into Hadoop. Why don’t you take a look at MarkLogic?
Here are some reasons why.
There are three basic kinds of text management use case:
- Text as payload.
- Text as search parameter.
- Text as analytic input.
The simplest way to manage text electronically is to:
- Store it as whole documents — scanned images, .PDFs, word processing files, whatever.
- Find it only via fielded metadata, perhaps manually created — title, author, date, and so on.
For example, an application for college admission is accompanied by recommendation letters, transcripts, and so on; those are then moved around as dumb payloads, until such time as an admissions officer reads them. Most relational database management systems can manage BLOBs (Binary Large OBjects), but performance may be better if you leave the big objects outside the relational system. For text-as-payload, that way of managing documents can often suffice.
In other cases, the text may be so short that it naturally fits into a character field in a relational database. This is particularly likely when the text is typed in at the time of record creation, e.g. by a call center operator, a doctor, or a customer entering a support ticket. In such cases, leaving it under the management of an RDBMS makes perfect sense.
In many situations you actually want to search based on the context of the text. Unless you’re doing simple search on short text snippets in relational character fields, that generally calls for some kind of text index. Text indexes are generally found in text search engines. That said, however:
- Anything that can be done in a standalone text search engine can in principle also be integrated into relational DBMS and other data stores. (How well that works in practice is another matter.)
- Text search engines commonly index data in situ that they don’t actually manage.
In theory, then:
- You can store your text in your chosen DBMS or outside it, as pleases you.
- Your chosen DBMS can have a text search capability.
- This text search capability can be integrated with the query method used to get at the rest of the data managed by that DBMS.
That theory is commonly reflected in actual products, such as Oracle and DB2.
As so often, then, the choice of how to manage text comes down to issues such as performance, programming ease, and other components of total cost of ownership (or of some other general “goodness” metric, such as time-to-value). As a general rule, it seems:
- Text indexing inside relational DBMS has poorer performance than in, say, text search engines, often drastically so.
- BLOB management inside relational databases has poorer performance than leaving the files outside the DBMS’ purview.
- Relational DBMS do just fine at managing text strings up to, say, 2048 characters long.
- Tight integration between text search and SQL is valuable in a few applications, but irrelevant to many others.
And so, to a first approximation:
- If you just have short text snippets, it can make sense to leave them in your relational database.
- If performance is not an issue, you can just leave your BLOBs in your relational database too.
- If performance is an issue, you probably want to have your larger text files outside your RDBMS’ control.
However, that phrasing assumes the default option is a relational DBMS, which may not be the case at all. Other choices include:
- Standalone text search engines. If you want the best available text search, get a search engine. But attempts to start with a search engine and wind up with an application platform have generally run into difficulty.
- Document-oriented (or other) NoSQL systems. The story here is surprisingly like that for relational DBMS. I’ve previously noted that document-oriented NoSQL systems manage objects, not “documents” in the ordinary sense of the word. Even so, conceptually they’re no less suited for management of true documents than relational DBMS are. I’d guess that the correlation between use cases involving true documents and use cases where document-oriented NoSQL is suitable is positive, but not very strong.
- MarkLogic. From one standpoint, MarkLogic is just a heavier-weight version of document-oriented NoSQL. But MarkLogic’s XML (and XQuery) orientation, tuned-for-years indexing, and built-in search engine put it on a different level for document management than the upstarts have reached.
I’ll cover the Hadoop option in the next, more analytically-focused post.
I hope I’ve demonstrated that there are appropriate use cases for each of:
- Letting documents be managed by the file system (and pointing to them from your preferred DBMS).
- Sticking documents straight into your preferred DBMS (SQL or non-SQL as the case may be).
- Using a specialty system such as MarkLogic (or of course, in some cases, an enterprise search engine).
And that’s even before we move on to analytically-oriented text data management.
Comments
5 Responses to “Text data management, Part 2: General and short-request”
Leave a Reply
[…] Choices for text data management (general and short-request). […]
Who, may I ask, is still debating whether it makes more sense to stuff large files into a database or keep them separate?
I figured this was decided a long long time ago after most of us (the engineers designing the info mgmt systems for the last 20+ yrs) chose to keep large files separate to preserve performance and stuff the rest–especially anything captured at the point of creation such as the content of e-forms–inside.
I’ve yet to find a better way to manage the variety of content types in use.
I hope that was a rhetorical question. 🙂
Curt: Are you aware of Microsoft’s FILESTREAM technology, new to SQLServer 2008?
Basically it allows you to read and manipulate BLOBs as files, but with the full text indexing and transaction management provided by the RDBMS. So you get the performance of filesystem with the ACIDity of an RDBMS, and the searchability of a full text search engine. This technology is a big part of SharePoint 2010.
I’d be surprised if MongoDB can provide any better performance than FILESTREAM.
Neil,
Nope. Can’t say I know about FILESTREAM. From your description, I can’t tell how easily MongoDB use cases can be shoehorned into it.
Functionality/API aside, I don’t always give large DBMS vendors the benefit of the doubt on performance for their bolt-on technologies vs. competitors’ purpose-built versions.