October 10, 2011

Text data management, Part 2: General and short-request

This is Part 2 of a three post series. The posts cover:

  1. Confusion about text data management.
  2. Choices for text data management (general and short-request).
  3. Choices for text data management (analytic).

I’ve recently given widely varied advice about managing text (and similar files — images and so on), ranging from

Sure, just keep going with your old strategy of keeping .PDFs in the file system and pointing to them from the relational database. That’s an easy performance optimization vs. having the RDBMS manage them as BLOBs.

to

I suspect MongoDB isn’t heavyweight enough for your document management needs, let alone just dumping everything into Hadoop. Why don’t you take a look at MarkLogic?

Here are some reasons why.

There are three basic kinds of text management use case:

The simplest way to manage text electronically is to:

For example, an application for college admission is accompanied by recommendation letters, transcripts, and so on; those are then moved around as dumb payloads, until such time as an admissions officer reads them. Most relational database management systems can manage BLOBs (Binary Large OBjects), but performance may be better if you leave the big objects outside the relational system. For text-as-payload, that way of managing documents can often suffice.

In other cases, the text may be so short that it naturally fits into a character field in a relational database. This is particularly likely when the text is typed in at the time of record creation, e.g. by a call center operator, a doctor, or a customer entering a support ticket. In such cases, leaving it under the management of an RDBMS makes perfect sense.

In many situations you actually want to search based on the context of the text. Unless you’re doing simple search on short text snippets in relational character fields, that generally calls for some kind of text index. Text indexes are generally found in text search engines. That said, however:

In theory, then:

That theory is commonly reflected in actual products, such as Oracle and DB2.

As so often, then, the choice of how to manage text comes down to issues such as performance, programming ease, and other components of total cost of ownership (or of some other general “goodness” metric, such as time-to-value). As a general rule, it seems:

And so, to a first approximation:

However, that phrasing assumes the default option is a relational DBMS, which may not be the case at all. Other choices include:

I’ll cover the Hadoop option in the next, more analytically-focused post.

I hope I’ve demonstrated that there are appropriate use cases for each of:

And that’s even before we move on to analytically-oriented text data management.

Comments

5 Responses to “Text data management, Part 2: General and short-request”

  1. Text data management, Part 1: Confusion | DBMS 2 : DataBase Management System Services on October 10th, 2011 9:00 pm

    […] Choices for text data management (general and short-request). […]

  2. josephmartins on October 10th, 2011 9:53 pm

    Who, may I ask, is still debating whether it makes more sense to stuff large files into a database or keep them separate?

    I figured this was decided a long long time ago after most of us (the engineers designing the info mgmt systems for the last 20+ yrs) chose to keep large files separate to preserve performance and stuff the rest–especially anything captured at the point of creation such as the content of e-forms–inside.

    I’ve yet to find a better way to manage the variety of content types in use.

  3. Curt Monash on October 10th, 2011 10:20 pm

    I hope that was a rhetorical question. 🙂

  4. Neil Hepburn on October 18th, 2011 1:57 pm

    Curt: Are you aware of Microsoft’s FILESTREAM technology, new to SQLServer 2008?

    Basically it allows you to read and manipulate BLOBs as files, but with the full text indexing and transaction management provided by the RDBMS. So you get the performance of filesystem with the ACIDity of an RDBMS, and the searchability of a full text search engine. This technology is a big part of SharePoint 2010.

    I’d be surprised if MongoDB can provide any better performance than FILESTREAM.

  5. Curt Monash on October 19th, 2011 10:16 am

    Neil,

    Nope. Can’t say I know about FILESTREAM. From your description, I can’t tell how easily MongoDB use cases can be shoehorned into it.

    Functionality/API aside, I don’t always give large DBMS vendors the benefit of the doubt on performance for their bolt-on technologies vs. competitors’ purpose-built versions.

Leave a Reply




Feed: DBMS (database management system), DW (data warehousing), BI (business intelligence), and analytics technology Subscribe to the Monash Research feed via RSS or email:

Login

Search our blogs and white papers

Monash Research blogs

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.