This is Part 3 of a three-post series. The posts cover:
- Confusion about text data management.
- Choices for text data management (general and short-request).
- Choices for text data management (analytic).
I’ve gone on for two long posts about text data management already, but even so I’ve glossed over a major point:
Using text data commonly involves a long series of data enhancement steps.
Even before you do what we’d normally think of as “analysis”, text markup can include steps such as:
- Figure out where the words break.
- Figure out where the clauses and sentences break.
- Figure out where the paragraphs, sections, and chapters break.
- (Where necessary) map the words to similar ones — spelling correction, stemming, etc.
- Figure out which words are grammatically which parts of speech.
- Figure out which pronouns and so on refer to which other words. (Technical term: Anaphora resolution.)
- Figure out what was being said, one clause at a time.
- Figure out the emotion — or “sentiment” — associated with it.
Those processes can add up to dozens of steps. And maybe, six months down the road, you’ll think of more steps yet.
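To make that concrete, here is a minimal sketch in Python of what such an enrichment pipeline can look like. Everything in it (the step functions, the field names, the toy sentiment lexicon) is a hypothetical stand-in rather than any vendor's actual API; the point is the shape, in which each step reads the document and annotates it with new fields.

```python
import re

# Hypothetical enrichment steps. Each one reads a document (a plain dict)
# and adds new fields; nothing that came before is removed or reshaped.

def split_words(doc):
    doc["words"] = re.findall(r"\w+", doc["text"])
    return doc

def split_sentences(doc):
    doc["sentences"] = re.split(r"(?<=[.!?])\s+", doc["text"].strip())
    return doc

def stem_words(doc):
    # Crude stand-in for real stemming / spelling correction.
    doc["stems"] = [w.lower().rstrip("s") for w in doc["words"]]
    return doc

def score_sentiment(doc):
    # Toy lexicon lookup, standing in for a real sentiment model.
    positive = {"good", "great", "love", "happy"}
    negative = {"bad", "awful", "hate", "crash"}
    doc["sentiment"] = sum(
        (s in positive) - (s in negative) for s in doc["stems"]
    )
    return doc

PIPELINE = [split_words, split_sentences, stem_words, score_sentiment]

def enrich(doc):
    for step in PIPELINE:
        doc = step(doc)
    return doc

print(enrich({"id": 1, "text": "I love this product. The manual is bad."}))
```

Notice what happens six months down the road: a new step is just one more function appended to the pipeline, and each step adds fields the original schema never anticipated. That is exactly the pressure toward dynamic schemas.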
So when you manage text, it is convenient to assume dynamic schemas. That would be an argument for using MarkLogic, NoSQL document stores, and/or Hadoop, rather than strictly relational systems.
That said, text analytics can be done perfectly well in relational databases. Again, I point you to the example of Attensity, which will extract for you a large fraction of the information that can be gotten out of the text, put it into a convenient relational schema, and let you get to work. Once the principal extraction has been done, there’s no reason why your derived data issues need be any more complex than others you deal with relationally, especially on the analytic side of the house.
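Purely as an illustration of that relational landing pattern (this is a sketch, not Attensity's actual schema), the extracted output might land in tables like these, shown here via Python's sqlite3:

```python
import sqlite3

# Hypothetical landing schema for extracted text facts; not any vendor's
# actual design, just the general relational pattern.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (
    doc_id     INTEGER PRIMARY KEY,
    source     TEXT,
    loaded_at  TEXT
);
CREATE TABLE extracted_facts (
    fact_id    INTEGER PRIMARY KEY,
    doc_id     INTEGER REFERENCES documents(doc_id),
    entity     TEXT,   -- e.g. a product name pulled from the text
    attribute  TEXT,   -- e.g. "battery life"
    sentiment  REAL    -- score attached to that clause
);
""")

conn.execute("INSERT INTO documents VALUES (1, 'support_email', '2011-02-01')")
conn.execute(
    "INSERT INTO extracted_facts VALUES (1, 1, 'WidgetPro', 'battery life', -0.8)"
)

# Once the text is in this shape, it is ordinary analytic SQL from here on.
for row in conn.execute(
    "SELECT entity, AVG(sentiment) FROM extracted_facts GROUP BY entity"
):
    print(row)
```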
But what if you want to do your own text enhancement, rather than using a third party tool? The first thing to ask yourself is — why? With all due respect to the 10-20 internet-centric companies that are having fun reinventing large portions of the data processing wheel — if you’re not at one of those companies, you should probably be trying to use as much third-party software as you possibly can.
I can think of a couple of cases where rolling your own technology makes sense, namely:
- The hard part of what you’re doing is extracting snippets of text from some data format proprietary to you (sketched below).
- You’re trying to do very simple things across a variety of languages much broader than the 10-20 that the text analytics vendors currently do a halfway decent job of handling.
I can’t think of many others.
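To make the first of those cases concrete: suppose your text is trapped in a line-oriented internal format that no vendor parses. The format below is invented for the example; the point is that only the extraction step is custom, and once you have plain snippets plus metadata you can hand them to off-the-shelf tools.

```python
# Hypothetical in-house format: records separated by "%%", each record a
# header line ("id|author|date") followed by free text. No vendor parses
# this, so the extraction step is the one piece worth writing yourself.

def extract_snippets(raw):
    for record in raw.split("%%"):
        record = record.strip()
        if not record:
            continue
        header, _, body = record.partition("\n")
        rec_id, author, date = header.split("|")
        yield {"id": rec_id, "author": author, "date": date,
               "text": body.strip()}

sample = """1|jsmith|2011-01-03
The new release keeps crashing on startup.
%%
2|akumar|2011-01-04
Upgrade went smoothly, very happy with it.
"""

for snippet in extract_snippets(sample):
    print(snippet["id"], snippet["text"][:40])
```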
One thing I’d definitely be wary of is using Hadoop as a big bit bucket for individual documents in a variety of formats. I don’t know what you’d do with them once they’re there. Yes, Google invented MapReduce in part to do things like document indexing — but you’d probably prefer not to reinvent the Google stack. That’s quite apart from questions as to whether your document count exceeds Hadoop’s comfortable file-count limit. Solr is a different matter; but while Solr and Hadoop are both open source projects that can be traced back to Doug Cutting, they’re otherwise rather different things.
A useful way of looking at your choices may be to ask:
After text has run through the main pipeline of manipulation and information extraction:
- What will the output look like?
- Where do I want that output to end up?
The answers suggest destinations:
- If the output has to be something that fits into a structured/relational analytic system, it should probably go into a relational DBMS.
- If you’re going to do social network analysis of the sort you’d ideally like to do in a graph database, then (unless you’re an intelligence agency with blank-check resources) you’ll probably still end up opting for a relational DBMS.
- If the output consists of simple, homogeneous text files, plus a few fields of metadata, and you’re not going to do much analysis of it, it can pretty much go anywhere; either SQL or NoSQL might suit your purposes.
- If you want maximum power and flexibility, MarkLogic may be the ideal destination.
From there, the next question is:
- What pipeline should the text run through to get to its final destination?
Often, as I’ve argued, the right answer is a third-party text analytic system. Those can generally consume text in almost any kind of file format. Other times (less often than you may think) it’s Hadoop; OK, then pass it through Hadoop. Other possibilities could come up as well; text search engines, for instance, aren’t really as unusual a choice as I may have seemed to be suggesting.
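And if Hadoop is the right pipe, “passing text through it” can be as simple as a Hadoop Streaming job: the mapper is any executable that reads raw text on stdin and writes tab-separated key/value lines on stdout. A minimal sketch (the per-token logic is a stand-in for real enrichment):

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming mapper: reads raw text lines on stdin and
# emits tab-separated key/value pairs on stdout. Hadoop handles input
# splitting, shuffling keys to reducers, and writing the output.
import re
import sys

for line in sys.stdin:
    for word in re.findall(r"\w+", line.lower()):
        # Stand-in for real per-token enrichment; here we just count.
        sys.stdout.write("%s\t%d\n" % (word, 1))
```

You would launch it with something like `hadoop jar hadoop-streaming.jar -input in/ -output out/ -mapper mapper.py -file mapper.py -reducer reducer.py`, with an equally simple reducer that sums the counts per key.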
Anyhow, when you’ve established where text starts out (that’s usually a given), what it passes through (please see above), and where its best parts need to end up (ditto), you’ve done the hardest parts. Figuring out the rest of your text management architecture should be relatively easy by comparison.