October 10, 2011
Text data management, Part 1: Confusion
This is Part 1 of a three-post series. The posts cover:
- Confusion about text data management.
- Choices for text data management (general and short-request).
- Choices for text data management (analytic).
There’s much confusion about the management of text data, among technology users, vendors, and investors alike. Reasons seem to include:
- The terminology around text data is inaccurate.
- Data volume estimates for text are misleading.
- Multiple different technologies are in the mix, including:
  - Enterprise text search.
  - Text analytics — text mining, sentiment analysis, etc.
  - Document stores — e.g. document-oriented NoSQL, or MarkLogic.
  - Log management and parsing — e.g. Splunk.
  - Text archiving — e.g., various specialty email archiving products I couldn’t even name.
  - Public web search — Google et al.
- Text search vendors have disappointed, especially technically.
- Text analytics vendors have disappointed, especially financially.
- Other analytic technology vendors ignore what the text analytics vendors have actually accomplished, and reinvent inferior wheels rather than OEM the state of the art.
Above all: The use cases for text data vary greatly, just as the use cases for simply-structured databases do.
There are probably fewer people now than there were six years ago who need to be told that text and relational database management are very different things. Other misconceptions, however, appear to be on the rise. Specific points that are commonly overlooked include:
- The terms “unstructured” and “semi-structured” data are inherently misleading. That’s why I favor “multi-structured” or “poly-structured” instead. (“Multi-structured” seems to be winning; e.g., it’s been adopted by Teradata and Teradata/Aster.)
- The “social media” text data any one enterprise brings in house isn’t all that much. For example, Attensity serves many different enterprises’ social media needs from a single 20-terabyte data store, and reports that no single enterprise has required as much as 1 terabyte of text yet. Text data may consume a lot of storage on spinning disks somewhere, but it’s not that big a factor in future DBMS industry growth. (That 20-terabyte figure does seem low.)
- Structured databases are typically worth a lot more per bit than other kinds. The most valuable electronic data, per-bit, is probably records of significant economic transactions — purchases, sales, money transfers, etc. The least valuable may be sensor log files, whose contents consist mainly of “Nothing going on here; ping you again in a minute.” Email logs, web interaction data and many other kinds fall somewhere in between. Highly valuable documents — such as signed contracts — generally persist in paper as well as electronic forms. Investors commonly overlook this point.
- The enterprise text search industry is screwed up.
  - FAST was a goofy company before it was acquired for far too much money by Microsoft.
  - Autonomy was a goofy company before it was acquired for far too much money by HP.
  - Google’s enterprise efforts are quiet.
  - The integration of text search and relational DBMS — e.g. at Oracle — has languished, with poor performance and evident lack of management attention.
  - Smaller text search vendors don’t seem to be getting a lot of traction — e.g., Coveo has a decent reputation, but when’s the last time you heard much about them? What has Attivio actually accomplished?
- Text analytics is a small business. Add up the revenue for Attensity, Clarabridge, Lexalytics, Temis, and all the others, and you might poke above $100 million, especially now that Attensity has been through a three-way merger. Then again, you might not.
- Even so, the text analytics vendors have developed sophisticated technology. In particular, you can use it to get a pretty good idea as to what people are writing about you, individually or as groups.
Categories: Analytic technologies, Archiving and information preservation, Google, Log analysis, MarkLogic, NoSQL, Oracle, Splunk, Text
Comments
2 Responses to “Text data management, Part 1: Confusion”
I have used the Temis system a lot.
What would you say about its strengths and weaknesses in comparison with Lexalytics or Attensity?
Temis and Lexalytics aren’t the same thing, I’d think. Temis and Attensity are closer.
That said, I haven’t talked with the Temis folks for a few years. My first thought would be to look at Temis first for French or Italian, but Attensity for English.