October 10, 2011

Text data management, Part 1: Confusion

This is Part 1 of a three-post series.

There’s much confusion about the management of text data, among technology users, vendors, and investors alike. Reasons seem to include:

The terminology around text data is inaccurate.
Data volume estimates for text are misleading.
Multiple different technologies are in the mix, including:
- Enterprise text search.
- Text analytics — text mining, sentiment analysis, etc.
- Document stores — e.g. document-oriented NoSQL, or MarkLogic.
- Log management and parsing — e.g. Splunk.
- Text archiving — e.g., various specialty email archiving products I couldn’t even name.
- Public web search — Google et al.
Text search vendors have disappointed, especially technically.
Text analytics vendors have disappointed, especially financially.
Other analytic technology vendors ignore what the text analytics vendors have actually accomplished, and reinvent inferior wheels rather than OEM the state of the art.

Above all: The use cases for text data vary greatly, just as the use cases for simply-structured databases do.

There are probably fewer people now than there were six years ago who need to be told that text and relational database management are very different things. Other misconceptions, however, appear to be on the rise. Specific points that are commonly overlooked include:

The terms “unstructured” and “semi-structured” data are inherently misleading. That’s why I favor “multi-structured” or “poly-structured” instead. (“Multi-structured” seems to be winning; e.g., it’s been adopted by Teradata and Teradata/Aster.)
The “social media” text data any one enterprise brings in house isn’t all that much. For example, Attensity serves many different enterprises’ social media needs from a single 20-terabyte data store, and reports that no single enterprise has required as much as 1 terabyte of text yet. Text data may consume a lot of storage on spinning disks somewhere, but it’s not that big a factor in future DBMS industry growth. (That 20 terabyte figure does seem low.)
Structured databases are typically worth a lot more per bit than other kinds. The most valuable electronic data, per-bit, is probably records of significant economic transactions — purchases, sales, money transfers, etc. The least valuable may be sensor log files, whose contents consist mainly of “Nothing going on here; ping you again in a minute.” Email logs, web interaction data and many other kinds fall somewhere in between. Highly valuable documents — such as signed contracts — generally persist in paper as well as electronic forms. Investors commonly overlook this point.
The enterprise text search industry is screwed up.
- FAST was a goofy company before it was acquired for far too much money by Microsoft.
- Autonomy was a goofy company before it was acquired for far too much money by HP.
- Google’s enterprise efforts are quiet.
- The integration of text search and relational DBMS — e.g. at Oracle — has languished, with poor performance and evident lack of management attention.
- Smaller text search vendors don’t seem to be getting a lot of traction — e.g., Coveo has a decent reputation, but when’s the last time you heard much about them? What has Attivio actually accomplished?
Text analytics is a small business. Add up the revenue for Attensity, Clarabridge, Lexalytics, Temis, and all the others, and you might poke above $100 million, especially now that Attensity has gone through a three-way merger. Then again, you might not.
Even so, the text analytics vendors have developed sophisticated technology. In particular, you can use it to get a pretty good idea as to what people are writing about you, individually or as groups.

Categories: Analytic technologies, Archiving and information preservation, Google, Log analysis, MarkLogic, NoSQL, Oracle, Splunk, Text

Comments

2 Responses to “Text data management, Part 1: Confusion”

Mike on November 10th, 2011 5:23 pm

I have used the Temis system a lot.
What would you say about its strengths and weaknesses in comparison with Lexalytics or Attensity?

Curt Monash on November 10th, 2011 8:20 pm

Temis and Lexalytics aren’t the same thing, I’d think. Temis and Attensity are closer.

That said, I haven’t talked with the Temis folks for a few years. My first thought would be to look at Temis first for French or Italian, but Attensity for English.
