I talked with Michelle de Haaff and Ian Hersey of Attensity back in February. We covered a lot of ground, so let’s start with a very high-level view.
- Two years ago, Attensity merged with two other companies in somewhat related businesses, thus expanding 4X or so in size.
- Due to the merger, Attensity now has two core lines of business:
  - Text analytics.
  - Driving actions, such as call center or social media response, based on text analytics.
- The combined Attensity is part American, part German.
- Attensity’s German part compels it to do some public financial reporting. Attensity will do $50-60 million in 2011 revenue.
- Attensity crunches text in 17 languages. English is preeminent. #2 is — you guessed it! — German.
- A big part of Attensity’s business (or at least of its value proposition) is analyzing the text in social media. Attensity boasts coverage of 75 million social media sources, such as blogs, forums, or review sites.
The four most interesting technical points were probably:
- Attensity has changed how it does exhaustive extraction. I’m having some trouble writing that part up, so for now I’ll just refer you to Attensity’s own description of the new way of doing things.
- Attensity has development work underway meant to address some of the problems in text analytics/other analytics integration. I don’t feel I got enough detail to want to talk about that yet.
- Attensity runs its own data centers, with approximately 60 Hadoop/HBase nodes and 30 nodes of Apache Solr (open source text search). More on that below.
- Attensity now OEMs Vertica. More on that below too.
Some more specific notes include:
- Attensity has long had customers who use text analytics as an input into churn analysis, for example Charles Schwab.
- At least one customer, who may or may not wish to be named, uses Attensity technology to help de-anonymize social media posters. (I didn’t ask how that worked, actually.)
- Attensity’s founding CTO David Bean has been gone for a while.
- Social media analyzers generally require less sophisticated analytics than Attensity’s older kinds of customers.
- Social media has, in part, a vocabulary all its own.
Attensity and relational DBMS
Notes on Attensity’s choice of DBMS to OEM include:
- Attensity uses Hadoop/HBase itself, but didn’t consider it realistic to try to persuade OEM customers to go that way.
- I get the impression that Attensity’s two finalists were Vertica and Sybase IQ.
- Attensity seems to have considered only query, and not more general analytic platform capabilities, which makes sense given that the evaluation began in 2009.
- One reason Vertica won was that it required very little tuning.
- Another reason Vertica won was true MPP scale-out, notwithstanding that the largest known installation is capable of running on two nodes (although Attensity recommended that the customer get four just to be on the safe side).
- Sybase IQ’s load speed was even better than Vertica’s.
- Database max-size-to-date metrics include:
  - Under 1 terabyte.
  - 50 million documents (not rows), growing by 1 million documents per day.
  - Several hundred million sentences (I guess the documents are short, but it makes sense that they would be).
  - Several billion rows.
It seems there are two parts to the Attensity schema. The raw output of “exhaustive extraction” sounds as if it has rather narrow rows. But Attensity then builds something more star-schema-like to feed into BI tools. Perhaps the latter is the reason for preferring a columnar DBMS. There don’t seem to be a lot of auxiliary tables; the only ones Ian cited were:
- Category tables that hold the ontology, at up to a couple thousand rows.
- Tables of terms.
- Structured fields that provide metadata for the triples.
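To make the two-part shape concrete, here is a minimal sketch using Python's built-in `sqlite3` module. Every table name, column name, and sample row is hypothetical; Attensity's actual schema was not described in that much detail. It only illustrates the general idea of a narrow table of extracted triples, a few small auxiliary tables, and a star-schema-like aggregate query of the kind a BI tool might issue.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Narrow "extraction" table: one row per extracted triple (hypothetical layout).
cur.execute("""
    CREATE TABLE triple (
        doc_id INTEGER, sentence_id INTEGER,
        subject_term INTEGER, relation_term INTEGER, object_term INTEGER
    )""")

# Small auxiliary tables, loosely modeled on the three kinds listed above.
cur.execute("CREATE TABLE category (category_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE term (term_id INTEGER PRIMARY KEY, text TEXT, category_id INTEGER)")
cur.execute("CREATE TABLE doc_meta (doc_id INTEGER PRIMARY KEY, source TEXT)")

# Made-up sample data, purely for illustration.
cur.executemany("INSERT INTO category VALUES (?, ?)",
                [(1, "product"), (2, "complaint")])
cur.executemany("INSERT INTO term VALUES (?, ?, ?)",
                [(1, "customer", None), (2, "dislike", None),
                 (3, "battery", 1), (4, "refund", 2), (5, "want", None)])
cur.executemany("INSERT INTO doc_meta VALUES (?, ?)",
                [(1, "blog"), (2, "forum")])
cur.executemany("INSERT INTO triple VALUES (?, ?, ?, ?, ?)",
                [(1, 1, 1, 2, 3), (2, 1, 1, 5, 4), (2, 2, 1, 2, 3)])

# Star-schema-like query: join the narrow fact rows to the small
# dimension tables and aggregate by category and source.
rows = cur.execute("""
    SELECT c.name, m.source, COUNT(*)
    FROM triple t
    JOIN term o ON o.term_id = t.object_term
    JOIN category c ON c.category_id = o.category_id
    JOIN doc_meta m ON m.doc_id = t.doc_id
    GROUP BY c.name, m.source
    ORDER BY c.name, m.source
""").fetchall()

for row in rows:
    print(row)
```

Queries like the last one touch only a few columns of a very long, narrow table, which is the access pattern columnar systems are generally built for.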
Previous Attensity database targets (partner, not OEM) included Teradata, SQL Server, Oracle, and MySQL. Hibernate layers were in the mix somewhere too. SQL Server actually had the best performance. I don’t think that’s counting a more recent Sybase IQ partnership, which only racked up a couple of sales.
Attensity, Hadoop, and other non-relational technologies
But that’s OEM. Attensity runs its own data centers, with approximately 60 Hadoop/HBase nodes and 30 nodes of Apache Solr (open source text search).* One reason for moving out of Amazon EC2 was that Solr cried out for solid-state drives; another was just cost.
*But those are just rough figures, from Ian’s memory.
Attensity uses HBase to store full-text documents. However, it doesn’t seem that this is a classic low-latency update HBase use case; Attensity reports doing 3 loads a day, 50 gigabytes of documents total. Apparently that works out to 1 billion documents/month; I gather Attensity just keeps them for 6 months. HBase has been nicely stable for Attensity.
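As a rough sanity check on those figures, here is a back-of-the-envelope sketch. It assumes that the 50 gigabytes is the daily total across all three loads (the figure could also be read as per load), that a month has 30 days, and that sizes are raw document bytes before any HBase replication or compression. All variable names are just for illustration.

```python
GB = 10**9

daily_bytes = 50 * GB             # assumption: 50 GB is the total per day
loads_per_day = 3                 # from the figures above
docs_per_month = 1_000_000_000    # from the figures above
days_per_month = 30               # assumption
retention_months = 6              # from the figures above

docs_per_day = docs_per_month / days_per_month
docs_per_load = docs_per_day / loads_per_day
avg_doc_bytes = daily_bytes / docs_per_day

retained_docs = docs_per_month * retention_months
retained_bytes = daily_bytes * days_per_month * retention_months

print(f"documents per day:  {docs_per_day:,.0f}")
print(f"documents per load: {docs_per_load:,.0f}")
print(f"average doc size:   {avg_doc_bytes:,.0f} bytes")
print(f"retained documents: {retained_docs:,}")
print(f"retained raw data:  {retained_bytes / 10**12:.1f} TB")
```

Under those assumptions the average document comes out at about 1.5 kilobytes, which fits the picture of short social media posts, and six months of retention is on the order of 6 billion documents and 9 terabytes of raw text.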
Attensity uses Solr to build distributed search indexes. Solr has not been nicely stable.
What Attensity does in Hadoop seems to be rather simple NLP (Natural Language Processing), plus things one might do in a relational DBMS instead. Examples include:
- Named entity extraction.
- Scoring for sentiment.
- Influence scores, in whatever ways Attensity can calculate them.
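For readers unfamiliar with how this kind of work maps onto Hadoop, here is a toy map/reduce-style sketch in plain Python. It is not Attensity's method: the word lists, the `map_doc` and `reduce_scores` functions, and the sample documents are all made up, and a real job would run on the cluster with far more sophisticated NLP. It only shows the general shape of scoring each document independently and then aggregating.

```python
from collections import defaultdict

# Tiny hypothetical sentiment lexicon, for illustration only.
POSITIVE = {"love", "great", "good"}
NEGATIVE = {"hate", "slow", "broken"}

def map_doc(doc):
    """Mapper: emit (source, sentiment score) for one document."""
    words = doc["text"].lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    yield doc["source"], score

def reduce_scores(pairs):
    """Reducer: sum the sentiment scores for each source."""
    totals = defaultdict(int)
    for key, score in pairs:
        totals[key] += score
    return dict(totals)

docs = [
    {"source": "blog", "text": "I love this phone great screen"},
    {"source": "forum", "text": "Battery is broken and support is slow"},
    {"source": "forum", "text": "Good price"},
]

pairs = [pair for doc in docs for pair in map_doc(doc)]
totals = reduce_scores(pairs)
print(totals)
```

Because each document is scored on its own, the map step parallelizes trivially across nodes, and the per-source rollup is the sort of GROUP BY one could equally well run in a relational DBMS.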
There surely also is some basic preprocessing, ingesting text (and document metadata) in various forms and normalizing it into a more standard format. Some real-time ingesting is done outside of Hadoop, in more of a queuing system, the most obvious example of that being the Twitter firehose. Ian suggested that in the future this system will get more uses, in the form of a UIMA-like pipeline.
I further get the impression that Attensity uses Hadoop to do on a SaaS (Software As A Service) basis what its customers do in Vertica. The old idea that Attensity provides hosted services for about half its customers still seems to apply, at least on the new-customer front. However, I’m not sure exactly which product lines Attensity was referring to when they said that.