I talked with the Sumo Logic folks for an hour Thursday. Highlights included:
- Sumo Logic does SaaS (Software as a Service) log management.
- Sumo Logic is text indexing/Lucene-based. Thus, it is reasonable to think of Sumo Logic as “Splunk-like”. (However, Sumo Logic seems to have a stricter security/trouble-shooting orientation than Splunk, which is trying to branch out.)
- Sumo Logic has hacked Lucene for faster indexing, and says 10-30 second latencies are typical.
- Sumo Logic’s main differentiation is automated classification of events.
- There’s some kind of streaming engine in the mix, to update counters and drive alerts.
- Sumo Logic has around 30 "customers," mostly free, with around 5 paying.
- A truly typical Sumo Logic customer has single to low double digits of gigabytes of log data per day. However, Sumo Logic seems highly confident in its ability to handle a terabyte per customer per day, give or take a factor of 2.
- When I asked about the implications of shipping that much data to a remote data center, Sumo Logic observed that log data compresses really well.
- Sumo Logic recently raised a bunch of venture capital.
- Sumo Logic’s founders are out of ArcSight, a log management company HP paid a bunch of money for.
- Sumo Logic coined a marketing term “LogReduce”, but it has nothing to do with “MapReduce”. Sumo Logic seems to find this amusing.
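The compression point is worth a quick illustration: machine-generated logs are highly repetitive, so standard compressors shrink them dramatically. A minimal Python sketch, using made-up web-server-style log lines (nothing Sumo Logic actually showed me):

```python
import zlib

# Hypothetical, repetitive log lines for illustration only --
# not real customer data and not Sumo Logic's pipeline.
lines = [
    f"2011-0{1 + i % 9}-12 10:{i % 60:02d}:{(i * 7) % 60:02d} INFO "
    f"request handled path=/api/v1/items status=200 latency_ms={i % 250}"
    for i in range(10_000)
]
raw = "\n".join(lines).encode()
packed = zlib.compress(raw, level=9)

print(f"raw: {len(raw):,} bytes, compressed: {len(packed):,} bytes, "
      f"ratio: {len(raw) / len(packed):.1f}x")
```

Real ratios depend on the log format, of course, but the repetitiveness that makes logs bulky is exactly what makes them compress well for shipping to a remote data center.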
What interests me about Sumo Logic is that automated classification story. I thought I heard Sumo Logic say:
- It’s largely unsupervised machine learning.
- It’s specific to a particular user/data set.
- It can be up and running and classifying things effectively almost instantly (i.e., on seconds’ or minutes’ worth of data).
- It’s informed by what different users tag as false positives. (Or maybe that is planned for future versions.)
I have a little trouble seeing how all those points fit exactly together, so perhaps I got some details wrong.
The payoff is that machine learning directly informs the Sumo Logic user interface. In particular, large numbers of events are bundled into a small number of categories, hopefully making it much easier for network operations types to scan the UI and pick out what’s important.
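To make the bundling idea concrete, here is one naive way such classification can work: collapse each log line into a "template" by masking its variable tokens (IP addresses, numbers, hex ids), then count how many raw events fall into each template. This sketch, including the masks and sample lines, is entirely my own invention; Sumo Logic did not describe its algorithm at this level of detail.

```python
import re
from collections import Counter

# Masks for common variable tokens, applied most-specific first.
# These patterns are illustrative assumptions, not Sumo Logic's.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<hex>"),
    (re.compile(r"\b\d+\b"), "<num>"),
]

def template(line: str) -> str:
    """Reduce a log line to its constant skeleton."""
    for pattern, token in MASKS:
        line = pattern.sub(token, line)
    return line

# Made-up sample events.
logs = [
    "accepted connection from 10.0.0.1 port 5531",
    "accepted connection from 10.0.0.7 port 5532",
    "accepted connection from 192.168.1.9 port 6020",
    "disk usage at 91 percent on /var",
    "disk usage at 92 percent on /var",
]

# Five events collapse into two categories.
categories = Counter(template(line) for line in logs)
for tmpl, count in categories.most_common():
    print(count, tmpl)
```

The point of the sketch is the payoff described above: a UI can show two category rows instead of five raw lines, and that ratio gets far more dramatic at real log volumes.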
In general, the idea of machine learning informing analytic UIs via some sort of classification is common in text-oriented technologies, notably in:
- Good ol’ text search.
- Text mining vendors’ approaches to clustering hits on words or phrases that say substantially the same thing.
But otherwise it seems kind of rare, if we stipulate that ad-serving/general internet personalization isn’t really an analytic UI — but I’d love to hear of any interesting examples I’ve overlooked.