As I observed yet again last week, much of analytics is concerned with anomaly detection, analysis, and response. I don’t think anybody understands the full consequences of that fact, but let’s start with some basics.
An anomaly, for our purposes, is a data point (or, more likely, a data aggregate) that is notably different from the trend or norm. If I may oversimplify, there are three kinds of anomalies:
- Important signals. Something is going on, and it matters. Somebody — or perhaps just an automated system — needs to know about it. Time may be of the essence.
- Unimportant signals. Something is going on, but so what?
- Pure noise. Even a fair coin flip can have long streaks of coming up “heads”.
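The "pure noise" point is easy to verify for yourself. Here's a quick simulation (the seed and flip count are arbitrary choices for illustration):

```python
import random

# Simulate 1,000 fair coin flips and measure the longest streak of
# identical outcomes in a row.
random.seed(42)  # arbitrary seed, for repeatability
flips = [random.choice("HT") for _ in range(1000)]

longest = current = 1
for prev, cur in zip(flips, flips[1:]):
    current = current + 1 if cur == prev else 1
    longest = max(longest, current)

print(longest)  # typically around 10 for 1,000 fair flips
```

A streak of ten "heads" looks like a signal; it isn't one. Any detector that alerts on streaks of that length in noise-dominated data will fire constantly.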
Two major considerations are:
- Whether the recipient of a signal can do something valuable with the information.
- How “costly” it is for the recipient to receive an unimportant signal or other false positive.
What I mean by the latter point is:
- Something that sets a cell phone buzzing had better be important to the phone’s owner personally.
- But it may be OK if something unimportant changes one small part of a busy screen display.
Anyhow, the Holy Grail* of anomaly management is a system that sends the right alerts to the right people, and never sends them wrong ones. And the quest seems about as hard as that for the Holy Grail, although this one uses more venture capital and fewer horses.
*The Holy Grail, in legend, was found by 1-3 knights: Sir Galahad (in most stories), Sir Percival (in many), and Sir Bors (in some). Leading vendors right now are perhaps around the level of Sir Kay.
Difficulties in anomaly management technology include:
- Performance is a major challenge. Ideally, you’re running statistical tests on all data — at least on all fresh data — at all times.
- User experiences are held to high standards.
- False negatives are very bad.
- False positives can be very annoying.
- Robust role-based alert selection is often needed.
- So are robust visualization and drilldown.
- Data quality problems can look like anomalies. In some cases, bad data screws up anomaly detection by causing false positives. In others, it’s just another kind of anomaly to detect.
- Anomalies are inherently surprising. We don’t know in advance what they’ll be.
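To make the performance point concrete, here is a minimal sketch of what "statistical tests on all fresh data at all times" can mean in its simplest form. The class name, window size, and three-sigma threshold are all illustrative choices of mine, not any vendor's implementation; real systems use far more sophisticated models, which is exactly why the performance and tuning challenges above arise:

```python
from collections import deque
import math

class StreamingDetector:
    """Flag points that sit far from a rolling mean: a toy
    per-point statistical test over a sliding window."""

    def __init__(self, window=100, threshold=3.0):
        self.window = deque(maxlen=window)  # recent history only
        self.threshold = threshold          # z-score cutoff (arbitrary)

    def observe(self, x):
        """Return True if x looks anomalous relative to recent history."""
        anomalous = False
        if len(self.window) >= 30:  # need enough history to estimate
            mean = sum(self.window) / len(self.window)
            var = sum((v - mean) ** 2 for v in self.window) / len(self.window)
            std = math.sqrt(var)
            if std > 0 and abs(x - mean) / std > self.threshold:
                anomalous = True
        self.window.append(x)
        return anomalous

# Usage: a mildly noisy series with one spike at the end.
det = StreamingDetector()
stream = [9.0, 11.0] * 100 + [50.0]
flags = [det.observe(x) for x in stream]
print(flags[-1])  # True: only the spike is flagged
```

Even this toy does arithmetic over a window for every incoming point; multiply by thousands of metrics and real models, and the performance challenge is evident.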
Consequences of anomalies being inherently surprising include:
- It’s hard to tune performance when one doesn’t know exactly how the system will be used.
- It’s hard to set up role-based alerting if one doesn’t know exactly what kinds of alerts there will be.
- It’s hard to choose models for the machine learning part of the system.
Donald Rumsfeld’s distinction between “known unknowns” and “unknown unknowns” is relevant here, although it feels wrong to mention Rumsfeld and Sir Galahad in the same post.
And so a reasonable summary of my views might be:
Anomaly management is an important and difficult problem. So far, vendors have done a questionable job of solving it.
But there’s a lot of activity, which I look forward to writing about in considerable detail.
- The most directly relevant companies I’ve written about are probably Rocana and Splunk.