October 10, 2016

Then felt I like some watcher of the skies

When a new planet swims into his ken

— John Keats, “On First Looking Into Chapman’s Homer”

1. In June I wrote about why anomaly management is hard. Well, not only is it hard to do; it’s hard to talk about as well. One reason, I think, is that it’s hard to define what an anomaly is. And that’s a structural problem, not just a semantic one — if something is well enough understood to be easily described, then how much of an anomaly is it after all?

Artificial intelligence is famously hard to define for similar reasons.

“Anomaly management” and similar terms are not yet in the software marketing mainstream, and may never be. But naming aside, the actual subject matter is important.

2. Anomaly analysis is clearly at the heart of several sectors, including:

IT operations

Factory and other physical-plant operations

Security

Anti-fraud

Anti-terrorism

Each of those areas features one or both of these frameworks:

Surprises are likely to be bad.

Coincidences are likely to be suspicious.

So if you want to identify, understand, avert and/or remediate bad stuff, data anomalies are the first place to look.

3. The “insights” promised by many analytics vendors — especially those who sell to marketing departments — are also often heralded by anomalies. Already in the 1970s, Walmart observed that red clothing sold particularly well in Omaha, while orange flew off the shelves in Syracuse. And so, in large college towns, they stocked their stores to the gills with clothing in the colors of the local football team. They also noticed that fancy dresses for little girls sold especially well in Hispanic communities … specifically for girls at the age of First Communion.

4. The examples in the previous point may be characterized as noteworthy correlations that surely reflect actual causality. (The beer/diapers story would be another example, if only it were true.) Formally, the same is probably true of most actionable anomalies. So “anomalies” are fairly similar to, or at least overlap heavily with, “statistically surprising observations”.

And I do mean “statistically”. As per my Keats quote above, we have a classical model of sudden-shock discovery — an astronomer finding a new planet, a radar operator seeing a blip on a screen, etc. But Keats’ poem is 200 years old this month. In this century, there’s a lot more number-crunching involved.

Please note: It is certainly not the case that anomalies are necessarily found via statistical techniques. But however they’re actually found, they would at least in theory score as positives via various statistical tests.
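To make “score as positives via various statistical tests” a bit more concrete, here is a minimal sketch of one such test, a plain z-score against a baseline. The function names, the threshold of 3 and the sales-share numbers are all made up for illustration; real anomaly detection would usually need something more robust than this.

```python
from statistics import mean, stdev

def z_score(baseline, observed):
    """How many sample standard deviations `observed` lies from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has no variation; z-score is undefined")
    return (observed - mu) / sigma

def is_surprising(baseline, observed, threshold=3.0):
    """Flag an observation whose z-score magnitude reaches the (assumed) threshold."""
    return abs(z_score(baseline, observed)) >= threshold

# Hypothetical data: share of clothing sales that are red, in seven comparison stores.
baseline_share = [0.10, 0.11, 0.09, 0.10, 0.12, 0.08, 0.10]

print(is_surprising(baseline_share, 0.25))  # a store far above the baseline
print(is_surprising(baseline_share, 0.11))  # a store within normal variation
```

The point is only that an observation a human would call surprising also stands out numerically, however it was first noticed.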

5. There are quite a few steps to the anomaly-surfacing process, including but not limited to:

Collecting the raw data in a timely manner.

Identifying candidate signals (and differentiating them from noise).

Communicating surprising signals to the most eager consumers (and letting them do their own analysis).

Giving more tightly-curated information to a broader audience.

Hence many different kinds of vendor can have roles to play.

6. One vendor that has influenced my thinking about data anomalies is Nestlogic, an early-stage start-up with which I’m heavily involved. Here “heavily involved” includes:

I own more stock in Nestlogic than I have in any other company of which I wasn’t the principal founder.

I’m in close contact with founder/CEO David Gruzman.

I’ve personally written much of Nestlogic’s website content.

Nestlogic’s claims include:

For machine-generated data, anomalies are likely to be found in data segments, not individual records. (Here a “segment” might be all the data coming from a particular set of sources in a particular period of time.)

The more general your approach to anomaly detection, the better, for at least three reasons:

In adversarial use cases, the hacker/fraudster/terrorist/whatever might deliberately deviate from previous patterns, so as to evade detection by previously-established filters.

When there are multiple things to discover, one anomaly can mask another, until it is detected and adjusted for.

(This point isn’t specific to anomaly management.) More general tools can mean that an enterprise has fewer different new tools to adopt.

Anomalies boil down to surprising data profiles, so anomaly detection bears a slight resemblance to the data profiling approaches used in data quality, data integration and query optimization.

Different anomaly management users need very different kinds of UI. Less technical ones may want clear, simple alerts, with a minimum of false positives. Others may use anomaly management as a jumping-off point for investigative analytics and/or human real-time operational control.

I find these claims persuasive enough to help Nestlogic with its marketing and fund-raising, and to cite them in my post here. Still, please understand that they are Nestlogic’s and David’s assertions, not my own.
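As a rough illustration of the “segments, not individual records” idea, the sketch below compares the value profile of a whole segment against a baseline profile. This is not Nestlogic’s method; the distance measure (total variation distance), the function names and the sample data are all assumptions chosen for simplicity.

```python
from collections import Counter

def profile(records):
    """Relative frequency of each distinct value in a segment of records."""
    if not records:
        raise ValueError("cannot profile an empty segment")
    counts = Counter(records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

def total_variation(p, q):
    """Total variation distance between two profiles (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Hypothetical machine-generated data: every record is an ordinary-looking status value.
baseline = profile(["ok"] * 90 + ["retry"] * 10)
segment_a = ["ok"] * 88 + ["retry"] * 12   # mix close to the baseline
segment_b = ["ok"] * 40 + ["retry"] * 60   # same values, very different mix

print(total_variation(baseline, profile(segment_a)))
print(total_variation(baseline, profile(segment_b)))
```

No single record in `segment_b` is unusual, since “retry” is a perfectly normal value. Only the segment’s profile as a whole is surprising, which is also why this resembles data profiling.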

