Over the past week, discussion has exploded about US government surveillance. After summarizing, as best I could, what data the government appears to collect, I’d now like to consider what they actually do with it. More precisely, I’d like to focus on the data’s use(s) in combating US-soil terrorism. In a nutshell:
- Reporting is persuasive that electronic surveillance data is helpful in following up on leads and tips obtained by other means.
- Reporting is not persuasive that electronic surveillance data on its own uncovers or averts many terrorist plots.
- With limited exceptions, neither evidence nor logic suggests that data mining or predictive modeling does much to prevent domestic terrorist attacks.
Consider the example of Tamerlan Tsarnaev:
In response to this 2011 request, the FBI checked U.S. government databases and other information to look for such things as derogatory telephone communications, possible use of online sites associated with the promotion of radical activity, associations with other persons of interest, travel history and plans, and education history.
While that response was unsuccessful in preventing a dramatic act of terrorism, at least they tried.
As for actual success stories — well, that’s a bit tough. In general, there are few known examples of terrorist plots being disrupted by law enforcement in the United States, except for fake plots engineered to draw terrorist-leaning individuals into committing actual crimes. One of those examples, that of Najibullah Zazi, was indeed based on an intercepted email — but the email address itself was uncovered through more ordinary anti-terrorism efforts.
As for machine learning/data mining/predictive modeling, I’ve never seen much of a hint of it being used in anti-terrorism efforts, whether in the news or in my own discussions inside the tech industry. And I think there’s a great reason for that — what would they use for a training set? Here’s what I mean.
Unless the jargon is being misused — which of course happens all too often — data mining works like this:
- Data sets are collected in which outcomes are matched to (vectors of) other (independent) variables. These are called training sets.
- Analytic software is run, with the training sets as inputs and algorithms as outputs. This is called training the model. The output algorithms purport to estimate which vectors of independent variables are likely to be associated with which outcomes.
Yes, I’m saying that predictive modeling software, used at the modeling stage — as opposed to the model scoring/execution stage — has algorithms as output. Depending on details, that’s either literally true or else just true in effect.
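To make the modeling-stage/scoring-stage distinction concrete, here is a minimal sketch in Python. The trainer, the threshold rule, and the toy data are all invented for illustration; the point is only that “training” returns an algorithm (here literally a function), which is then applied separately at scoring time.

```python
# Sketch: "training" produces an algorithm; "scoring" is applying it.

def train(training_set):
    """Toy trainer: learn a threshold separating outcomes 0 and 1.
    Each record is (feature_value, outcome)."""
    ones = [x for x, y in training_set if y == 1]
    zeros = [x for x, y in training_set if y == 0]
    threshold = (min(ones) + max(zeros)) / 2  # midpoint rule, invented here

    def model(x):
        # This returned function is the "output algorithm"
        return 1 if x >= threshold else 0
    return model

model = train([(1, 0), (2, 0), (8, 1), (9, 1)])
print(model(7))  # scoring/execution stage: apply the learned algorithm
```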
For example, in the simplest case, namely a linear regression:
- The outcome is an event such as a product sale (desirable) or equipment failure (to be avoided).
- The algorithm is a weighted sum of the other variables, whose value is interpreted as the probability of that outcome.
- The algorithm discovery process simply boils down to calculating the coefficients in the weighted sum.
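In the one-predictor case, that coefficient calculation is just a couple of sums. The sketch below (toy numbers, stdlib only) shows that the entire “training” step reduces to computing the intercept and slope of the weighted sum:

```python
# One-predictor linear regression: "training" is literally
# computing the two coefficients of the weighted sum.

def fit_linear(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return intercept, slope  # the "algorithm" is just these two numbers

b0, b1 = fit_linear([1, 2, 3, 4], [0.1, 0.3, 0.5, 0.7])
print(round(b0, 3), round(b1, 3))  # → -0.1 0.2
```

Scoring a new record is then just evaluating `b0 + b1 * x`.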
When data mining and predictive modeling get a little more complicated than that, we still call them “statistical analysis”; when they get much more complicated, the name “machine learning” is commonly used instead.
And so my views on the application of predictive modeling to domestic US anti-terrorism start:
- In most cases, there aren’t enough examples to train models to help predict or avert terror attacks.
- Presumably not coincidentally, while I’ve heard of many query and visualization techniques — notably graph analytics — I haven’t heard of predictive modeling applied directly to anti-terrorism.
- There’s one big exception to this rule:
  - Surveillance-based anti-terrorism efforts depend heavily on natural language processing …
  - … and natural language processing depends heavily on machine learning.
Perhaps there are other examples similar to the natural language one, but nothing is currently coming to mind.
Note that not all these arguments apply to all parts of the world. For example, there have been enough roadside IEDs (Improvised Explosive Devices) in Iraq and Afghanistan that looking for unusual communication patterns associated with them might bear fruit. But when it comes to fending off terrorist attacks on US soil, I believe the main use of surveillance data is for straightforward query and data visualization based on the best educated guesses of smart human analysts.