Edit: Please see the comment thread below for updates. Please also see a follow-on post about how the surveillance data is actually used.
US government surveillance has exploded into public consciousness since last Thursday. With one major exception, the news has just confirmed what was already thought or known. So where do we stand?
My views about domestic data collection start:
- I’ve long believed that the Feds — specifically the NSA (National Security Agency) — are storing metadata/traffic data on every telephone call and email in the US. The recent news, for example Senator Feinstein’s responses to the Verizon disclosure, just confirms it. That the Feds sometimes claim this has to be “foreign” data or they won’t look at it hardly undermines my opinion.
- Even private enterprises can more or less straightforwardly buy information about every credit card purchase we make. So of course the Feds can get that as well, as the Wall Street Journal seems to have noticed. More generally, I’d assume the Feds have all the financial data they want, via the IRS if nothing else.
- Similarly, many kinds of social media postings are aggregated for anybody to purchase, or can be scraped by anybody who invests in the equipment and bandwidth. Attensity’s service is just one example.
- I’m guessing that web use data (http requests, search terms, etc.) is not yet routinely harvested by the US government.* Ditto deanonymization of same. I guess that way basically because I’ve heard few rumblings to the contrary. Further, the consumer psychographic profiles that are so valuable to online retailers might be of little help to national security analysts anyway.
- Video surveillance seems likely to grow, from fixed cameras perhaps to drones; note for example the various officials who called for more public cameras after the Boston Marathon bombing. But for the present discussion, that’s of lesser concern to me, simply because it’s done less secretively than other kinds of surveillance. If there’s a camera that can see us, often we can see it too.
*Recall that these comments are US-specific. Data retention legislation has been proposed or passed in multiple countries to require recording of, among other things, all URL requests, with the stated goal of fighting either digital piracy or child pornography.
As for foreign data:
- Last I heard, we were collecting at least 10s of petabytes of satellite images per day. That’s probably too much even for the US government to persist in its entirety at this time. In the installation I heard of, most of the satellite data was deleted within 12-48 hours. But it may fit into the yottabyte-scale data center in Utah.
- I also once heard the US monitors every radio transmission detectable from North Korea.
Beyond that, use your imagination.
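To put the satellite figures in proportion, here is a rough sketch of the arithmetic. The specific values are assumptions chosen from the ranges above: 30 PB/day stands in for "10s of petabytes", and the Utah facility is taken as exactly one yottabyte, which is an illustrative reading of "yottabyte-scale" rather than a confirmed capacity.

```python
# Rough scale comparison for the satellite-image estimate.
# Every specific number here is an illustrative assumption.

PB = 10**15  # bytes in a petabyte (decimal)
YB = 10**24  # bytes in a yottabyte (decimal)

satellite_pb_per_day = 30  # assumed: "10s of petabytes" per day

# One yottabyte expressed in petabytes.
capacity_pb = YB // PB

# How long continuous collection would take to fill that capacity.
days_to_fill = capacity_pb / satellite_pb_per_day
years_to_fill = days_to_fill / 365

# Data on hand at any moment if everything is deleted after 48 hours.
rolling_window_pb = satellite_pb_per_day * 2

print(f"Capacity: {capacity_pb:,} PB")
print(f"Years to fill at {satellite_pb_per_day} PB/day: {years_to_fill:,.0f}")
print(f"Held under a 48-hour retention window: {rolling_window_pb} PB")
```

If those assumptions are anywhere near right, a yottabyte is about a billion petabytes, so the retention limit in the installation I heard of would be a matter of local capacity rather than of what a facility on that scale could hold.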
The big question is how much domestic or quasi-domestic communications-content data the US government currently captures. I think it’s a lot more than we previously acknowledged. For example:
- Both Edward Snowden and William Binney have said things that sound like the NSA is comprehensively storing actual communications content. I guess it’s possible that in each case they misspoke.
- Other claims to that effect have been more ringing. For example:
  - The secret AT&T room/message splitter story dates back to 2007.
  - The FBI itself states that in 2011 it “checked U.S. government databases and other information to look for such things as derogatory telephone communications, possible use of online sites associated with the promotion of radical activity.”
- Much of the PRISM project seems to be about access to communication or file contents.
- The most visible, emphatic denials — e.g. those from President Obama or various tech companies — seem to leave weasel room if one parses them carefully.
And cost is not a barrier. I would guess the order of magnitude* for all email in the US at 10 petabytes/day uncompressed (100s of billions of messages at 10s of KB per message). Phone call volumes are probably less (fewer than 10 billion calls per day). The Feds can afford to store that. Hadoop or NoSQL clusters, for example, can be set up for low six figures per petabyte.** HP Vertica will sell anybody an RDBMS cluster (hardware and software) for around $2 million/petabyte.**
*In the most literal high-school-chemistry sense of the phrase.
**Of raw data; particularly compressible data might be managed yet more cheaply.
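The email estimate and price points above can be checked with a back-of-envelope sketch. The specific inputs are assumptions picked from the stated ranges: 300 billion messages/day for "100s of billions", 30 KB for "10s of KB", and $200,000 per petabyte for "low six figures". The function name `yearly_cost_usd` is purely illustrative.

```python
# Back-of-envelope check of the email storage estimate.
# All specific values are illustrative assumptions, not measured figures.

KB = 10**3
PB = 10**15

messages_per_day = 300 * 10**9  # assumed: "100s of billions" of messages
bytes_per_message = 30 * KB     # assumed: "10s of KB" per message

email_pb_per_day = messages_per_day * bytes_per_message / PB

# Assumed price points: "low six figures" taken as $200,000 per petabyte;
# the RDBMS figure is the roughly $2 million per petabyte quoted above.
hadoop_usd_per_pb = 200_000
rdbms_usd_per_pb = 2_000_000


def yearly_cost_usd(pb_per_day: float, usd_per_pb: float) -> float:
    """Cost of storing a year's worth of raw data, with no compression or deletion."""
    return pb_per_day * 365 * usd_per_pb


hadoop_yearly = yearly_cost_usd(email_pb_per_day, hadoop_usd_per_pb)
rdbms_yearly = yearly_cost_usd(email_pb_per_day, rdbms_usd_per_pb)

print(f"Email volume: {email_pb_per_day:.0f} PB/day")
print(f"Hadoop/NoSQL: ${hadoop_yearly / 1e9:.2f} billion per year")
print(f"RDBMS:        ${rdbms_yearly / 1e9:.2f} billion per year")
```

With these inputs the volume comes out near 9 PB/day, consistent with a 10 PB/day order of magnitude, and a year of raw email lands somewhere between the high hundreds of millions and the mid single-digit billions of dollars, before any compression.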
Coverage of all this has of course been intense. In particular:
- Glenn Greenwald has the big Snowden scoops.
- Matthew Ingram offers an amazing overview of the revelations and discussion.
- Michael Arrington has unleashed a couple of polemics.
And my views can be summarized much as I did three years ago:
- It is inevitable* that governments and other constituencies will obtain huge amounts of information, which can be used to drastically restrict everybody’s privacy and freedom.
- To protect against this grave threat, multiple layers of defense are needed, technical and legal/regulatory/social/political alike.
- One particular layer is getting insufficient attention, namely restrictions upon the use (as opposed to the acquisition or retention) of data.
*And indeed in many ways even desirable.