Liberty and privacy
Discussion of issues related to liberty and privacy, and especially how they are affected by and interrelated with data management and analytic technologies. Related subjects include:
Over the past week, discussion has exploded about US government surveillance. After summarizing, as best I could, what data the government appears to collect, now I'd like to consider what they actually do with it. More precisely, I'd like to focus on the data's use(s) in combating US-soil terrorism. In a nutshell:
- Reporting is persuasive that electronic surveillance data is helpful in following up on leads and tips obtained by other means.
- Reporting is not persuasive that electronic surveillance data on its own uncovers or averts many terrorist plots.
- With limited exceptions, neither evidence nor logic suggests that data mining or predictive modeling does much to prevent domestic terrorist attacks.
Consider the example of Tamerlan Tsarnaev:
In response to this 2011 request, the FBI checked U.S. government databases and other information to look for such things as derogatory telephone communications, possible use of online sites associated with the promotion of radical activity, associations with other persons of interest, travel history and plans, and education history.
While that response was unsuccessful in preventing a dramatic act of terrorism, at least they tried.
As for actual success stories — well, that’s a bit tough. In general, there are few known examples of terrorist plots being disrupted by law enforcement in the United States, except for fake plots engineered to draw terrorist-leaning individuals into committing actual crimes. One of those examples, that of Najibullah Zazi, was indeed based on an intercepted email — but the email address itself was uncovered through more ordinary anti-terrorism efforts.
As for machine learning/data mining/predictive modeling, I’ve never seen much of a hint of it being used in anti-terrorism efforts, whether in the news or in my own discussions inside the tech industry. And I think there’s a great reason for that — what would they use for a training set? Here’s what I mean. Read more
Edit: Please see the comment thread below for updates. Please also see a follow-on post about how the surveillance data is actually used.
US government surveillance has exploded into public consciousness since last Thursday. With one major exception, the news has just confirmed what was already thought or known. So where do we stand?
My views about domestic data collection start:
- I’ve long believed that the Feds — specifically the NSA (National Security Agency) — are storing metadata/traffic data on every telephone call and email in the US. The recent news, for example Senator Feinstein’s responses to the Verizon disclosure, just confirms it. That the Feds sometimes claim this has to be “foreign” data or they won’t look at it hardly undermines my opinion.
- Even private enterprises can more or less straightforwardly buy information about every credit card purchase we make. So of course the Feds can get that as well, as the Wall Street Journal seems to have noticed. More generally, I’d assume the Feds have all the financial data they want, via the IRS if nothing else.
- Similarly, many kinds of social media postings are aggregated for anybody to purchase, or can be scraped by anybody who invests in the equipment and bandwidth. Attensity’s service is just one example.
- I’m guessing that web use data (http requests, search terms, etc.) is not yet routinely harvested by the US government.* Ditto deanonymization of same. I guess that way basically because I’ve heard few rumblings to the contrary. Further, the consumer psychographic profiles that are so valuable to online retailers might be of little help to national security analysts anyway.
- Video surveillance seems likely to grow, from fixed cameras perhaps to drones; note for example the various officials who called for more public cameras after the Boston Marathon bombing. But for the present discussion, that’s of lesser concern to me, simply because it’s done less secretively than other kinds of surveillance. If there’s a camera that can see us, often we can see it too.
*Recall that these comments are US-specific. Data retention legislation has been proposed or passed in multiple countries to require recording of, among other things, all URL requests, with the stated goal of fighting either digital piracy or child pornography.
As for foreign data: Read more
2. Numerous vendors are blending SQL and JSON management in their short-request DBMS. It will take some more work for me to have a strong opinion about the merits/demerits of various alternatives.
The default implementation — one example would be Clustrix’s — is to stick the JSON into something like a BLOB/CLOB field (Binary/Character Large Object), index on individual values, and treat those indexes just like any others for the purpose of SQL statements. Drawbacks include:
- You have to store or retrieve the JSON in whole documents at a time.
- If you are spectacularly careless, you could write JOINs with odd results.
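The "JSON in a large-object field, plus indexes on individual values" pattern described above can be sketched in a few lines. This is a toy illustration only, using SQLite's JSON functions via Python's standard library (it assumes a SQLite build that includes the JSON1 functions, which is typical of modern Python distributions); it is not a claim about how Clustrix or DB2 actually implement the feature.

```python
import json
import sqlite3

# Store each JSON document whole in a text column -- the BLOB/CLOB-style
# approach described above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")

# Index one extracted value, so SQL statements can use it like any
# other index.
conn.execute("CREATE INDEX idx_city ON docs (json_extract(body, '$.city'))")

docs = [
    {"name": "Alice", "city": "Boston"},
    {"name": "Bob", "city": "Chicago"},
]
conn.executemany(
    "INSERT INTO docs (body) VALUES (?)",
    [(json.dumps(d),) for d in docs],
)

# The filter can use the expression index, but note the first drawback
# above: the document comes back whole, as one string.
rows = conn.execute(
    "SELECT body FROM docs WHERE json_extract(body, '$.city') = ?",
    ("Boston",),
).fetchall()
print(rows[0][0])  # the entire JSON document, not just one field
```

The second drawback shows up the same way: joining on `json_extract(...)` values is just a join on opaque expressions, so nothing stops a careless query from joining fields that merely happen to share values.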
IBM DB2 is one recent arrival to the JSON party. Unfortunately, I forgot to ask whether IBM’s JSON implementation was based on IBM DB2 pureXML when I had the chance, and IBM hasn’t gotten around to answering my followup query.
3. Nor has IBM gotten around to answering my followup queries on the subject of BLU, an interesting-sounding columnar option for DB2.
4. Numerous clients have asked me whether they should be active in DBaaS (DataBase as a Service). After all, Amazon, Google, Microsoft, Rackspace and salesforce.com are all in that business in some form, and other big companies have dipped toes in as well. Read more
I’m usually annoyed by lists of year-end predictions. Still, a reporter asked me for some, and I found one kind I was comfortable making.
Trends that I think will continue in 2013 include:
- Growing attention to machine-generated data. Human-generated data grows at the rate business activity does, plus 0-25%. Machine-generated data grows at the rate of Moore’s Law, also plus 0-25%, which is a much higher total. In particular, the use of remote machine-generated data is becoming increasingly real.
- Hadoop adoption. Everybody has the big bit bucket use case, largely because of machine-generated data. Even today’s technology is plenty good enough for that purpose, and hence justifies initial Hadoop adoption. Development of further Hadoop technology, which I post about frequently, is rapid. And so the Hadoop trend is very real.
- Application SaaS. The on-premises application software industry has hopeless problems with product complexity and rigidity. Any suite new enough to cut the Gordian Knot is or will be SaaS (Software as a Service).
- Newer BI interfaces. Advanced visualization — e.g. Tableau or QlikView — and mobile BI are both hot. So, more speculatively, are “social” BI (Business Intelligence) interfaces.
- Price discounts. If you buy software at 50% of list price, you’re probably doing it wrong. Even 25% can be too high.
- MySQL alternatives. NoSQL and NewSQL products often are developed as MySQL alternatives. Oracle has actually done a good job on MySQL technology, but now its business practices are scaring companies away from MySQL commitments, and newer short-request SQL DBMS are ready for use.
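The machine-generated vs. human-generated growth comparison above can be made concrete with some quick arithmetic. The numbers here are illustrative assumptions only: business activity growing 5% per year, Moore's Law doubling every two years (about 41% per year), and the full 25-point kicker added to each.

```python
# Assumed, illustrative rates -- not figures from the post itself.
years = 5
business_rate = 0.05               # assumed business-activity growth
moore_rate = 2 ** 0.5 - 1          # ~0.414/year, i.e. doubling every two years
kicker = 0.25                      # the "plus 0-25%", taken at its maximum

human_multiple = (1 + business_rate + kicker) ** years
machine_multiple = (1 + moore_rate + kicker) ** years

print(f"human-generated data:   {human_multiple:.1f}x over {years} years")
print(f"machine-generated data: {machine_multiple:.1f}x over {years} years")
```

Even over just five years, compounding at the Moore's-Law-based rate yields a data volume several times larger than compounding at the business-activity-based rate, which is the point of the prediction.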
What are the central challenges in internet system design? We probably all have similar lists, with issues such as scale, scale-out, throughput, availability, security, programming ease, UI, and general cost-effectiveness. Screw those up, and you don’t have an internet business.
Much new technology addresses those challenges, with considerable success. But the success is usually one silo at a time — a short-request application here, an analytic database there. When it comes to integration, unsolved problems abound.
The top integration and integration-like challenges for me, from a practical standpoint, are:
- Integrating silos — a decades-old problem still with us in a big way.
- Dynamic schemas with joins.
- Low-latency business intelligence.
- Human real-time personalization.
Other concerns that get mentioned include:
- Geographical distribution of data to comply with privacy laws, a major requirement for some users.
- Logical data warehouse, a term that doesn’t actually mean anything real.
- In-memory data grids, which some day may no longer always be hand-coupled to the application and data stacks they accelerate.
Let’s skip those latter issues for now, focusing instead on the first four.
An Atlantic article suggests that the digital advertising industry is coalescing around the position “restrict data use if you must, but go easy on data collection and retention.”
There is a fascinating scrum over what “Do Not Track” tools should do and what orders websites will have to respect from users. The Digital Advertising Alliance (of which the NAI is a part), the Federal Trade Commission, W3C, the Internet Advertising Bureau (also part of the DAA), and privacy researchers at academic institutions are all involved. In November, the DAA put out a new set of principles that contain some good ideas like the prohibition of “collection, use or transfer of Internet surfing data across Websites for determination of a consumer’s eligibility for employment, credit standing, healthcare treatment and insurance.”
This week, the White House seemed to side with privacy advocates who want to limit collection, not just uses. Its Consumer Privacy Bill of Rights pushes companies to allow users to “exercise control over what personal data companies collect from them and how they use it.” The DAA heralded its own participation in the White House process, though even it noted this is the beginning of a long journey.
There has been a clear and real philosophical difference between the advertisers and regulators representing web users. On the one hand, as Stanford privacy researcher Jonathan Mayer put it, “Many stakeholders on online privacy, including U.S. and EU regulators, have repeatedly emphasized that effective consumer control necessitates restrictions on the collection of information, not just prohibitions on specific uses of information.” But advertisers want to keep collecting as much data as they can as long as they promise not to use it to target advertising. That’s why the NAI opt-out program works like it does.
That’s a drum I’ve been beating for years, so to a first approximation I’m pleased. However:
- I don’t think currently proposed protections go nearly far enough, for reasons I previously stated plus others that keep coming to me. (For example, substantially all consumer privacy protections could be nuked simply by user agreements that compel you to “voluntarily” renounce most privacy rights in return for unfettered use of the internet.)
- If current trends are followed, it could end up that data use restrictions are too mild and data collection restrictions are too severe — and maybe that will all work out in a rough balance, at least for a while.
- In the not-so-near term, however, these rough political compromises may not work so well. That’s why I think next-generation digital advertising ecosystem design should start yesterday, or perhaps sooner.
So to sum up my views on consumer privacy:
- Focusing on data use is basically good.
- It is important to also focus on data collection, at least for a transitional period.
- For the whole thing to work out well, a major rethinking of systems is needed.
There’s a growing consensus that consumers require limits on the predictive modeling that is done about them. That’s a theme of the Obama Administration’s recent work on consumer data privacy; it’s central to other countries’ data retention regulations; and it’s specifically borne out by the recent Target-pursues-pregnant-women example. Whatever happens legally, I believe this also calls for a technical response, namely:
Consumers should be shown key factual and psychographic aspects of how they are modeled, and be given the chance to insist that marketers disregard any or all of those aspects.
I further believe that the resulting technology should be extended so that
information holders can collaborate by exchanging estimates for such key factors, rather than exchanging the underlying data itself.
To some extent this happens today, for example with attribution/de-anonymization or with credit scores; but I think it should be taken to another level of granularity.
My name for all this is translucent modeling, rather than “transparent”, the idea being that key points must be visible, but the finer details can be safely obscured.
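The translucent-modeling idea above can be sketched as a small data structure: the factor names and the consumer's right to veto them are visible, while the model's internal workings stay obscured. This is a toy sketch of the proposal, not an existing system; every name in it is hypothetical.

```python
class TranslucentProfile:
    """Toy sketch of 'translucent modeling': key modeled aspects are
    disclosed and vetoable, while finer model details stay hidden."""

    def __init__(self, factors):
        # factors: {factor_name: estimated_value}, e.g. a marketer's
        # psychographic estimates about one consumer (hypothetical names).
        self._factors = dict(factors)
        self._vetoed = set()

    def disclosed_factors(self):
        """The key aspects the consumer is entitled to see."""
        return sorted(self._factors)

    def veto(self, factor_name):
        """The consumer insists this aspect be disregarded."""
        self._vetoed.add(factor_name)

    def usable_factors(self):
        """What the marketer may actually act on after vetoes."""
        return {k: v for k, v in self._factors.items()
                if k not in self._vetoed}


# A consumer sees she is modeled as likely pregnant and opts that out.
profile = TranslucentProfile({"likely_pregnant": 0.9, "price_sensitive": 0.4})
profile.veto("likely_pregnant")
print(profile.usable_factors())
```

Exchanging `usable_factors()`-style estimates between information holders, rather than the underlying transaction data, is the collaboration mode suggested above, taken to a finer granularity than today's credit scores.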
Examples of dialog I think marketers should have with consumers include: Read more
Charles Duhigg of the New York Times wrote a very interesting article, based on a forthcoming book of his, on two related subjects:
- The force of habit on our lives, and how we can/do deal with it. (That’s the fascinating part.)
- A specific case of predictive modeling. (That’s the part that’s getting all the attention. It’s interesting too.)
The predictive modeling part is that Target determined:
- People only change their shopping habits occasionally
- One of those occasions is when they get pregnant
- Hence, it would be a Really Good Idea to market aggressively to pregnant women
and then built a marketing strategy around early indicators of a woman’s pregnancy. Read more
The Obama Administration recently released a position paper on consumer data privacy. I have mixed feelings about it.
The document admirably says:
- Internet-related regulation should be informal, so as to maintain flexibility in the face of technological change (and, less clearly stated, government technological ignorance).
- Consumers should be given opt-ins and opt-outs regarding data retention, which should have good, clear user interfaces.
- If you don’t have good data security, then you’re not doing a good job of protecting privacy.
But it says less than it seems to about protecting citizens from privacy invasion by businesses. And it says nothing at all about protecting citizens from privacy invasion by government, a subject its first footnote declares beyond the document’s scope. On the whole, I think the document does much less than what is needed.
The core of the paper is a “Consumer Privacy Bill of Rights”, with seven provisions. Here goes: Read more
Last month, I reviewed with the Aster Data folks which markets they were targeting and selling into, subsequent to acquisition by their new orange overlords. The answers aren’t what they used to be. Aster no longer focuses much on what it used to call frontline (i.e., low-latency, operational) applications; those are of course a key strength for Teradata. Rather, Aster focuses on investigative analytics — they’ve long endorsed my use of the term — and on the batch run/scoring kinds of applications that inform operational systems.