Discussion of public policy around technological issues, especially but not only surveillance and privacy.
I made a remarkably rumpled video appearance yesterday with SiliconAngle honchos John Furrier and Dave Vellante. (Excuses include <3 hours sleep, and then a scrambling reaction to a schedule change.) Topics covered included, with approximate timechecks:
- 0:00 Introductory pabulum, and some technical difficulties
- 2:00 More introduction
- 3:00 Dynamic schemas and data model churn
- 6:00 Surveillance and privacy
- 13:00 Hadoop, especially the distro wars
- 22:00 BI innovation
- 23:30 More on dynamic schemas and data model churn
Edit: Some of my remarks were transcribed.
- I posted on dynamic schemas and data model churn a few days ago.
- I capped off a series on privacy and surveillance a few days ago.
- I commented on various Hadoop distributions in June.
|Categories: Business intelligence, ClearStory Data, Data warehousing, Hadoop, MapR, MapReduce, Surveillance and privacy||Leave a Comment|
I’ve been harping on the grave dangers of surveillance and privacy intrusion. Clearly, something must be done to rein them in. But what?
Well, let’s look at an older and better-understood subject — governmental use of force. Governments, by their very nature, possess tools for tyranny: armies, police forces, and so on. So how do we avoid tyranny? We limit what government is allowed to do with those tools, and we teach our citizens — especially those who serve in government — to obey and enforce the limits.
Those limits can be lumped into two categories:
- Direct — there are very strong controls as to when and how the government may use force.
- Indirect — there are also controls on how the government can even threaten the use of force. I.e., substantially all laws are ultimately backed up by the threat of governmental force — and there are limits as to which laws may or may not be enacted.
The story is similar for surveillance technology:
- As data gathering and analysis technologies skyrocket in power, they become ever more powerful tools for tyranny.
- Direct controls are called for: there is some surveillance the government is not, and should not be, allowed to do.
- Indirect controls are also necessary — even when it has information, there are ways in which the government should not be allowed to use it.
But there’s a big difference between the cases of physical force and surveillance.
- The direct controls on the use of force are strong; under ordinary circumstances, government is NOT allowed to just go out and shoot somebody.
- The direct controls on surveillance, however, are very weak; government has access to all kinds of information.
I’ve worried for years about a terrible and under-appreciated danger of privacy intrusion, which in a recent post I characterized as a chilling effect upon the exercise of ordinary freedoms. When government — or an organization such as your employer, your insurer, etc. — watches you closely, it can be dangerous to deviate from the norm. Even the slightest non-conformity could have serious consequences. I wish that were an exaggeration; let’s explore why it isn’t.
Possible difficulties — most of them a little bit futuristic — include:
- Being perceived as a potential terrorist or terrorist sympathizer. That’s a biggie, of course, at least in “free” countries. Even getting on the No-Fly List is enough to pretty much shut down your travel, and hence your options in many careers. If you want to avoid such problems, it might be prudent not to:
- Visit certain websites.
- Email, telephone, or otherwise communicate with certain people.
- Use certain words or phrases in email or on the telephone.
- Being regarded as too vehement a political dissenter in general. Political dissent is deadly dangerous in too many countries around the world, and has costs even in “free” countries. (Jacob Appelbaum is one recent US example.) To avoid such problems, there are a whole lot of things you might think twice about writing, saying, or doing, and certain people it’s definitely risky to associate with or write nice things about.
- Not being regarded as a probable loyal, hard-working, accepting employee. Think about the difficulties “over-qualified” candidates have getting hired. Then consider what might happen if employers had (accurate or otherwise) psychographic profiles estimating who was most likely to stay at a job, to accept boring job tasks or long hours, or to tolerate sub-standard pay. Then consider how wise it might be to show interest in, for example:
- Other careers.
- Certain hobbies that might be construed as leading to other careers.
- Living in other parts of the country.
- Being perceived as likely to engage in socially-unapproved sexual behavior. In the United States, certain sexual choices, even among consenting adults, could cause problems with discrimination, child custody, or divorce. Elsewhere, your choice of partner could lead to prison or even death. (I don’t know exactly which shopping choices could get one identified as a possible homosexual or philanderer … but just to be on the safe side, you might not want to download any Barbra Streisand songs. :))
- Being regarded as a poor health or safety risk for employment, insurance, or more. Do you like fatty foods? Extreme sports? Night clubs? Recreational drugs? Tobacco? More than a little alcohol? Fast cars? Fast women? Evidence of any of those tastes could move you up the risk charts for heart attack, accident, marital dissolution or some other outcome that an employer or insurer wouldn’t like.
This is the second of a two-part series on the theory of information privacy. In the first post, I review the theory to date, and outline what I regard as a huge and crucial gap. In the second post, I try to fill that chasm.
The first post in this two-part series:
- Reviewed the privacy theory of the past 123 years.
- Declared it inadequate to address today’s surveillance and information privacy issues.
- Suggested a reason for its failure — the harms of privacy violation are too rarely spelled out in concrete terms, making it impractical to do even implicit cost-benefit analyses.
Actually, it’s easy to name specific harms from privacy loss. A list might start:
- Being investigated (rightly or wrongly) for a crime, with all the hassle and legal risk that ensues.
- Being discriminated against for employment, credit, or insurance.
- Being embarrassed publicly, or discriminated against socially.
- Being bullied or stalked by deplorable private-citizen acquaintances.
- Being put on the no-fly list.
I expect that few people in, say, the United States will suffer these harms in the near future, at least the more severe ones. However, the story gets worse, because we don’t know which disclosures will have which adverse effects. For example, Read more
This is the first of a two-part series on the theory of information privacy. In the first post, I review the theory to date, and outline what I regard as a huge and crucial gap. In the second post, I try to fill that chasm.
Discussion of information privacy has exploded, spurred by increasing awareness of data’s collection and use. Confusion reigns, however, for reasons such as:
- Data is often collected behind a veil of secrecy. That’s top-of-mind these days, in light of the Snowden/Greenwald revelations.
- Nobody understands all of the various technologies involved. Telecom experts don’t know a lot about data management and analysis, and vice-versa, while the political reporters don’t understand much about technology at all. I think numerous reporting errors have resulted.
- There’s no successful theory explaining when privacy should and shouldn’t be preserved. To put it quite colloquially:
- Big Brother is watching you …
- … and he’s scary.
- Privacy theory focuses on the “watching” part …
- … but the “scary” part is what really needs to be addressed.
Let’s address the last point. Read more
I’m not having a productive week, part of the reason being a hard drive crash that took out early drafts of what were to be last weekend’s blog posts. Now I’m operating from a laptop, rather than my preferred dual-monitor set-up. So please pardon me if I’m concise even by comparison to my usual standards.
- My recent posts based on surveillance news have been partly superseded by – well, by more news. Some of that news, along with some good discussion, may be found in the comment threads.
- The same goes for my recent Hadoop posts.
- The replay for my recent webinar on real-time analytics is now available. My part ran <25 minutes.
- One of my numerous clients using or considering a “real-time analytics” positioning is Sqrrl, the company behind the NoSQL DBMS Accumulo. Last month, Derrick Harris reported on a remarkable Accumulo success story – multiple US intelligence instances managing 10s of petabytes each, and supporting a variety of analytic (I think mainly query/visualization) approaches.
- Several sources have told me that MemSQL’s Zynga sale is (in part) for Membase replacement. This is noteworthy because Zynga was the original pay-for-some-of-the-development Membase customer.
- More generally, the buzz out of Couchbase is distressing. Ex-employees berate the place; job-seekers check around and then decide not to go there; rivals tell me of resumes coming out in droves. Yes, there’s always some of that, even at obviously prospering companies, but this feels like more than the inevitable low-level buzz one hears anywhere.
- I think the predictive modeling state of the art has become:
- Cluster in some way.
- Model separately on each cluster.
- And if you still want to do something that looks like a regression – linear or otherwise – then you might want to use a tool that lets you shovel training data in WITHOUT a whole lot of preparation* and receive a model back out. Even if you don’t accept that as your final model, it can at least be a great guide to feature selection (in the statistical sense of the phrase) and the like.
- Champion/challenger model testing is also a good idea, at least if you’re in some kind of personalization/recommendation space, and have enough traffic to test like that.**
- Most companies have significant turnover after being acquired, perhaps after a “golden handcuff” period. Vertica is no longer an exception.
- Speaking of my clients at HP Vertica – they’ve done a questionable job of communicating that they’re willing to price their product quite reasonably. (But at least they allowed me to write about $2K/terabyte for hardware/software combined.)
- I’m hearing a little more Amazon Redshift buzz than I expected to. Just a little.
- StreamBase was bought by TIBCO. The rumor says $40 million.
*Basic and unavoidable ETL (Extract/Transform/Load) of course excepted.
**I could call that ABC (Always Be Comparing) or ABT (Always Be Testing), but they each sound like – well, like The Glove and the Lions.
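The "cluster, then model separately on each cluster" idea from the list above can be sketched in a few lines. This is a minimal illustration, not any vendor's method: the function names (`kmeans_1d`, `fit_line`, `cluster_then_model`) are hypothetical, the clustering is a toy one-dimensional k-means, and the per-cluster model is an ordinary least-squares line.

```python
from statistics import mean


def kmeans_1d(values, k, iterations=20):
    """Toy 1-D k-means; returns one cluster label per value."""
    ordered = sorted(values)
    # Deterministic start: centers spread evenly across the sorted values.
    centers = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assign each value to its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = mean(members)
    return labels


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        # No spread in x: fall back to a flat line at the mean of y.
        return 0.0, y_bar
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def cluster_then_model(xs, ys, k=2):
    """Cluster on x, then fit a separate line inside each cluster."""
    labels = kmeans_1d(xs, k)
    models = {}
    for c in sorted(set(labels)):
        cx = [x for x, lab in zip(xs, labels) if lab == c]
        cy = [y for y, lab in zip(ys, labels) if lab == c]
        models[c] = fit_line(cx, cy)
    return labels, models


if __name__ == "__main__":
    # Made-up data: two segments that follow different linear relationships.
    xs = [float(i) for i in range(10)] + [float(i) for i in range(100, 110)]
    ys = [2 * x + 1 for x in xs[:10]] + [-3 * x + 500 for x in xs[10:]]
    labels, models = cluster_then_model(xs, ys)
    for cluster, (slope, intercept) in models.items():
        print(cluster, slope, intercept)
```

A single line fitted to all of the made-up data would describe neither segment well, which is the motivation for modeling each cluster on its own.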
Over the past week, discussion has exploded about US government surveillance. After summarizing, as best I could, what data the government appears to collect, now I’d like to consider what they actually do with it. More precisely, I’d like to focus on the data’s use(s) in combating US-soil terrorism. In a nutshell:
- Reporting is persuasive that electronic surveillance data is helpful in following up on leads and tips obtained by other means.
- Reporting is not persuasive that electronic surveillance data on its own uncovers or averts many terrorist plots.
- With limited exceptions, neither evidence nor logic suggests that data mining or predictive modeling does much to prevent domestic terrorist attacks.
Consider the example of Tamerlan Tsarnaev:
In response to this 2011 request, the FBI checked U.S. government databases and other information to look for such things as derogatory telephone communications, possible use of online sites associated with the promotion of radical activity, associations with other persons of interest, travel history and plans, and education history.
While that response was unsuccessful in preventing a dramatic act of terrorism, at least they tried.
As for actual success stories — well, that’s a bit tough. In general, there are few known examples of terrorist plots being disrupted by law enforcement in the United States, except for fake plots engineered to draw terrorist-leaning individuals into committing actual crimes. One of those examples, that of Najibullah Zazi, was indeed based on an intercepted email — but the email address itself was uncovered through more ordinary anti-terrorism efforts.
As for machine learning/data mining/predictive modeling, I’ve never seen much of a hint of it being used in anti-terrorism efforts, whether in the news or in my own discussions inside the tech industry. And I think there’s a great reason for that — what would they use for a training set? Here’s what I mean. Read more
|Categories: Application areas, Predictive modeling and advanced analytics, RDF and graphs, Surveillance and privacy, Text||9 Comments|
Edit: Please see the comment thread below for updates. Please also see a follow-on post about how the surveillance data is actually used.
US government surveillance has exploded into public consciousness since last Thursday. With one major exception, the news has just confirmed what was already thought or known. So where do we stand?
My views about domestic data collection start:
- I’ve long believed that the Feds — specifically the NSA (National Security Agency) — are storing metadata/traffic data on every telephone call and email in the US. The recent news, for example Senator Feinstein’s responses to the Verizon disclosure, just confirms it. That the Feds sometimes claim this has to be “foreign” data or they won’t look at it hardly undermines my opinion.
- Even private enterprises can more or less straightforwardly buy information about every credit card purchase we make. So of course the Feds can get that as well, as the Wall Street Journal seems to have noticed. More generally, I’d assume the Feds have all the financial data they want, via the IRS if nothing else.
- Similarly, many kinds of social media postings are aggregated for anybody to purchase, or can be scraped by anybody who invests in the equipment and bandwidth. Attensity’s service is just one example.
- I’m guessing that web use data (http requests, search terms, etc.) is not yet routinely harvested by the US government.* Ditto deanonymization of same. I guess that way basically because I’ve heard few rumblings to the contrary. Further, the consumer psychographic profiles that are so valuable to online retailers might be of little help to national security analysts anyway.
- Video surveillance seems likely to grow, from fixed cameras perhaps to drones; note for example the various officials who called for more public cameras after the Boston Marathon bombing. But for the present discussion, that’s of lesser concern to me, simply because it’s done less secretively than other kinds of surveillance. If there’s a camera that can see us, often we can see it too.
*Recall that these comments are US-specific. Data retention legislation has been proposed or passed in multiple countries to require recording of, among other things, all URL requests, with the stated goal of fighting either digital piracy or child pornography.
As for foreign data: Read more
|Categories: Hadoop, HP and Neoview, Petabyte-scale data management, Pricing, Surveillance and privacy, Telecommunications, Text, Vertica Systems, Web analytics||10 Comments|
2. Numerous vendors are blending SQL and JSON management in their short-request DBMS. It will take some more work for me to have a strong opinion about the merits/demerits of various alternatives.
The default implementation — one example would be Clustrix’s — is to stick the JSON into something like a BLOB/CLOB field (Binary/Character Large Object), index on individual values, and treat those indexes just like any others for the purpose of SQL statements. Drawbacks include:
- You have to store or retrieve the JSON in whole documents at a time.
- If you are spectacularly careless, you could write JOINs with odd results.
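The general pattern described above (whole JSON document in a large-object field, ordinary indexes on individual values, plain SQL on top) can be sketched as follows. This is not Clustrix's or any other vendor's implementation: SQLite stands in for the DBMS, the table and column names are hypothetical, and the indexed values are copied out of the document by application code at insert time.

```python
import json
import sqlite3


def make_store():
    """Create an in-memory database with a JSON-bearing orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE orders ("
        " id INTEGER PRIMARY KEY,"
        " doc TEXT NOT NULL,"      # whole JSON document, stored like a CLOB
        " customer_id INTEGER,"    # individual value copied out of the document
        " status TEXT)"            # another extracted value
    )
    # The extracted values get ordinary indexes, usable by any SQL statement.
    conn.execute("CREATE INDEX orders_by_customer ON orders (customer_id)")
    conn.execute("CREATE INDEX orders_by_status ON orders (status)")
    return conn


def insert_order(conn, document):
    """Store the document whole, plus the values we want to index."""
    conn.execute(
        "INSERT INTO orders (doc, customer_id, status) VALUES (?, ?, ?)",
        (json.dumps(document), document.get("customer_id"), document.get("status")),
    )


def open_orders_for(conn, customer_name):
    """Join relational data to JSON documents through the indexed values."""
    rows = conn.execute(
        "SELECT o.doc FROM orders AS o"
        " JOIN customers AS c ON c.id = o.customer_id"
        " WHERE c.name = ? AND o.status = 'open'"
        " ORDER BY o.id",
        (customer_name,),
    ).fetchall()
    # Each hit comes back as a whole document, which is the first drawback above.
    return [json.loads(doc) for (doc,) in rows]


if __name__ == "__main__":
    conn = make_store()
    conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Acme')")
    insert_order(conn, {"customer_id": 1, "status": "open", "items": ["widget"]})
    insert_order(conn, {"customer_id": 1, "status": "shipped", "items": ["gear"]})
    print(open_orders_for(conn, "Acme"))
```

Because the indexed columns are only as trustworthy as the extraction step, a document missing `customer_id` ends up with a NULL join key, which is one way a careless JOIN could produce odd results.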
IBM DB2 is one recent arrival to the JSON party. Unfortunately, I forgot to ask whether IBM’s JSON implementation was based on IBM DB2 pureXML when I had the chance, and IBM hasn’t gotten around to answering my followup query.
3. Nor has IBM gotten around to answering my followup queries on the subject of BLU, an interesting-sounding columnar option for DB2.
4. Numerous clients have asked me whether they should be active in DBaaS (DataBase as a Service). After all, Amazon, Google, Microsoft, Rackspace and salesforce.com are all in that business in some form, and other big companies have dipped toes in as well. Read more
I’m usually annoyed by lists of year-end predictions. Still, a reporter asked me for some, and I found one kind I was comfortable making.
Trends that I think will continue in 2013 include:
Growing attention to machine-generated data. Human-generated data grows at the rate business activity does, plus 0-25%. Machine-generated data grows at the rate of Moore’s Law, also plus 0-25%, which is a much higher total. In particular, the use of remote machine-generated data is becoming increasingly real.
Hadoop adoption. Everybody has the big bit bucket use case, largely because of machine-generated data. Even today’s technology is plenty good enough for that purpose, and hence justifies initial Hadoop adoption. Development of further Hadoop technology, which I post about frequently, is rapid. And so the Hadoop trend is very real.
Application SaaS. The on-premises application software industry has hopeless problems with product complexity and rigidity. Any suite new enough to cut the Gordian Knot is or will be SaaS (Software as a Service).
Newer BI interfaces. Advanced visualization — e.g. Tableau or QlikView — and mobile BI are both hot. So, more speculatively, are “social” BI (Business Intelligence) interfaces.
Price discounts. If you buy software at 50% of list price, you’re probably doing it wrong. Even 25% can be too high.
MySQL alternatives. NoSQL and NewSQL products often are developed as MySQL alternatives. Oracle has actually done a good job on MySQL technology, but now its business practices are scaring companies away from MySQL commitments, and newer short-request SQL DBMS are ready for use.
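The growth comparison in the machine-generated data item above can be made concrete with a little arithmetic. This is a rough sketch under assumptions that are not in the post: business activity is taken to grow 5% a year, Moore's Law is read as a doubling every two years (about 41% a year), and the "plus 0-25%" is treated as an addition to the annual rate. All function names are illustrative.

```python
def annual_rate_from_doubling(years_to_double):
    """Annual growth rate implied by doubling every `years_to_double` years."""
    return 2 ** (1 / years_to_double) - 1


def growth_factor(annual_rate, years):
    """Total multiple after compounding `annual_rate` for `years` years."""
    return (1 + annual_rate) ** years


def compare_growth(business_rate, doubling_years, extra, years):
    """Return (human_factor, machine_factor) for the same `extra` uplift."""
    human = growth_factor(business_rate + extra, years)
    machine = growth_factor(annual_rate_from_doubling(doubling_years) + extra, years)
    return human, machine


if __name__ == "__main__":
    # Assumed inputs: 5% business growth, 2-year doubling, 5-year horizon.
    for extra in (0.0, 0.25):
        human, machine = compare_growth(0.05, 2, extra, 5)
        print(f"extra={extra:.2f} human x{human:.2f} machine x{machine:.2f}")
```

Under these assumptions the machine-generated total pulls well ahead within a few years at either end of the 0-25% range; different assumed rates change the size of the gap but not its direction, as long as the doubling-based rate exceeds business growth.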
|Categories: Business intelligence, Hadoop, MySQL, NewSQL, NoSQL, Open source, Oracle, Pricing, Software as a Service (SaaS), Surveillance and privacy||3 Comments|