Discussion of public policy around technological issues, especially but not only surveillance and privacy.
A couple of points that arise frequently in conversation, but that I don’t seem to have made clearly online.
“Metadata” is generally defined as “data about data”. That’s basically correct, but it’s easy to forget how many different kinds of metadata there are. My list of metadata kinds starts with:
- Data about data structure. This is the classical sense of the term. But please note:
- In a relational database, structural metadata is rather separate from the data itself.
- In a document database, each document might carry structure information with it.
- Other inputs to core data management functions. Two major examples are:
- Column statistics that inform RDBMS optimizers.
- Value ranges that inform partition pruning or, more generally, data skipping.
- Inputs to ancillary data management functions — for example, security privileges.
- Support for human decisions about data — for example, information about authorship or lineage.
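To make the "value ranges that inform data skipping" item concrete, here is a minimal sketch. It assumes each partition records the minimum and maximum of one column; the `Partition` class, the `partitions_to_scan` function, and the date-style integers are all hypothetical illustrations, not any particular product's catalog format.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str
    min_value: int  # smallest value of the filter column in this partition
    max_value: int  # largest value of the filter column in this partition


def partitions_to_scan(partitions, low, high):
    """Keep only partitions whose [min, max] range overlaps the query range [low, high]."""
    return [p for p in partitions if p.max_value >= low and p.min_value <= high]


# Hypothetical yearly partitions keyed on a YYYYMMDD integer column.
partitions = [
    Partition("p2011", 20110101, 20111231),
    Partition("p2012", 20120101, 20121231),
    Partition("p2013", 20130101, 20131231),
]

# A query for June 2012 only needs to touch one partition.
selected = partitions_to_scan(partitions, 20120601, 20120630)
print([p.name for p in selected])
```

The point of the sketch is that the skipping decision is made from the ranges alone, without reading any of the underlying rows, which is why such ranges count as metadata in the second sense above.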
What’s worse, the past year’s most famous example of “metadata”, telephone call metadata, is misnamed. This so-called metadata, much loved by the NSA (National Security Agency), is just data, e.g. in the format of a CDR (Call Detail Record). Calling it metadata implies that it describes other data — the actual contents of the phone calls — that the NSA strenuously asserts don’t actually exist.
And finally, the first bullet point above has a counter-intuitive consequence — all common terminology notwithstanding, relational data is less structured than document data. Reasons include:
- Relational databases usually just hold strings — or maybe numbers — with structural information being held elsewhere.
- Some document databases store structural metadata right with the document data itself.
- Some document databases store data in the form of (name, value) pairs. In some cases additional structure is imposed by naming conventions.
- Actual text documents carry the structure imposed by grammar and syntax.
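A small sketch may help show the contrast. It assumes a toy call record held two ways; the `schema` list, the field names, and both helper functions are made up for illustration and are not taken from any specific DBMS.

```python
import json

# Relational style: the row is just values. Names and types are held elsewhere,
# in what a real RDBMS would keep in its catalog.
schema = [("caller", str), ("callee", str), ("seconds", int)]
row = ("555-0100", "555-0199", 42)

# Document style: every record carries its own field names and nesting.
document = json.loads(
    '{"caller": "555-0100", "callee": "555-0199", "duration": {"seconds": 42}}'
)


def describe_row(row, schema):
    """A row can only be interpreted by joining it with the separate schema."""
    return {name: value for (name, _type), value in zip(schema, row)}


def field_paths(doc, prefix=""):
    """A document's structure can be read off the document itself."""
    paths = []
    for key, value in doc.items():
        path = prefix + key
        if isinstance(value, dict):
            paths.extend(field_paths(value, path + "."))
        else:
            paths.append(path)
    return paths


print(describe_row(row, schema))
print(field_paths(document))
```

Note that `field_paths` needs nothing but the document, whereas `describe_row` is useless without the schema passed in alongside the row.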
- A lengthy survey of metadata kinds, biased to Hadoop (August, 2012)
- Metadata as derived data (May, 2011)
- Dataset management (May, 2013)
- Structured/unstructured … multi-structured/poly-structured (May, 2011)
|Categories: Data models and architecture, Hadoop, Structured documents, Surveillance and privacy, Telecommunications||4 Comments|
1. Censorship worries me, a lot. A classic example is Vietnam, which basically has outlawed online political discussion.
And such laws can have teeth. It’s hard to conceal your internet usage from an inquisitive government.
2. Software and software related patents are back in the news. Google, which said it was paying $5.5 billion or so for a bunch of Motorola patents, turns out to really have paid $7 billion or more. Twitter and IBM did a patent deal as well. Big numbers, and good for certain shareholders. But this all benefits the wider world — how?
The purpose of legal intellectual property protections, simply put, is to help make it a good decision to create something. …
Why does “securing … exclusive Right[s]” to the creators of things that are patented, copyrighted, or trademarked help make it a good decision for them to create stuff? Because it averts competition from copiers, thus making the creator a monopolist in what s/he has created, allowing her to at least somewhat value-price her creation.
I.e., the core point of intellectual property rights is to prevent copying-based competition. By way of contrast, any other kind of intellectual property “right” should be viewed with great suspicion.
That Constitutionally-based principle makes as much sense to me now as it did then. By way of contrast, “Let’s give more intellectual property rights to big corporations to protect middle-managers’ jobs” is — well, it’s an argument I view with great suspicion.
But I find it extremely hard to think of a technology industry example in which development was stimulated by the possibility of patent protection. Yes, the situation may be different in pharmaceuticals, or for gadgeteering home inventors, but I can think of no case in which technology has been better, or faster to come to market, because of the possibility of a patent-law monopoly. So if software and business-method patents were abolished entirely – even the ones that I think could be realistically adjudicated — I’d be pleased.
3. In November, 2008 I offered IT policy suggestions for the incoming Obama Administration, especially: Read more
|Categories: Buying processes, Google, IBM and DB2, Public policy, Surveillance and privacy||1 Comment|
In response to the uproar created by the Edward Snowden revelations, the White House commissioned five dignitaries to produce a 300-page report, released last December 12. (Official name: Report and Recommendations of The President’s Review Group on Intelligence and Communications Technologies.) I read or skimmed a large minority of it, and I found enough substance to be worthy of a blog post.
Many of the report’s details fall in the buckets of bureaucratic administrivia,* internal information security, or general pabulum. But the commission started with four general principles that I think have great merit. Read more
Thanks to a court decision that overturned some existing regulations, network neutrality is back in the news. Most people think the key issue is whether
- Telecommunication companies (e.g. wireless and/or broadband services providers) should be allowed to charge …
- … other internet companies (website owners, game companies, streaming media providers, etc., collectively known as edge providers) for …
- … shipping data to internet service consumers in particularly attractive ways.
But I think some forms of charging can be OK — albeit not the ones currently being discussed — and so the question should instead be how the charges are designed.
When I wrote about network neutrality in 2006-7, the issue was mainly whether broadband providers would be allowed to ship different kinds of data at different speeds or reliability. Now the big controversy is whether mobile data providers should be allowed to accept “sponsorship” so as to have certain kinds of data not count against mobile data plan volume caps. Either way:
- The “anything goes” strategy has obvious free-market appeal.
- But proponents of network neutrality regulation — such as Fred Wilson and Nilay Patel — point out a major risk: By striking deals that smaller companies can’t imitate, large, established “edge provider” services may strangle upstart competitors in their cribs.
I think the anti-discrimination argument for network neutrality has much merit. But I also think there are some kinds of payment structure that could leave the playing field fairly level. Imagine, if you will, that: Read more
I think that most sufficiently large enterprise SaaS vendors should offer an appliance option, as an alternative to the core multi-tenant service. In particular:
- SaaS appliances address customer fears about security, privacy, compliance, performance isolation, and lock-in.
- Some of these benefits occur even if the appliance runs in the same data centers that host the vendor’s standard multi-tenant SaaS. Most of the rest occur if the customer can choose a co-location facility in which to place the appliance.
- Whether many customers should or will use the SaaS appliance option is somewhat secondary; it’s a check-mark item. I.e., many customers and prospects will be pleased that the option at least exists.
How I reached them
Core reasons for selling or using SaaS (Software as a Service) as opposed to licensed software start:
- The SaaS vendor handles all software upgrades, and makes them promptly. In principle, this benefit could also be achieved on a dedicated system on customer premises (or at the customer’s choice of co-location facility).
- In addition, the SaaS vendor handles all the platform and operational stuff — hardware, operating system, computer room, etc. This benefit is antithetical to direct customer control.
- The SaaS vendor only has to develop for and operate on a tightly restricted platform stack that it knows very well. This benefit is also enjoyed in the case of customer-premises appliances.
Conceptually, then, customer-premises SaaS is not impossible, even though one of the standard Big Three SaaS benefits is lost. Indeed:
- Microsoft Windows and many other client software packages already offer to let their updates be automagically handled by the vendor.
- In that vein, consumer devices such as game consoles already are a kind of SaaS appliance.
- Complex devices of any kind, including computers, will see ever more in the way of “phone-home” features or optional services, often including routine maintenance and upgrades.
But from an enterprise standpoint, that’s all (relatively) simple stuff. So we’re left with a more challenging question — does customer-premises SaaS make sense in the case of enterprise applications or other server software?
|Categories: Data warehouse appliances, HP and Neoview, salesforce.com, Software as a Service (SaaS), Surveillance and privacy||5 Comments|
I’ve posted a lot about surveillance and privacy intrusion. Even so, I have a few more things to say.
1. Surveillance and privacy intrusion do, of course, have real benefits. That’s a big part of why I advocate a nuanced approach to privacy regulation. Several of those benefits are mentioned below.
2. Nobody’s opinion about privacy rules should be based on the exact state of surveillance today, for at least two reasons:
- The disclosures keep coming.
- Technology keeps changing.
In particular, people may not realize how comprehensive surveillance will get, due largely to the “internet of things”. The most profound reason — and this will take decades to fully play out — is that we’re headed toward a medical revolution in which our vital signs are more or less continually monitored as we go about our business. Such monitoring will, of course, provide a very detailed record of people’s activities and perhaps even states of mind. Further, vehicle movements will all be tracked and our mobile devices will keep noting our location, in each case for multiple reasons.
- Stores CDRs (Call Detail Records), many or all of which are collected via …
- … some kind of back door into the AT&T switches that many carriers use. (See Slide 2.)
- Has also included “subscriber information” for AT&T phones since July, 2012.
- Contains “long distance and international” CDRs back to 1987.
- Currently adds 4 billion CDRs per day.
- Is administered by a Federal drug-related law enforcement agency but …
- … is used to combat many non-drug-related crimes as well. (See Slides 21-26.)
Other notes include:
- The agencies specifically mentioned on Slide 16 as making numerous Hemisphere requests are the DEA (Drug Enforcement Administration) and DHS (Department of Homeland Security).
- “Roaming” data giving city/state is mentioned in the deck, but more precise geo-targeting is not.
I’ve never gotten a single consistent figure, but typical CDR size seems to be in the 100s of bytes range. So I conjecture that Project Hemisphere spawned one of the first petabyte-scale databases ever.
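The back-of-envelope arithmetic behind that conjecture can be sketched as follows. The only figures from the deck are the 4 billion CDRs per day and the 1987 start; the per-CDR sizes tried here are my guesses within the "100s of bytes" range, and the calculation assumes the current daily rate held constant, which it surely did not in earlier years. It also ignores compression, indexes, and replication.

```python
CDRS_PER_DAY = 4_000_000_000  # current daily additions, per the slide deck
PETABYTE = 10**15             # bytes, decimal definition


def days_to_petabyte(bytes_per_cdr, cdrs_per_day=CDRS_PER_DAY):
    """Days of raw CDR intake needed to accumulate one petabyte at a constant rate."""
    daily_bytes = bytes_per_cdr * cdrs_per_day
    return PETABYTE / daily_bytes


# Assumed CDR sizes; none of these is a confirmed figure.
for size in (100, 250, 500):
    terabytes_per_day = size * CDRS_PER_DAY / 10**12
    years = days_to_petabyte(size) / 365
    print(f"{size} bytes/CDR: {terabytes_per_day:.1f} TB/day, about {years:.1f} years per PB")
```

Under these assumptions a petabyte of raw records accumulates in somewhere between roughly one and seven years at today's rate, so the open question is how early the volume became large enough for that to matter.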
Hemisphere Project unknowns start: Read more
|Categories: Data warehousing, GIS and geospatial, Petabyte-scale data management, Specific users, Surveillance and privacy, Telecommunications||Leave a Comment|
For years I’ve argued three points about privacy intrusions and surveillance:
- Privacy intrusions are a huge threat to liberty. Since the Snowden revelations started last June, this view has become more widely accepted.
- Much of the problem is the very chilling effects they can have upon the exercise of day-to-day freedoms. Fortunately, I’m not as alone in saying that as I once feared. For example, Christopher Slobogin made that point in a recent CNN article, and then pointed me to a paper* citing other people echoing it, including Sonia Sotomayor.
- Liberty can’t be effectively protected just by controls on the collection, storage, or dissemination of data; direct controls are needed on the use of data as well. Use-based data controls are much more robust in the face of technological uncertainty and change than possession-based ones are.
Since that last point is still very much a minority viewpoint,** I’ll argue it one more time below. Read more
I made a remarkably rumpled video appearance yesterday with SiliconAngle honchos John Furrier and Dave Vellante. (Excuses include <3 hours sleep, and then a scrambling reaction to a schedule change.) Topics covered included, with approximate timechecks:
- 0:00 Introductory pabulum, and some technical difficulties
- 2:00 More introduction
- 3:00 Dynamic schemas and data model churn
- 6:00 Surveillance and privacy
- 13:00 Hadoop, especially the distro wars
- 22:00 BI innovation
- 23:30 More on dynamic schemas and data model churn
Edit: Some of my remarks were transcribed.
- I posted on dynamic schemas and data model churn a few days ago.
- I capped off a series on privacy and surveillance a few days ago.
- I commented on various Hadoop distributions in June.
|Categories: Business intelligence, ClearStory Data, Data warehousing, Hadoop, MapR, MapReduce, Surveillance and privacy||Leave a Comment|
I’ve been harping on the grave dangers of surveillance and privacy intrusion. Clearly, something must be done to rein them in. But what?
Well, let’s look at an older and better-understood subject — governmental use of force. Governments, by their very nature, possess tools for tyranny: armies, police forces, and so on. So how do we avoid tyranny? We limit what government is allowed to do with those tools, and we teach our citizens — especially those who serve in government — to obey and enforce the limits.
Those limits can be lumped into two categories:
- Direct — there are very strong controls as to when and how the government may use force.
- Indirect — there are also controls on how the government can even threaten the use of force. I.e., substantially all laws are ultimately backed up by the threat of governmental force — and there are limits as to which laws may or may not be enacted.
The story is similar for surveillance technology:
- As data gathering and analysis technologies skyrocket in power, they become ever more powerful tools for tyranny.
- Direct controls are called for — there is some surveillance the government is not and should not be allowed to do.
- Indirect controls are also necessary — even when it has information, there are ways in which the government should not be allowed to use it.
But there’s a big difference between the cases of physical force and surveillance.
- The direct controls on the use of force are strong; under ordinary circumstances, government is NOT allowed to just go out and shoot somebody.
- The direct controls on surveillance, however, are very weak; government has access to all kinds of information.