Data model churn
Perhaps we should remind ourselves of the many ways data models can churn. Here are some examples that are top-of-mind for me. They do overlap a lot — and the whole discussion overlaps with my post about schema complexity last January, and more generally with what I’ve written about dynamic schemas for the past several years.
Just to confuse things further — some of these examples show the importance of RDBMS, while others highlight the relational model’s limitations.
The old standbys
Product and service changes. Simple changes to your product line may not require any changes to the databases recording their production and sale. More complex product changes, however, probably will.
A big help in MCI’s rise was its Friends and Family service offering, introduced in the early 1990s. AT&T couldn’t respond quickly, because it couldn’t get the programming done, where by “programming” I mainly mean database integration and design. If all that was before your time, this link seems like a fairly contemporaneous case study.
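To make the product-change point concrete, here is a minimal, entirely hypothetical sketch (SQLite via Python; all table names invented) of the kind of schema churn a seemingly modest product change can force. Suppose each order originally referenced exactly one product; a new "bundle" offering, several products sold together, breaks that assumption.

```python
# Hypothetical sketch of product-change-driven data model churn,
# using an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")

# Original model: one product per order.
conn.executescript("""
    CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders  (id INTEGER PRIMARY KEY,
                          product_id INTEGER REFERENCES product(id),
                          price_cents INTEGER);
""")

# After the product change: orders reference an "offering", which may be
# a single product or a bundle; a junction table lists its contents.
conn.executescript("""
    CREATE TABLE offering (id INTEGER PRIMARY KEY, kind TEXT);
    CREATE TABLE offering_product (
        offering_id INTEGER REFERENCES offering(id),
        product_id  INTEGER REFERENCES product(id),
        PRIMARY KEY (offering_id, product_id));

    -- Migrate: every existing product becomes a single-product offering.
    INSERT INTO offering (id, kind) SELECT id, 'product' FROM product;
    INSERT INTO offering_product SELECT id, id FROM product;

    -- Rebuild the orders table around the new foreign key.
    CREATE TABLE orders_v2 (id INTEGER PRIMARY KEY,
                            offering_id INTEGER REFERENCES offering(id),
                            price_cents INTEGER);
    INSERT INTO orders_v2 SELECT id, product_id, price_cents FROM orders;
    DROP TABLE orders;
    ALTER TABLE orders_v2 RENAME TO orders;
""")
conn.commit()
```

And that's the easy case; every query, report, and integration job written against the old orders table has to be revisited too.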
Organizational changes. A common source of hassle, especially around databases that support business intelligence or planning/budgeting, is organizational change. Kalido’s whole business was based on accommodating that, last I checked, as were a lot of BI consultants’. Read more
Categories: Data warehousing, Derived data, Kalido, Log analysis, Software as a Service (SaaS), Specific users, Text, Web analytics | 3 Comments |
“Disruption” in the software industry
I lampoon the word “disruptive” for being badly overused. On the other hand, I often refer to the concept myself. Perhaps I should clarify. 🙂
You probably know that the modern concept of disruption comes from Clayton Christensen, specifically in The Innovator’s Dilemma and its sequel, The Innovator’s Solution. The basic ideas are:
- Market leaders serve high-end customers with complex, high-end products and services, often distributed through a costly sales channel.
- Upstarts serve a different market segment, often cheaply and/or simply, perhaps with a different business model (e.g. a different sales channel).
- Upstarts expand their offerings, and eventually attack the leaders in their core markets.
In response (this is the Innovator’s Solution part):
- Leaders expand their product lines, increasing the value of their offerings in their core markets.
- In particular, leaders expand into adjacent market segments, capturing margins and value even if their historical core businesses are commoditized.
- Leaders may also diversify into direct competition with the upstarts, but that generally works only if it’s via a separate division, perhaps acquired, that has permission to compete hard with the main business.
But not all cleverness is “disruption”.
- Routine product advancement by leaders — even when it’s admirably clever — is “sustaining” innovation, as opposed to the disruptive stuff.
- Innovative new technology from small companies is not, in itself, disruption either.
Here are some of the examples that make me think of the whole subject. Read more
What our legislators should do about privacy (and aren’t)
I’ve been harping on the grave dangers of surveillance and privacy intrusion. Clearly, something must be done to rein them in. But what?
Well, let’s look at an older and better-understood subject — governmental use of force. Governments, by their very nature, possess tools for tyranny: armies, police forces, and so on. So how do we avoid tyranny? We limit what government is allowed to do with those tools, and we teach our citizens — especially those who serve in government — to obey and enforce the limits.
Those limits can be lumped into two categories:
- Direct — there are very strong controls as to when and how the government may use force.
- Indirect — there are also controls on how the government can even threaten the use of force. I.e., substantially all laws are ultimately backed up by the threat of governmental force — and there are limits as to which laws may or may not be enacted.
The story is similar for surveillance technology:
- As data gathering and analysis technologies skyrocket in power, they become ever more powerful tools for tyranny.
- Direct controls are called for — there is some surveillance the government should simply not be allowed to do.
- Indirect controls are also necessary — even when it has information, there are ways in which the government should not be allowed to use it.
But there’s a big difference between the cases of physical force and surveillance.
- The direct controls on the use of force are strong; under ordinary circumstances, government is NOT allowed to just go out and shoot somebody.
- The direct controls on surveillance, however, are very weak; government has access to all kinds of information.
Categories: Surveillance and privacy | 5 Comments |
Very chilling effects
I’ve worried for years about a terrible and under-appreciated danger of privacy intrusion, which in a recent post I characterized as a chilling effect upon the exercise of ordinary freedoms. When government — or an organization such as your employer, your insurer, etc. — watches you closely, it can be dangerous to deviate from the norm. Even the slightest non-conformity could have serious consequences. I wish that were an exaggeration; let’s explore why it isn’t.
Possible difficulties — most of them a little bit futuristic — include:
- Being perceived as a potential terrorist or terrorist sympathizer. That’s a biggie, of course, at least in “free” countries. Even getting on the No-Fly List is enough to pretty much shut down your travel, and hence your options in many careers. If you want to avoid such problems, it might be prudent not to:
- Visit certain websites.
- Email, telephone, or otherwise communicate with certain people.
- Use certain words or phrases in email or on the telephone.
- Being regarded as too vehement a political dissenter in general. Political dissent is deadly dangerous in too many countries around the world, and has costs even in “free” countries. (Jacob Appelbaum is one recent US example.) To avoid such problems, there are a whole lot of things you might think twice about writing, saying, or doing, and certain people it’s definitely risky to associate with or write nice things about.
- Not being regarded as likely to be a loyal, hard-working, accepting employee. Think about the difficulties “over-qualified” candidates have getting hired. Then consider what might happen if employers had (accurate or otherwise) psychographic profiles estimating who was most likely to stay at a job, to accept boring job tasks or long hours, or to tolerate sub-standard pay. Then consider how wise it might be to show interest in, for example:
- Other careers.
- Certain hobbies that might be construed as leading to other careers.
- Living in other parts of the country.
- Being perceived as likely to engage in socially-unapproved sexual behavior. In the United States, certain sexual choices — even among consenting adults — could cause problems with discrimination, child custody, or divorce. Elsewhere, your choice of partner could lead to prison or even death. (I don’t know exactly which shopping choices could get one identified as a possible homosexual or philanderer … but just to be on the safe side, you might not want to download any Barbra Streisand songs. 🙂 )
- Being regarded as a poor health or safety risk for employment, insurance, or more. Do you like fatty foods? Extreme sports? Night clubs? Recreational drugs? Tobacco? More than a little alcohol? Fast cars? Fast women? Evidence of any of those tastes could move you up the risk charts for heart attack, accident, marital dissolution or some other outcome that an employer or insurer wouldn’t like.
Categories: Surveillance and privacy | 11 Comments |
Investigative analytics and untrusted code — a quick note
This is probably a good time to disclose that I own a chunk of founders’ stock — no, I didn’t pay cash for it — in LiteStack, the start-up sponsoring ZeroVM.
Jordan Novet posted a survey of Hadoop security, and evidently Merv Adrian is making a big deal about the subject as well. But there’s one point I rarely see mentioned which, come to think of it, could apply to relational analytic platforms as well.
A big use of Hadoop and analytic platforms alike is investigative analytics, and specifically experimentation via hastily-written code. But untrusted code can, at least in theory, compromise the security of the servers it runs on. And when you run the code on the same servers that manage the data, that could compromise the security of your database as well.
Frankly, in most use cases I doubt this is a big deal. Process isolation would probably avert most “accidental attacks”, and a deliberate attack might be hard to pull off in a reliable manner. As for database corruption, also a theoretical danger via the same vector — that danger is much smaller than the risk of bad code being submitted by well-intentioned doofuses.
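For what it’s worth, here is a minimal sketch of the separate-process principle I have in mind (Linux-assumed, all paths hypothetical, and far weaker than real sandboxing along the lines of ZeroVM, containers, or seccomp):

```python
# Minimal sketch: run hastily-written analysis code in a separate,
# resource-limited child process, so a crash or exploit can't touch
# the calling (e.g. database-adjacent) process directly.
import resource
import subprocess

def run_untrusted(script_path: str, timeout_s: int = 300) -> str:
    def limit_resources():
        # Cap address space at 1 GiB and CPU seconds at the timeout;
        # these limits apply only to the child, not to the caller.
        resource.setrlimit(resource.RLIMIT_AS, (2**30, 2**30))
        resource.setrlimit(resource.RLIMIT_CPU, (timeout_s, timeout_s))

    result = subprocess.run(
        ["python3", script_path],
        preexec_fn=limit_resources,  # applied inside the child, pre-exec
        capture_output=True,
        text=True,
        timeout=timeout_s,
        cwd="/tmp/sandbox",          # hypothetical scratch directory
        env={"PATH": "/usr/bin"},    # strip the inherited environment
    )
    return result.stdout
```

Real isolation technology goes much further than this; the sketch only shows the separate-process idea, not a serious defense.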
Still, I’d like to see a forthright discussion of this threat.
Categories: Hadoop | 2 Comments |
The refactoring of everything
I’ll start with three observations:
- Computer systems can’t be entirely tightly coupled — nothing would ever get developed or tested.
- Computer systems can’t be entirely loosely coupled — nothing would ever get optimized, in performance and functionality alike.
- In an ongoing trend, there is and will be dramatic refactoring as to which connections wind up being loose or tight.
As written, that’s probably pretty obvious. Even so, it’s easy to forget just how pervasive the refactoring is and is likely to be. Let’s survey some examples first, and then speculate about consequences. Read more
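For illustration only (every name below is invented), here is the coupling trade-off in miniature:

```python
# Illustrative sketch of the coupling trade-off; all names are invented.
from typing import Iterable, Protocol

# Tightly coupled: the report reaches straight into the storage layout
# (a list of (id, amount) tuples). Trivial to optimize, brittle to change.
STORAGE = [(1, 100.0), (2, 250.0)]

def revenue_tight() -> float:
    return sum(row[1] for row in STORAGE)  # breaks if the tuple layout changes

# Loosely coupled: the report sees only an agreed interface, so storage
# can be refactored freely, at some cost in optimization opportunity.
class OrderSource(Protocol):
    def amounts(self) -> Iterable[float]: ...

class TupleStore:
    def amounts(self) -> Iterable[float]:
        return (row[1] for row in STORAGE)

def revenue_loose(source: OrderSource) -> float:
    return sum(source.amounts())

print(revenue_tight(), revenue_loose(TupleStore()))
```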
More notes on predictive modeling
My July 2 comments on predictive modeling were far from my best work. Let’s try again.
1. Predictive analytics has two very different aspects.
Developing models, aka “modeling”:
- Is a big part of investigative analytics.
- May or may not be difficult to parallelize and/or integrate into an analytic RDBMS.
- May or may not require use of your whole database.
- Generally is done by humans.
- Often is done by people with special skills, e.g. “statisticians” or “data scientists”.
More precisely, some modeling algorithms are straightforward to parallelize and/or integrate into RDBMS, but many are not.
Using models, most commonly:
- Is done by machines …
- … that “score” data according to the models.
- May be done in batch or at run-time.
- Is embarrassingly parallel (see the sketch at the end of this post), and is much more commonly integrated into analytic RDBMS than modeling is.
2. Some people think that all a modeler needs are a few basic algorithms. (That’s why, for example, analytic RDBMS vendors are proud of integrating a few specific modeling routines.) Other people think that’s ridiculous. Depending on use case, either group can be right.
3. If adoption of DBMS-integrated modeling is high, I haven’t noticed.
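To illustrate the “embarrassingly parallel” point, here is a toy sketch (hypothetical coefficients, made-up rows) of scoring with an already-built logistic regression model. Once the coefficients are fixed, each row is scored independently, so the data can be split across workers, or across an MPP database’s nodes, arbitrarily:

```python
# Toy sketch: scoring is embarrassingly parallel because each row
# is scored independently against fixed model coefficients.
import math
from multiprocessing import Pool

# Hypothetical coefficients from an already-built logistic regression model.
COEFFS = [0.8, -1.2, 0.05]
INTERCEPT = -0.3

def score_row(row):
    z = INTERCEPT + sum(c * x for c, x in zip(COEFFS, row))
    return 1.0 / (1.0 + math.exp(-z))  # probability for this row alone

if __name__ == "__main__":
    rows = [[1.0, 0.2, 30.0], [0.0, 1.5, 42.0], [1.0, 0.9, 55.0]]
    with Pool(processes=2) as pool:
        scores = pool.map(score_row, rows)  # no coordination between rows
    print(scores)
```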
Categories: Ayasdi, Data warehousing, Hadoop, Health care, IBM and DB2, KXEN, Predictive modeling and advanced analytics, SAS Institute | 6 Comments |
Privacy and data use — the problem of chilling effects
This is the second of a two-part series on the theory of information privacy. In the first post, I review the theory to date, and outline what I regard as a huge and crucial gap. In the second post, I try to fill that chasm.
The first post in this two-part series:
- Reviewed the privacy theory of the past 123 years.
- Declared it inadequate to address today’s surveillance and information privacy issues.
- Suggested a reason for its failure — the harms of privacy violation are too rarely spelled out in concrete terms, making it impractical to do even implicit cost-benefit analyses.
Actually, it’s easy to name specific harms from privacy loss. A list might start:
- Being investigated (rightly or wrongly) for a crime, with all the hassle and legal risk that ensues.
- Being discriminated against for employment, credit, or insurance.
- Being embarrassed publicly, or discriminated against socially.
- Being bullied or stalked by deplorable private-citizen acquaintances.
- Being put on the no-fly list.
I expect that few people in, say, the United States will suffer these harms in the near future, at least not the more severe ones. However, the story gets worse, because we don’t know which disclosures will have which adverse effects. For example, Read more
Categories: Surveillance and privacy | 12 Comments |
Privacy and data use — a gap in the theory
This is the first of a two-part series on the theory of information privacy. In the first post, I review the theory to date, and outline what I regard as a huge and crucial gap. In the second post, I try to fill that chasm.
Discussion of information privacy has exploded, spurred by increasing awareness of data’s collection and use. Confusion reigns, however, for reasons such as:
- Data is often collected behind a veil of secrecy. That’s top-of-mind these days, in light of the Snowden/Greenwald revelations.
- Nobody understands all of the various technologies involved. Telecom experts don’t know a lot about data management and analysis, and vice versa, while political reporters don’t understand much about technology at all. I think numerous reporting errors have resulted.
- There’s no successful theory explaining when privacy should and shouldn’t be preserved. To put it quite colloquially:
- Big Brother is watching you …
- … and he’s scary.
- Privacy theory focuses on the “watching” part …
- … but the “scary” part is what really needs to be addressed.
Let’s address the last point. Read more
Categories: Surveillance and privacy | 4 Comments |
Notes and comments, July 2, 2013
I’m not having a productive week, part of the reason being a hard drive crash that took out early drafts of what were to be last weekend’s blog posts. Now I’m operating from a laptop, rather than my preferred dual-monitor set-up. So please pardon me if I’m concise even by comparison to my usual standards.
- My recent posts based on surveillance news have been partly superseded by – well, by more news. Some of that news, along with some good discussion, may be found in the comment threads.
- The same goes for my recent Hadoop posts.
- The replay for my recent webinar on real-time analytics is now available. My part ran <25 minutes.
- One of my numerous clients using or considering a “real-time analytics” positioning is Sqrrl, the company behind the NoSQL DBMS Accumulo. Last month, Derrick Harris reported on a remarkable Accumulo success story – multiple US intelligence instances managing 10s of petabytes each, and supporting a variety of analytic (I think mainly query/visualization) approaches.
- Several sources have told me that MemSQL’s Zynga sale is (in part) for Membase replacement. This is noteworthy because Zynga was the original pay-for-some-of-the-development Membase customer.
- More generally, the buzz out of Couchbase is distressing. Ex-employees berate the place; job-seekers check around and then decide not to go there; rivals tell me of resumes coming out in droves. Yes, there’s always some of that, even at obviously prospering companies, but this feels like more than the inevitable low-level buzz one hears anywhere.
- I think the predictive modeling state of the art has become:
- Cluster in some way.
- Model separately on each cluster. (A sketch of this recipe follows the footnotes at the end of this post.)
- And if you still want to do something that looks like a regression – linear or otherwise – then you might want to use a tool that lets you shovel training data in WITHOUT a whole lot of preparation* and receive a model back out. Even if you don’t accept that as your final model, it can at least be a great guide to feature selection (in the statistical sense of the phrase) and the like.
- Champion/challenger model testing is also a good idea, at least if you’re in some kind of personalization/recommendation space, and have enough traffic to test like that.**
- Most companies have significant turnover after being acquired, perhaps after a “golden handcuff” period. Vertica is no longer an exception.
- Speaking of my clients at HP Vertica – they’ve done a questionable job of communicating that they’re willing to price their product quite reasonably. (But at least they allowed me to write about $2K/terabyte for hardware/software combined.)
- I’m hearing a little more Amazon Redshift buzz than I expected to. Just a little.
- StreamBase was bought by TIBCO. The rumor says $40 million.
*Basic and unavoidable ETL (Extract/Transform/Load) of course excepted.
**I could call that ABC (Always Be Comparing) or ABT (Always Be Testing), but they each sound like – well, like The Glove and the Lions.
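As promised above, here is a sketch of the cluster-then-model recipe. Everything in it is synthetic and the parameter choices are arbitrary; it assumes scikit-learn and numpy are available:

```python
# Sketch of "cluster in some way, then model separately on each cluster".
# Synthetic data and arbitrary parameters throughout.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))  # stand-in feature matrix
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=1000)

# Step 1: cluster in some way (k-means here, purely for illustration).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Step 2: fit a separate model on each cluster.
models = {}
for k in np.unique(kmeans.labels_):
    mask = kmeans.labels_ == k
    models[int(k)] = LinearRegression().fit(X[mask], y[mask])

# Scoring routes each new row through its own cluster's model.
def predict(rows: np.ndarray) -> np.ndarray:
    clusters = kmeans.predict(rows)
    return np.array([models[int(k)].predict(row[None, :])[0]
                     for k, row in zip(clusters, rows)])

print(predict(rng.normal(size=(5, 4))))
```

In a real pipeline you’d pick the clustering method and the per-cluster model family to suit the data, and validate the whole thing, champion/challenger style, before trusting it.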