Discussion of Facebook’s data management technologies. Related subjects include:
This post is the second of a series. The first one was an overview of privacy dangers, replete with specific examples of kinds of data that are stored for good reasons, but can also be repurposed for more questionable uses. More on this subject may be found in my August, 2010 post Big Data is Watching You!
There are two technology trends driving electronic privacy threats. Taken together, these trends raise scenarios such as the following:
- Your web surfing behavior indicates you’re a sports car buff, and you further like to look at pictures of scantily-clad young women. A number of your Facebook friends are single women. As a result, you’re deemed a risk to have a mid-life crisis and divorce your wife, thus increasing the interest rate you have to pay when refinancing your house.
- Your cell phone GPS indicates that you drive everywhere, instead of walking. There is no evidence of you pursuing fitness activities, but forum posting activity suggests you’re highly interested in several TV series. Your credit card bills show that your taste in restaurant food tends to the fatty. Your online photos make you look fairly obese, and a couple have ashtrays in them. As a result, you’re judged a high risk of heart attack, and your medical insurance rates are jacked up accordingly.
- You did actually have that mid-life crisis and get divorced. At the child-custody hearing, your ex-spouse’s lawyer quotes a study showing that football-loving upper income Republicans are 27% more likely to beat their children than yoga-class-attending moderate Democrats, and the probability goes up another 8% if they ever bought a jersey featuring a defensive lineman. What’s more, several of the more influential people in your network of friends also fit angry-male patterns, taking the probability of abuse up another 13%. Because of the sound statistics behind such analyses, the judge listens.
Not all these stories are quite possible today, but they aren’t far off either.
|Categories: Facebook, Predictive modeling and advanced analytics, Surveillance and privacy, Telecommunications, Web analytics||4 Comments|
This post is the first of a series. The second one delves into the technology behind the most serious electronic privacy threats.
The privacy discussion has gotten more active, and more complicated as well. A year ago, I still struggled to get people to pay attention to privacy concerns at all, at least in the United States, with my first public breakthrough coming at the end of January. But much has changed since then.
On the commercial side, Facebook modified its privacy policies, garnering great press attention and an intense user backlash, leading to a quick partial retreat. The Wall Street Journal then launched a long series of articles — 13 so far — recounting multiple kinds of privacy threats. Other media joined in, from Forbes to CNet. Various forms of US government rule-making to inhibit advertising-related tracking have been proposed as an apparent result.
In the US, the government had a lively year as well. The Transportation Security Administration (TSA) rolled out what have been dubbed “porn scanners,” and backed them up with “enhanced patdowns.” For somebody who is, for example, female, young, a sex abuse survivor, and/or a follower of certain religions, those can be highly unpleasant, if not traumatic. Meanwhile, the Wikileaks/Cablegate events have spawned a government reaction whose scope is only beginning to be seen. A couple of “highlights” so far are some very nasty laptop seizures, and the recent demand for information on over 600,000 Twitter accounts. (Christopher Soghoian provided a detailed, nuanced legal analysis of same.)
At this point, it’s fair to say there are at least six different kinds of legitimate privacy fear. Read more
|Categories: Analytic technologies, Facebook, GIS and geospatial, Health care, Surveillance and privacy, Telecommunications, Web analytics||6 Comments|
Since posting last Wednesday morning that I’m looking into NoSQL and HVSP, I’ve had a lot of conversations, including with (among others):
- Dwight Merriman of 10gen (MongoDB)
- Damien Katz of Couchio (CouchDB)
- Matt Pfeil of Riptano (Cassandra)
- Todd Lipcon of Cloudera (HBase committer)
- Tony Falco of Basho (Riak)
- John Busch of Schooner
- Ori Herrnstadt of Akiban
I was asked to do a magazine article on NoSQL, where by “NoSQL” is meant “whatever they talk about at NoSQL conferences.” By now the number of publications planning to run the article is up to 2, the deadline is next week and, crucially, it has been agreed that I may talk about HVSP in general, NoSQL and SQL alike.
It also is understood that, realistically, I can’t be expected to know and mention the very latest news for all the many products in the categories. Even so, I think this would be fine time to check just where NoSQL and HVSP adoption stand. Here is most of what I know, or links to same; it would be great if you guys would contribute additional data in the comment thread.
In the NoSQL area: Read more
|Categories: Akiban, Cassandra, Clustering, Clustrix, Couchbase, dbShards and CodeFutures, Facebook, Groovy Corporation, NewSQL, NoSQL, OLTP, Parallelization, ScaleDB, Specific users, VoltDB and H-Store, Zynga||17 Comments|
I’m back from a trip to the SF Bay area, with a lot of writing ahead of me. I’ll dive in with some quick comments here, then write at greater length about some of these points when I can. From my trip: Read more
|Categories: Analytic technologies, Aster Data, Calpont, Cassandra, Couchbase, Data warehouse appliances, Data warehousing, EMC, Exadata, Facebook, Greenplum, HP and Neoview, Kickfire, NoSQL, OLTP, ParAccel, Sybase, XtremeData||1 Comment|
Nested data structures have come up several times now, almost always in the context of log files.
- Google has published about a project called Dremel. Per Tasso Agyros, one of Dremel’s key concepts is nested data structures.
- Those arrays that the XLDB/SciDB folks keep talking about are meant to be nested data structures. Scientific data is of course log-oriented. eBay was very interested in that project too.
- Facebook’s log files have a big nested data structure flavor.
I don’t have a grasp yet on what exactly is happening here, but it’s something.
|Categories: eBay, Facebook, Google, Log analysis, Scientific research, Theory and architecture||7 Comments|
I talked yesterday w/ Cory Isaacson, who runs CodeFutures, makers of dbShards. dbShards is a software layer that turns an ordinary DBMS (currently MySQL or PostgreSQL) into an MPP shared-nothing ACID-compliant OLTP DBMS. Technical highlights included: Read more
Tonight’s Cassandra technology post got plenty long enough on its own, so I’m separating out business and adoption issues here. For starters, known Cassandra users include:
- Facebook, which has said it has 150 or so Cassandra nodes (but see below)
- Twitter, which has said it has 45 or so Cassandra nodes
- Rackspace, which used to be Jonathan Ellis’ employer, and now is backing Cassandra company Riptano
- Digg, which along with Twitter and Rackspace was one of the three major users helping advance the Cassandra project
- OpenX, Simple Geo, Digital Reasoning, who Jonathan cited as production users in March
- Cloudkick, as noted and linked in my other post
- Two customers Riptano named at launch (but I’ve forgotten who they were*)
Fetlife, Meebo, and others seem to at least have a healthy interest in Cassandra, based on their level of involvement in a forthcoming Cassandra Summit. That said, the @Fetlife tweetstream features numerous yelps of pain, and I don’t mean the recreational kind. Read more
|Categories: Cassandra, DataStax, Facebook, Market share and customer counts, NoSQL, Open source, Parallelization, Pricing, Specific users||5 Comments|
Back in March, I talked with Jonathan Ellis of Rackspace, who runs the Apache Cassandra project. I started drafting a blog post then, but never put it up. Then Jonathan cofounded Riptano, a company to commercialize Cassandra, and so I talked with him again in May. Well, I’m finally finding time to clear my Cassandra/Riptano backlog. I’ll cover the more technical parts below, and the more business- or usage-oriented ones in a companion Cassandra/Riptano post.
Jonathan’s core claims for Cassandra include:
- Cassandra is shared-nothing.
- Cassandra has good approaches to replication and partitioning, right out of the box.
- In particular, Cassandra is good for use cases that distribute a database around the world and want to access it at “local” latencies. (Indeed, Jonathan asserts that non-local replication is a significant non-big-data Cassandra use case.)
- Cassandra’s scale-out is application-transparent, unlike sharded MySQL’s.
- Cassandra is fast at both appends and range queries, which would be hard to accomplish in a pure key-value store.
In general, Jonathan positions Cassandra as being best-suited to handle a small number of operations at high volume, throughput, and speed. The rest of what you do, as far as he’s concerned, may well belong in a more traditional SQL DBMS. Read more
|Categories: Amazon and its cloud, Cassandra, DataStax, Facebook, Google, Log analysis, NoSQL, Open source, Parallelization||4 Comments|
“Social graph” is a highly misleading term, and so is “social network analysis.” By this I mean:
There’s something akin to “social graphs” and “social network analysis” that is more or less worthy of all the current hype – but graphs and network analysis are only a minor part of the whole story.
In particular, the most important parts of the Facebook “social graph” are neither social nor a graph. Rather, what’s really important is an aggregate Profile of Revealed Preferences, of which person-to-person connections or other things best modeled by a graph play only a small part.
|Categories: Analytic technologies, Facebook, Games and virtual worlds, RDF and graphs, Surveillance and privacy, Web analytics||13 Comments|