Discussion of Facebook’s data management technologies. Related subjects include:
- Cassandra, which was originated at Facebook
- Hadoop, one of whose largest supporters is Facebook
- Google’s data management technologies
- Amazon’s data management technologies
More on NoSQL and HVSP (or OLRP)
Since posting last Wednesday morning that I’m looking into NoSQL and HVSP, I’ve had a lot of conversations, including with (among others):
- Dwight Merriman of 10gen (MongoDB)
- Damien Katz of Couchio (CouchDB)
- Matt Pfeil of Riptano (Cassandra)
- Todd Lipcon of Cloudera (HBase committer)
- Tony Falco of Basho (Riak)
- John Busch of Schooner
- Ori Herrnstadt of Akiban
| Categories: Akiban, Basho and Riak, Cache, Cassandra, Cloudera, Clustrix, CouchDB, Facebook, HBase, Hadoop, MySQL, NoSQL, OLTP, Object, Open source, Parallelization, Riptano, Schooner, Theory and architecture, Tokutek, memcached | Leave a Comment |
I’m collecting data points on NoSQL and HVSP adoption
I was asked to do a magazine article on NoSQL, where by “NoSQL” is meant “whatever they talk about at NoSQL conferences.” By now the number of publications planning to run the article is up to 2, the deadline is next week and, crucially, it has been agreed that I may talk about HVSP in general, NoSQL and SQL alike.
It also is understood that, realistically, I can’t be expected to know and mention the very latest news for all the many products in the categories. Even so, I think this would be a fine time to check just where NoSQL and HVSP adoption stands. Here is most of what I know, or links to same; it would be great if you guys would contribute additional data in the comment thread.
In the NoSQL area: Read more
Links and observations
I’m back from a trip to the SF Bay area, with a lot of writing ahead of me. I’ll dive in with some quick comments here, then write at greater length about some of these points when I can. From my trip: Read more
Nested data structures keep coming up, especially for log files
Nested data structures have come up several times now, almost always in the context of log files.
- Google has published about a project called Dremel. Per Tasso Argyros, one of Dremel’s key concepts is nested data structures.
- Those arrays that the XLDB/SciDB folks keep talking about are meant to be nested data structures. Scientific data is of course log-oriented. eBay was very interested in that project too.
- Facebook’s log files have a big nested data structure flavor.
I don’t have a grasp yet on what exactly is happening here, but it’s something.
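To make "nested data structure" concrete, here is a minimal Python sketch of what a nested log record might look like, and how it could be flattened into path/value pairs for analysis. The record layout and field names are purely hypothetical; they are not taken from Dremel, SciDB, or Facebook's actual logs.

```python
from typing import Any, Iterator, Tuple

# Hypothetical log record: the field names are illustrative assumptions only.
record = {
    "user_id": 42,
    "session": {
        "browser": "Firefox",
        "events": [
            {"type": "click", "ms": 120},
            {"type": "scroll", "ms": 340},
        ],
    },
}


def flatten(value: Any, prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (path, leaf_value) pairs for a nested dict/list structure."""
    if isinstance(value, dict):
        for key, child in value.items():
            # Dotted paths for dict keys, e.g. "session.browser".
            yield from flatten(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            # Bracketed indexes for repeated elements, e.g. "events[1]".
            yield from flatten(child, f"{prefix}[{index}]")
    else:
        yield prefix, value


rows = list(flatten(record))
for path, leaf in rows:
    print(path, "=", leaf)
```

The point of the sketch is only that one log entry naturally carries repeated, variable-length sub-records (here, `events`), which is awkward to express as a single flat relational row.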
| Categories: Facebook, Google, Log analysis, Scientific research, Theory and architecture, eBay | 5 Comments |
dbShards — a lot like an MPP OLTP DBMS based on MySQL or PostgreSQL
I talked yesterday w/ Cory Isaacson, who runs CodeFutures, makers of dbShards. dbShards is a software layer that turns an ordinary DBMS (currently MySQL or PostgreSQL) into an MPP shared-nothing ACID-compliant OLTP DBMS. Technical highlights included: Read more
| Categories: Facebook, MySQL, OLTP, Parallelization, PostgreSQL, dbShards and CodeFutures | 3 Comments |
Riptano, and Cassandra adoption
Tonight’s Cassandra technology post got plenty long enough on its own, so I’m separating out business and adoption issues here. For starters, known Cassandra users include:
- Facebook, which has said it has 150 or so Cassandra nodes (but see below)
- Twitter, which has said it has 45 or so Cassandra nodes
- Rackspace, which used to be Jonathan Ellis’ employer, and now is backing Cassandra company Riptano
- Digg, which along with Twitter and Rackspace was one of the three major users helping advance the Cassandra project
- OpenX, SimpleGeo, and Digital Reasoning, which Jonathan cited as production users in March
- Cloudkick, as noted and linked in my other post
- Two customers Riptano named at launch (but I’ve forgotten who they were*)
Fetlife, Meebo, and others seem to at least have a healthy interest in Cassandra, based on their level of involvement in a forthcoming Cassandra Summit. That said, the @Fetlife tweetstream features numerous yelps of pain, and I don’t mean the recreational kind. Read more
| Categories: Cassandra, Facebook, Market share, NoSQL, Open source, Parallelization, Pricing, Riptano, Specific users | 3 Comments |
Cassandra technical overview
Back in March, I talked with Jonathan Ellis of Rackspace, who runs the Apache Cassandra project. I started drafting a blog post then, but never put it up. Then Jonathan cofounded Riptano, a company to commercialize Cassandra, and so I talked with him again in May. Well, I’m finally finding time to clear my Cassandra/Riptano backlog. I’ll cover the more technical parts below, and the more business- or usage-oriented ones in a companion Cassandra/Riptano post.
Jonathan’s core claims for Cassandra include:
- Cassandra is shared-nothing.
- Cassandra has good approaches to replication and partitioning, right out of the box.
- In particular, Cassandra is good for use cases that distribute a database around the world and want to access it at “local” latencies. (Indeed, Jonathan asserts that non-local replication is a significant non-big-data Cassandra use case.)
- Cassandra’s scale-out is application-transparent, unlike sharded MySQL’s.
- Cassandra is fast at both appends and range queries, which would be hard to accomplish in a pure key-value store.
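The last claim above is easier to see with a toy example. The following Python sketch is not Cassandra's implementation or API; it is a hypothetical in-memory store (`SortedStore` is an invented name) that keeps keys in sorted order, which is the general property that makes range queries cheap. A pure hash-based key-value store has no key ordering, so the same query would have to examine every key.

```python
import bisect


class SortedStore:
    """Toy in-memory store that keeps keys sorted so range scans are cheap.

    Illustrative assumption only; this is not how Cassandra is implemented.
    """

    def __init__(self):
        self._keys = []    # keys kept in sorted order
        self._values = {}  # key -> value lookup

    def put(self, key, value):
        # Insert the key at its sorted position the first time it is seen.
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def range(self, start, end):
        """Return (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


store = SortedStore()
for day in ("2010-06-03", "2010-06-01", "2010-06-02", "2010-06-05"):
    store.put(day, f"log entry for {day}")

hits = store.range("2010-06-02", "2010-06-05")
print([key for key, _ in hits])
```

Here the range lookup uses two binary searches and a slice, rather than a scan over all stored keys.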
In general, Jonathan positions Cassandra as being best-suited to handle a small number of operations at high volume, throughput, and speed. The rest of what you do, as far as he’s concerned, may well belong in a more traditional SQL DBMS. Read more
| Categories: Amazon and its cloud, Cassandra, Facebook, Google, Log analysis, NoSQL, Open source, Parallelization, Riptano | 4 Comments |
The most important part of the “social graph” is neither social nor a graph
“Social graph” is a highly misleading term, and so is “social network analysis.” By this I mean:
There’s something akin to “social graphs” and “social network analysis” that is more or less worthy of all the current hype – but graphs and network analysis are only a minor part of the whole story.
In particular, the most important parts of the Facebook “social graph” are neither social nor a graph. Rather, what’s really important is an aggregate Profile of Revealed Preferences, of which person-to-person connections or other things best modeled by a graph play only a small part.
| Categories: Analytic technologies, Facebook, Games and virtual worlds, Liberty and privacy, RDF and graphs, Web analytics | 7 Comments |
Information found in public-facing social networks
Here are some examples illustrating two recent themes of mine, namely:
- Easily-available information reveals all sorts of things about us.
- Graph-based analysis is on the rise.
Pete Warden scraped all of Facebook’s social graph (at least for the United States), and put up a really interesting-looking visualization of same. Facebook’s lawyers came down on him, and he quickly agreed to destroy the data he’d scraped, but also published ideas on how other people could duplicate his work.
Warden has since given an interview in which he outlines some of the things researchers hoped to do with this data: Read more
| Categories: Analytic technologies, Facebook, Liberty and privacy, RDF and graphs | 1 Comment |
Issues in scientific data management
In the opinion of the leaders of the XLDB and SciDB efforts, key requirements for scientific data management include:
- A data model based on multidimensional arrays, not sets of tuples
- A storage model based on versions and not update in place
- Built-in support for provenance (lineage), workflows, and uncertainty
- Scalability to 100s of petabytes and 1,000s of nodes with high degrees of tolerance to failures
- Support for “external” data objects so that data sets can be queried and manipulated without ever having to be loaded into the database
- Open source in order to foster a community of contributors and to ensure that data is never “locked up”, a critical requirement for scientists
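As a rough illustration of the first two requirements in the list above (an array data model, and versions rather than update in place), here is a minimal Python sketch. `VersionedArray` is a hypothetical class written for this example; it is not SciDB's design or API, and it ignores scale, provenance, and uncertainty entirely.

```python
class VersionedArray:
    """Toy 2-D array in which every write creates a new version.

    Illustrative assumption only: old cell values are never overwritten,
    so any earlier version can still be read back.
    """

    def __init__(self, rows, cols, fill=0.0):
        self._shape = (rows, cols)
        self._fill = fill
        self._versions = [{}]  # version 0 is the empty array

    def write(self, cells):
        """Store only the changed cells as a new version; return its number."""
        for row, col in cells:
            self._check(row, col)
        self._versions.append(dict(cells))
        return len(self._versions) - 1

    def read(self, row, col, version=None):
        """Read a cell as of the given version (default: latest)."""
        self._check(row, col)
        if version is None:
            version = len(self._versions) - 1
        # Walk backwards from the requested version to the newest delta
        # that touched this cell.
        for delta in reversed(self._versions[: version + 1]):
            if (row, col) in delta:
                return delta[(row, col)]
        return self._fill

    def _check(self, row, col):
        if not (0 <= row < self._shape[0] and 0 <= col < self._shape[1]):
            raise IndexError(f"cell ({row}, {col}) is outside {self._shape}")


grid = VersionedArray(2, 2)
v1 = grid.write({(0, 0): 1.5})
v2 = grid.write({(0, 0): 2.5, (1, 1): 9.0})

latest = grid.read(0, 0)
as_of_v1 = grid.read(0, 0, version=v1)
print(latest, as_of_v1)
```

Because each write is stored as a separate delta, a reading taken at version 1 stays reproducible after later corrections arrive.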
However: Read more
