Mike Driscoll and his Metamarkets colleagues organized a bit of a bash Thursday night. Among the many folks I chatted with were Ken Rudin of Zynga, Sam Shah of LinkedIn, and D. J. Patil, late of LinkedIn. I now know more about analytic data management at Zynga and LinkedIn, plus some bonus stuff on LinkedIn’s People You May Know application.
It’s blindingly obvious that Zynga is one of Vertica’s petabyte-scale customers, given that Zynga sends 5 TB/day of data into Vertica, and keeps that data for about a year. (Zynga may retain even more data going forward; in particular, Zynga regrets ever having thrown out the first month of data for any game it’s tried to launch.) This is game actions, for the most part, rather than log files; true logs generally go into Splunk.
I don’t know whether the missing data is completely thrown away, or just stashed on inaccessible tapes somewhere.
I found two aspects of the Zynga story particularly interesting. First, those 5 TB/day are going straight into Vertica (from, I presume, memcached/Membase/Couchbase), as Zynga decided that sending the data to some kind of log first was more trouble than it was worth. Second, there’s Zynga’s approach to analytic database design. Highlights of that include:
- Data is divided into two parts. One part has a pretty ordinary schema; the other is just stored as a huge list of name-value pairs. (This is much like eBay’s approach with its Teradata-based Singularity, except that eBay puts the name-value pairs into long character strings.) About half the data is in each part, but I don’t think that’s by deliberate choice.
- Zynga adds data into the real schema when it’s clear it will be needed for a while. This isn’t a matter of query volumes, for the most part; rather, it’s when Zynga’s tests (e.g. of new games?) have determined that the data will keep being collected and used for a while.
- Zynga only adds columns to its analytic database; it never goes through the more complex process of deleting them.
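The two-part design above can be sketched in a few lines. This is purely illustrative, not Zynga’s actual pipeline; the field names and event shape are invented, and the only point is the split between a fixed relational schema and a generic name-value overflow:

```python
# Hypothetical sketch of the two-part design: fields with a stable,
# known schema go into ordinary columns, while everything else is
# kept as generic name-value pairs. All names here are invented.

KNOWN_COLUMNS = {"user_id", "game", "action", "ts"}  # the "real" schema

def split_event(event):
    """Split one game-action event into (typed row, name-value pairs)."""
    row = {k: v for k, v in event.items() if k in KNOWN_COLUMNS}
    pairs = [(k, v) for k, v in event.items() if k not in KNOWN_COLUMNS]
    return row, pairs

event = {"user_id": 42, "game": "farm", "action": "harvest",
         "ts": "2011-06-17T01:23:45", "crop": "corn", "xp_gained": 7}
row, pairs = split_event(event)
# row   would load into the ordinary relational table
# pairs would load into a big (event_id, name, value) table
```

Promoting a field into the real schema, per the second bullet, then amounts to adding it to the known-column set (and the table) once it has proven durable; nothing ever moves the other way, which matches the add-only column policy.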
Just as Zynga is one of Vertica’s flagship accounts, LinkedIn is one of Aster Data’s. Specifically, before leaving LinkedIn for Aster, Jonathan Goldman built LinkedIn’s People You May Know feature in Aster nCluster. This was long ago, and I’m not sure how sophisticated his use of SQL and MapReduce would be in today’s terms; for example, I was told he didn’t use “nPath or anything like that.” (Edit: See the comments below for clarifications from Jonathan.) Anyhow, LinkedIn has replaced Aster for PYMK with Hadoop, and in my opinion is getting much better results.
That, from an Aster standpoint, is the bad news. The good news is that LinkedIn is happily using Aster nCluster for several other applications; LinkedIn folks don’t seem to regret throwing out* Greenplum for Aster; and they also seem to have a very high opinion of Jonathan and his work while he was there.
*And this time that is indeed the phrase that was used.
One thing that astonished me is that LinkedIn PYMK is based only on data innate to LinkedIn (as opposed to imported email addresses, the results of web crawls, and so on). Given that, I am at a loss to explain how it suggested a couple of old friends to whom I have no discernible chain of connection. Yes, we were at Harvard at the same time, but if that’s all it was, there would be a huge number of false positives I’m not actually seeing.