September 5, 2011

Data management at Zynga and LinkedIn

Mike Driscoll and his Metamarkets colleagues organized a bit of a bash Thursday night. Among the many folks I chatted with were Ken Rudin of Zynga, Sam Shah of LinkedIn, and D. J. Patil, late of LinkedIn. I now know more about analytic data management at Zynga and LinkedIn, plus some bonus stuff on LinkedIn’s People You May Know application. 🙂

It’s blindingly obvious that Zynga is one of Vertica’s petabyte-scale customers, given that Zynga sends 5 TB/day of data into Vertica, and keeps that data for about a year. (Zynga may retain even more data going forward; in particular, Zynga regrets ever having thrown out the first month of data for any game it’s tried to launch.) This is game actions, for the most part, rather than log files; true logs generally go into Splunk.
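
As a quick back-of-the-envelope check on the petabyte claim, the sketch below multiplies the two figures quoted above. It assumes "about a year" means 365 days and uses decimal units (1 PB = 1,000 TB); it ignores compression and replication, neither of which is specified here.

```python
# Back-of-the-envelope estimate of raw data retained.
# Assumptions (not from Zynga): 365-day retention, decimal units,
# no compression or replication taken into account.
TB_PER_DAY = 5
RETENTION_DAYS = 365
TB_PER_PB = 1000


def retained_petabytes(tb_per_day: float, retention_days: int) -> float:
    """Return the raw volume retained, in petabytes."""
    return tb_per_day * retention_days / TB_PER_PB


if __name__ == "__main__":
    print(f"{retained_petabytes(TB_PER_DAY, RETENTION_DAYS):.3f} PB")
```

Under those assumptions the raw ingest alone comes to somewhat under 2 PB per year.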

I don’t know whether the missing data is completely thrown away, or just stashed on inaccessible tapes somewhere.

I found two aspects of the Zynga story particularly interesting. First, those 5 TB/day are going straight into Vertica (from, I presume, memcached/Membase/Couchbase), as Zynga decided that sending the data to some kind of log first was more trouble than it’s worth. Second, there’s Zynga’s approach to analytic database design. Highlights of that include:

Data is divided into two parts. One part has a pretty ordinary schema; the other is just stored as a huge list of name-value pairs. (This is much like eBay‘s approach with its Teradata-based Singularity, except that eBay puts the name-value pairs into long character strings.) About half the data is in each part, but I don’t think that’s by deliberate choice.
Zynga adds data into the real schema when it’s clear it will be needed for a while. This isn’t a matter of query volumes, for the most part; rather, it’s when Zynga’s tests (e.g. of new games?) have determined that the data will keep being collected and used for a while.
Zynga only adds columns to its analytic database; it never goes through the more complex process of deleting them.
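
The design described in the three points above can be sketched roughly as follows. This is only an illustration: it uses SQLite rather than Vertica, and every table, column, and function name (`events`, `event_attrs`, `log_event`, `promote_attr`) is hypothetical rather than anything Zynga has described.

```python
import sqlite3


def create_store(conn: sqlite3.Connection) -> None:
    """Create the two parts: an ordinary schema plus a name-value list."""
    conn.execute(
        "CREATE TABLE events ("
        " event_id INTEGER PRIMARY KEY,"
        " game TEXT NOT NULL,"
        " player_id INTEGER NOT NULL,"
        " action TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE event_attrs ("
        " event_id INTEGER NOT NULL REFERENCES events(event_id),"
        " name TEXT NOT NULL,"
        " value TEXT)"
    )


def log_event(conn, game, player_id, action, **attrs):
    """Store fixed fields in the real schema, everything else as pairs."""
    cur = conn.execute(
        "INSERT INTO events (game, player_id, action) VALUES (?, ?, ?)",
        (game, player_id, action),
    )
    event_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO event_attrs (event_id, name, value) VALUES (?, ?, ?)",
        [(event_id, name, str(value)) for name, value in attrs.items()],
    )
    return event_id


def promote_attr(conn, name):
    """Add a column for an attribute that will be needed for a while.

    Columns are only ever added, never dropped. Existing values are
    backfilled from the name-value table.
    """
    if not name.isidentifier():
        raise ValueError(f"unsafe column name: {name!r}")
    conn.execute(f'ALTER TABLE events ADD COLUMN "{name}" TEXT')
    conn.execute(
        f'UPDATE events SET "{name}" = ('
        " SELECT value FROM event_attrs AS a"
        " WHERE a.event_id = events.event_id AND a.name = ?)",
        (name,),
    )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_store(conn)
    log_event(conn, "farmville", 42, "send_gift", gift="horse")
    promote_attr(conn, "gift")
    print(conn.execute("SELECT action, gift FROM events").fetchall())
```

In this sketch, promoting an attribute leaves the original name-value rows in place, and `log_event` keeps writing new attributes as pairs; a fuller version would route promoted attributes straight to their columns.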

Just as Zynga is one of Vertica’s flagship accounts, LinkedIn is one of Aster Data’s. Specifically, before leaving LinkedIn for Aster, Jonathan Goldman built LinkedIn’s People You May Know feature in Aster nCluster. This was long ago, and I’m not sure how sophisticated his use of SQL and MapReduce would be in today’s terms; for example, I was told he didn’t use “nPath or anything like that.” (Edit: See the comments below for clarifications from Jonathan.) Anyhow, LinkedIn has replaced Aster for PYMK with Hadoop, and in my opinion is getting much better results.

That, from an Aster standpoint, is the bad news. The good news is that LinkedIn is happily using Aster nCluster for several other applications; LinkedIn folks don’t seem to regret throwing out* Greenplum for Aster; and they also seem to have a very high opinion of Jonathan and his work while he was there.

*And this time that is indeed the phrase that was used. 😉

One thing that astonished me is that LinkedIn PYMK is based only on data innate to LinkedIn (as opposed to imported email addresses, the results of web crawls, and so on). Given that, I am at a loss to explain how it suggested a couple of old friends, to whom I have no discernible chain of connection. Yes, we were at Harvard at the same time, but if that’s all it was, there would be a huge number of false positives I’m not actually seeing.

Categories: Aster Data, Couchbase, Data models and architecture, Games and virtual worlds, Greenplum, Hadoop, Petabyte-scale data management, Specific users, Vertica Systems, Zynga


Comments

27 Responses to “Data management at Zynga and LinkedIn”

shrikanth shankar on September 5th, 2011 9:27 am

Curt,
Have you ever visited the profiles of these folks while logged in (either by search or through clicking around on the site)? I think, based on some experiences with PYMK, that LinkedIn is using your click stream to figure out other people you know.
Curt Monash on September 5th, 2011 9:02 pm

Shrikanth,

That theory makes a lot of sense, especially if one expands it to asking whether I ever searched on their names (including times I failed to find their profiles in the search).
Jonathan Goldman on September 5th, 2011 9:26 pm

Hi Curt,

Regarding how PYMK works I really can’t comment on that. Also given the enormous impact PYMK has had on growth and improving network health LinkedIn has now put 4-5 people to work on it and they have likely added many more methods for identifying likely connections which I’m not even aware of.

Aster was critical to the early development of a number of analytics products including PYMK among others. Some of these products made use of the SQL-MR capabilities while others mainly benefited from the MPP capability that Aster provided over their pre-existing data warehouse environment. The Aster platform is a powerful one in that it enables the analyst to escape from the confines of SQL and perform more sophisticated tasks. When it came to deploying PYMK in production, the engineering team responsible for the task decided to utilize the Hadoop stack that they owned (Aster was owned by a different team) and do batch runs of PYMK there. This does not negate the value of the Aster platform to do SQL and non-SQL discovery analytics in a quick, iterative fashion, and this is very important to LinkedIn’s product innovation.

Thanks,
Jonathan Goldman
Curt Monash on September 5th, 2011 9:36 pm

Jonathan,

Good point. The tool one uses to research algorithms need not be the one one uses to execute them later.

The point often arises in a slightly different context, namely the modeling/scoring dichotomy in straightforward statistics/predictive analytics. But it fits here too.

And why didn’t you use nPath? Was it simply a matter of it not existing yet? 🙂
Jonathan Goldman on September 6th, 2011 1:05 am

When I was at LinkedIn I used nPath extensively actually for analysis work on user engagement (e.g. analyzing clickstream data). I didn’t use nPath for PYMK if that’s what you mean.
Curt Monash on September 6th, 2011 2:49 am

Jonathan,

That makes a lot of sense. Thanks!
Curt Monash on September 6th, 2011 4:52 am

No, it’s not just people I’ve searched on. I just got a semi-personal connection whose last name I didn’t even know (long story), and a wholly personal connection I much doubt I ever searched on.
shrikanth shankar on September 6th, 2011 8:12 am

Interesting. Any chance they may have looked you up on LinkedIn? I assume this is bi-directional: A sees B even if it was B who searched for A.
Curt Monash on September 6th, 2011 8:15 am

That’s another good thought, Shrikanth. It’s definitely possible.

And since I have an unusual name, anybody who is searching for me is searching for ME. (If there’s another person named “Curt Monash” in the world, he hasn’t left a trace on Google that I have found.)
Curt Monash on September 7th, 2011 3:52 pm

Mike,

Well, it starts at memcached. But Zynga was a development leader in making memcached persistent, via technology that was later rolled into Membase/Couchbase.

Perhaps the Couchbase company guys can shed some light on the matter, and/or some Zynga folks.

I’ll delete your dupe comment. Was the site slow on comment response again?
Cloud Database on September 7th, 2011 3:00 pm

Curt, I was led to believe that the bulk of the Zynga db infrastructure was MySQL, not couchbase. I could be wrong but here is some insight from Venu from the inside: http://venublog.com/2010/12/02/mysql-at-scale-zynga-games/

– Mike
Curt Monash on September 7th, 2011 3:55 pm

OK. That’s interesting. Venu’s post contradicts what’s widely believed, and also what I thought I heard from Zynga last week.
James Phillips on September 7th, 2011 4:24 pm

While it is most certainly true that Zynga uses MySQL (I think it is probably safe to say that they use one of just about everything), it is also true that they have over 2,000 servers running Couchbase technology. And they use Membase in concert with Vertica, as you correctly highlighted, Curt.
Curt Monash on September 7th, 2011 4:40 pm

So SOME Zynga games are on Membase, while others are on MySQL? That would make sense, although it wouldn’t entirely excuse Venu from pretty clearly making an erroneous claim.
James Phillips on September 7th, 2011 4:46 pm

I get my data from Cadir and others. I don’t think I know Venu, so I can’t really comment on his assertions. I do find it humorous that ScaleDB is commenting about our business though.
Cloud Database on September 7th, 2011 6:17 pm

@James, I’m commenting about a supposition made by Curt that didn’t jibe with what I heard from someone inside Zynga. I made no comment about your product or business. I even “couched” it by saying I could be wrong. No need to get sensitive. I wish you and your company well.

-Mike
Dali kilani on September 8th, 2011 12:54 am

This post on zynga’s engineering blog should settle the question: http://code.zynga.com/2011/07/building-a-scalable-game-server/

It used to be memcache + Mysql but it migrated to membase later.
Curt Monash on September 8th, 2011 1:41 am

Thank you, kind Zynga person!
Ken Rudin on September 10th, 2011 2:54 am

Curt, nice chatting with you recently. To clear up the confusion about whether we use MySQL/membase or Vertica, the answer is pretty simple: One is used for transactional purposes, and one is used for analytical purposes.

The games write transactional data to MySQL/membase, and the architecture is described here: http://code.zynga.com/2011/07/building-a-scalable-game-server/

By transactional data, I mean data regarding a player’s state in the game, such as what their game board looks like, how many coins they have left, etc. It’s all the info that the game needs when the player logs in so they can continue playing where they left off previously.

The games also separately write analytical data to our analytics platform, and the architecture is described here:
http://code.zynga.com/2011/06/deciding-how-to-store-billions-of-rows-per-day/

This analytical data is primarily event data related to player behaviors. Did the player just send a horse to their friend in Farmville? Log that to the analytics system. Did they visit a neighbor’s city in Cityville? Log that to the analytics system. Etc.

Hope that clears things up.
Curt Monash on September 10th, 2011 3:13 am

Ken,

It was great talking with you too!

Actually, the controversy was about whether you use Membase OR MySQL for the more “transactional” parts. Somebody who self-described as a consultant or something to you claimed it was all MySQL and zero Membase, in a blog post linked in a comment above, and confusion ensued.

Am I correct in guessing that it’s Membase for some games, memcached/MySQL for others?
