October 1, 2009

Yahoo wants to do decapetabyte-scale data warehousing in Hadoop

My old client Mark Tsimelzon moved over to Yahoo after Coral8 was acquired, and I caught up with him last month. He turns out to be running development for a significant portion of Yahoo’s Hadoop effort — everything other than HDFS (Hadoop Distributed File System). Yahoo evidently plans, within a year or so, to get Hadoop to the point that it is managing tens of petabytes of data for Yahoo, with reasonable data warehousing functionality.

Highlights of our visit included:

*I also spoke with a couple of Mark’s Yahoo colleagues, on his introduction, who are being less helpful than he is about clarifying what I am or am not allowed to say for publication. But I will say that I was heartened by the degree of concern they showed for doing the right thing with regard to privacy. I was not as heartened by the concrete ideas — or lack thereof — for making it happen. But frankly, I don’t think it’s a solvable technical problem. Rather, it should be a huge priority on the legal/political front.

We also talked some about Pig, Yahoo’s non-SQL DML (Data Manipulation Language) for Hadoop, which is, however, getting a SQL interface. And we talked about Pig vs. Hive. But I recently heard a rumor that all of that is in flux, so I won’t write it up now.

Mark sent along a couple of interesting slide presentations by a colleague. After some back and forth as to whether I could post them, he suggested I post these links to similar material instead.


6 Responses to “Yahoo wants to do decapetabyte-scale data warehousing in Hadoop”

  1. Jerome Pineau on October 1st, 2009 4:03 pm

    Curt, I’m curious: how much data is Yahoo currently managing total and do they use commercial RDBMSs at all and if so which ones?

  2. Curt Monash on October 1st, 2009 4:17 pm

    As per other threads, it’s clear Yahoo is using quite a bit of Oracle.

    Otherwise, I couldn’t say.

  3. Jerome Pineau on October 1st, 2009 4:23 pm

    Besides Oracle and their internal (Everest) engine you mentioned, I mean (am assuming it’s not all ORCL, is it?)
    The video is really full of interesting stuff. I wonder how many “nodes” they work with, what kind of fabric is used, and how these things are clustered. Isn’t Xen the same VM EC2 is using?

    Fascinating stuff.

  4. Curt Monash on October 1st, 2009 6:37 pm

    Everest = Yahoo’s proprietary Postgres-based column store that is managing petabytes of data.

  5. Mark Tsimelzon on October 1st, 2009 7:19 pm

    Curt, you may want to add that Yahoo’s Hadoop team is growing fast! If anybody wants to join us, we are looking for developers, architects, testers, and managers: http://developer.yahoo.net/blogs/hadoop/2009/10/do_you_have_what_it_takes_to_j.html

  6. Bioinformatics and mythology. You still need to manage the data on December 9th, 2009 11:43 pm

    […] Yahoo wants to do decapetabyte-scale data warehousing in Hadoop (dbms2.com) […]
