July 6, 2011

Petabyte-scale Hadoop clusters (dozens of them)

I recently learned that there are 7 Vertica clusters with a petabyte (or more) each of user data. So I asked around about other petabyte-scale clusters. It turns out that there are several dozen such clusters (at least) running Hadoop.

Cloudera can identify 22 CDH (Cloudera Distribution [of] Hadoop) clusters holding one petabyte or more of user data each, at 16 different organizations. This does not count Facebook or Yahoo, who are huge Hadoop users but not, I gather, running CDH. Meanwhile, Eric Baldeschwieler of Hortonworks tells me that Yahoo’s latest stated figures are:

  * 42,000 nodes
  * 180-200 petabytes

That works out near the low end of the range I came up with for Yahoo’s newest gear, namely 36-90 TB/node. Yahoo’s biggest clusters are a little over 4,000 nodes (a limitation that’s being worked on), and Yahoo has over 20 clusters in total.

Based on those numbers, it would seem that 10 or more of Yahoo’s Hadoop clusters are probably in the petabyte range. Facebook no doubt has a few petabyte-scale Hadoop clusters as well. So we’re probably over 3 dozen petabyte+ Hadoop clusters, just counting Yahoo, Facebook, and CDH users. There surely are others too, running Apache Hadoop without Cloudera’s help.
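As a back-of-envelope sketch of the estimate above — using the 36-90 TB/node capacity range quoted for Yahoo’s newest gear, plus two assumptions that the post does not state (HDFS’s default 3x replication factor, and node capacity actually being filled with data):

```python
import math

# Rough sketch of why even modest clusters could hold a petabyte+ of
# user data. The 36-90 TB/node capacity range comes from the post; the
# 3x replication factor is the HDFS default, assumed here, and we
# pretend capacity is fully used, which of course overstates things.
REPLICATION = 3

def min_nodes_for_a_petabyte(tb_per_node):
    """Nodes whose raw capacity covers 1 PB of user data at 3x replication."""
    return math.ceil(1000 * REPLICATION / tb_per_node)

for tb in (36, 90):
    print(f"{tb} TB/node -> {min_nodes_for_a_petabyte(tb)} nodes for 1 PB of user data")
```

Under those assumptions, a petabyte of user data needs only a few dozen nodes of raw capacity, so clusters of a few hundred nodes or more clear the bar easily.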

We also have some more information about the scale of Hadoop usage, and the markets it is being used in, because Omer Trajman of Cloudera kindly wrote the following — lightly edited as usual — for quotation:

The number of petabyte+ Hadoop clusters has expanded dramatically over the past year, with our recent count reaching 22 in production (in addition to the well-known clusters at Yahoo! and Facebook). Whereas our poll back at Hadoop World 2010 showed the average cluster size at just over 60 nodes, today it tops 200. Mean is not the same as median, of course — most clusters are under 30 nodes, but there are some beefy ones pulling up that average. Outside of the well-known large clusters at Yahoo and Facebook, we today count 16 organizations running PB+ CDH clusters, across a diverse range of industries including online advertising, retail, government, financial services, online publishing, web analytics, and academic research. We expect to see many more in the coming years, as Hadoop gets easier to use and more accessible to a wide variety of enterprise organizations.

Omer went on to add:

The largest number of PB+ clusters is in the advertising space. I often tell people that every ad you see on the internet touched at least one Hadoop cluster (or the Google equivalent).

13 Responses to “Petabyte-scale Hadoop clusters (dozens of them)”

  1. Are companies addicted to Hadoop? — Cloud Computing News on July 6th, 2011 11:30 am

    […] his DBMS2 blog this morning, database expert Curt Monash quotes Cloudera Vice President of Customer Solutions Omer Trajman as stating that his employer counts 22 Hadoop clusters (not counting non-Cloudera users Facebook […]

  2. Are Companies Addicted to Hadoop? on July 6th, 2011 4:42 pm

    […] 2011 By admin Leave a Comment In his DBMS2 blog this morning, database expert Curt Monash quotes Cloudera VP of Customer Solutions Omer Trajman as stating that his employer counts 22 Hadoop clusters (not counting non-Cloudera users Facebook […]

  3. MapR Releases Commercial Distributions based on Hadoop « Another Word For It on July 7th, 2011 4:17 pm

    […] claimed to have more than 80 customers running Hadoop in production in March 2011, with 22 clusters running Cloudera’s distribution that are over a petabyte as of July […]

  4. High Scalability - Stuff The Internet Says On Scalability For July 8, 2011 on July 9th, 2011 1:56 am

    […] confirms 750 million users, sharing 4 billion items daily; Yahoo: 42,000 Hadoop nodes storing 180-200 petabytes; Formspring hits 25 million […]

  5. Quora on July 12th, 2011 1:42 pm

    What are some companies that work with peta-scale data?…

    We count 16 customers with petabyte-sized clusters at Cloudera [1]. Unfortunately, many of our customers choose not to make their names public. It wouldn’t really matter, though: essentially every large company in the web, digital media, telco, financ…

  6. Supply and demand for MapReduce technology in the GeoSpatial world » Blog for Geodelivery project on July 20th, 2011 7:25 am

    […] has been called the LAMP stack (Linux, Apache, MySQL, PHP) in the world of “big data”. 22 Hadoop clusters working with more than a petabyte of data have been spotted by Cloudera (a Hadoop vendor), and it seems to be catching on. Hadoop is one of […]

  7. Hadoop Examples « Zettaforce on July 31st, 2011 2:55 pm

    […] that operate Hadoop production clusters with one petabyte or more of data stored in each cluster. (DBMS2, July 6, […]

  8. Facebook migra 30 Petabytes con Hadoop | IP Video surveillance & Cloud Computing blog on August 16th, 2011 6:13 am

    […] Petabyte-scale Hadoop clusters (dozens of them) (dbms2.com) […]

  9. Big numbers for Big Data 大資料的大數字 « Data Torrent 資料狂潮 on November 9th, 2011 9:47 am

    […] In his Hadoop World 2011 keynote, Mike Olson (Cloudera CEO) noted that among attendees, 13.1% have more than 100 TB of data and 12.8% have more than 1 PB, with the largest single site reaching 20 PB; on average they run 120 nodes in their Hadoop clusters. Cloudera itself counts 22 customers with Hadoop deployments over 1 PB, and Yahoo worldwide has 20-odd petabyte-scale Ha…. […]

  10. Notes on the Hadoop and HBase markets : DBMS 2 : DataBase Management System Services on April 24th, 2012 4:43 am

    […] Omer Trajman — who by the way has made a long-overdue official move into technical marketing — can no longer keep count of how many petabyte-scale Hadoop clusters Cloudera supports. […]

  11. Which Big Data Company has the World's Biggest Hadoop Cluster? - HadoopWizard on January 20th, 2013 11:02 pm

    […] Yahoo! has up to 42,000 nodes in its Hadoop grids in July 2011 from Hortonworks Hadoop summit 2011 keynote and Petabyte-scale Hadoop clusters […]

  12. Inefficient on April 11th, 2016 11:26 pm

    Does anyone else notice the math doesn’t add up?
    At 36-90 TB/node capacity, this isn’t an efficient use of the nodes’ storage capacity.

    200 petabytes / 48,000 nodes = 2 petabytes / 480 nodes = 200 TB / 48 nodes ≈ 4.17 TB/node

    200 × 1,000,000,000,000,000 bytes / 48,000 = 4,166,666,666,666.67 bytes/node ≈ 4.17 TB/node

  13. Curt Monash on April 12th, 2016 7:31 pm

    Well, there’s a 3X replication factor. And there’s working space. And as old as these figures are, we can assume there isn’t much in the way of compression.
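Treating the 200 PB / 48,000-node figures from the comment above as stored user data, the reconciliation described in the reply can be sketched as follows — the 3x replication factor is HDFS’s default, an assumption rather than anything stated in the post:

```python
# Sketch of the reply above: per-node user data, then raw disk consumed
# after HDFS's default 3x replication (assumed; not stated in the post).
total_data_tb = 200 * 1000      # 200 PB of user data, in TB
nodes = 48_000                  # node count used in the comment above
replication = 3                 # HDFS default (assumption)

data_per_node = total_data_tb / nodes        # TB of user data per node
raw_per_node = data_per_node * replication   # TB actually on disk

print(f"user data per node: {data_per_node:.2f} TB")   # ~4.17 TB
print(f"raw storage per node: {raw_per_node:.2f} TB")  # ~12.5 TB
```

Add working/temp space on top of that ~12.5 TB and the figure gets within shouting distance of the 36 TB/node low end of the capacity range, which is roughly the reply’s point.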
