Discussion of Yahoo’s use of database and analytic technology. Related subjects include:

August 21, 2011

Hadoop evolution

I wanted to learn more about Hadoop and its futures, so I talked Friday with Arun Murthy of Hortonworks.* Most of what we talked about was:

Arun previously addressed these issues and more in a June slide deck.
Read more

July 10, 2011

Cloudera and Hortonworks

My clients at Cloudera have been around for a while, in effect positioned as “the Hadoop company.” Their business, in a nutshell, consists of:

Hortonworks spun out of Yahoo last week, with parts of the Cloudera business model, namely Hadoop support, training, and I guess conferences. Hortonworks emphatically rules out professional services, and says that it will contribute all code back to Apache Hadoop. Hortonworks does grudgingly admit that it might get into the proprietary software business at some point — but evidently hopes that day will never actually come.

Read more

July 6, 2011

Petabyte-scale Hadoop clusters (dozens of them)

I recently learned that there are 7 Vertica clusters with a petabyte (or more) each of user data. So I asked around about other petabyte-scale clusters. It turns out that there are several dozen such clusters (at least) running Hadoop.

Cloudera can identify 22 CDH (Cloudera Distribution [of] Hadoop) clusters holding one petabyte or more of user data each, at 16 different organizations. This does not count Facebook or Yahoo, who are huge Hadoop users but not, I gather, running CDH. Meanwhile, Eric Baldeschwieler of Hortonworks tells me that Yahoo’s latest stated figures are:

Read more

June 4, 2011

Hardware for Hadoop

After suggesting that there’s little point to Hadoop appliances, it occurred to me to look into what kinds of hardware actually are used with Hadoop. So far as I can tell:

Read more

October 1, 2009

Yahoo wants to do decapetabyte-scale data warehousing in Hadoop

My old client Mark Tsimelzon moved over to Yahoo after Coral8 was acquired, and I caught up with him last month. He turns out to be running development for a significant portion of Yahoo’s Hadoop effort — everything other than HDFS (Hadoop Distributed File System). Yahoo evidently plans to, within a year or so, get Hadoop to the point that it is managing 10s of petabytes of data for Yahoo, with reasonable data warehousing functionality.

Highlights of our visit included:

Read more

September 19, 2009

Oracle gives a few customer database size examples

In its recent quarterly conference call, Oracle said (as per the Seeking Alpha transcript):

AC Neilsen, for instance, we deployed a 45-terabyte data [mart], they called it; Adidas, 13 terabytes; Australian Bureau of Statistics, 250 terabytes; and of course, some of our high-end ones that you have probably heard of in the past, AT&T, 250 terabytes; Yahoo!, 700 terabytes — just gives you an idea of the size of the databases that are out there and how they are growing, and that’s driving the need for greater throughput.

I don’t know what’s being counted there, but I wouldn’t be surprised if those were legit user-data figures.

Some other notes:

July 6, 2009

Yahoo is up to 10 petabytes now?

According to somebody (I forget who) who attended Yahoo’s SIGMOD presentation last week, the big Yahoo database is now up to 10 petabytes in size, in line with Yahoo’s predictions last year.  Apparently, Yahoo also gave more details of how the technology works.

May 11, 2009

Facebook, Hadoop, and Hive

I few weeks ago, I posted about a conversation I had with Jeff Hammerbacher of Cloudera, in which he discussed a Hadoop-based effort at Facebook he previously directed. Subsequently, Ashish Thusoo and Joydeep Sarma of Facebook contacted me to expand upon and in a couple of instances correct what Jeff had said. They also filled me in on Hive, a data-manipulation add-on to Hadoop that they developed and subsequently open-sourced.

Updating the metrics in my Cloudera post,

Nothing else in my Cloudera post was called out as being wrong.

In a new-to-me metric, Facebook has 610 Hadoop nodes, running in a single cluster, due to be increased to 1000 soon. Facebook thinks this is the second-largest* Hadoop installation, or else close to it. What’s more, Facebook believes it is unusual in spreading all its apps across a single huge cluster, rather than doing different kinds of work on different, smaller sub-clusters. Read more

September 24, 2008

Some of Oracle’s largest data warehouses

Googling around, I came across an Oracle presentation – given some time this year – that lists some of Oracle’s largest data warehouses. 10 databases total are listed with >16 TB, which is fairly consistent with Larry Ellison’s confession during the Exadata announcement that Oracle has trouble over 10 TB (which is something I’ve gotten a lot of flack from a few Oracle partisans for pointing out … 😀 ).

However, what’s being measured is probably not the same in all cases. For example, I think the Amazon 70 TB figure is obviously for spinning disk (elsewhere in the presentation it’s stated that Amazon has 71 TB of disk). But the 16 TB British Telecom figure probably is user data — indeed, it’s the same figure Computergram cited for BT user data way back in 2001.

The list is: Read more

May 29, 2008

Yahoo scales its web analytics database to petabyte range

Information Week has an article with details on what sounds like Yahoo’s core web analytics database. Highlights include:

← Previous Page

Feed: DBMS (database management system), DW (data warehousing), BI (business intelligence), and analytics technology Subscribe to the Monash Research feed via RSS or email:


Search our blogs and white papers

Monash Research blogs

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.