Petabyte-scale data management

Posts about database management for databases with petabytes of user data.

April 30, 2009

eBay’s two enormous data warehouses

A few weeks ago, I had the chance to visit eBay, meet briefly with Oliver Ratzesberger and his team, and then catch up later with Oliver for dinner. I’ve already alluded to those discussions in a couple of posts, specifically on MapReduce (which eBay doesn’t like) and the astonishingly great difference between high- and low-end disk drives (to which eBay clued me in). Now I’m finally getting around to writing about the core of what we discussed, which is two of the very largest data warehouses in the world.

Metrics on eBay’s main Teradata data warehouse include:

Metrics on eBay’s Greenplum data warehouse (or, if you like, data mart) include:

Read more

April 15, 2009

Cloudera presents the MapReduce bull case

Monday was fire-drill day regarding MapReduce vs. MPP relational DBMS. The upshot was that I was quoted in Computerworld and paraphrased in GigaOm as being a little more negative on MapReduce than I really am, in line with my comment

Frankly, my views on MapReduce are more balanced than [my] weary negativity would seem to imply.

Tuesday afternoon the dial turned a couple notches more positive yet, when I talked with Michael Olson and Jeff Hammerbacher of Cloudera. Cloudera is a new company, built around the open source MapReduce implementation Hadoop. So far Cloudera gives away its Hadoop distribution, without charging for any sort of maintenance or subscription, and just gets revenue from professional services. Presumably, Cloudera plans for this business model to change down the road.

Much of our discussion revolved around Facebook, where Jeff directed a huge and diverse Hadoop effort. Apparently, Hadoop played much of the role of an enterprise data warehouse at Facebook — at least for clickstream/network data — including:

Some Facebook data, however, was put into an Oracle RAC cluster for business intelligence. And Jeff does concede that query execution is slower in Hadoop than in a relational DBMS. Hadoop was also used to build the index for Facebook’s custom text search engine.

Jeff’s reasons for liking Hadoop over relational DBMS at Facebook included: Read more

October 15, 2008

Teradata’s Petabyte Power Players

As previously hinted, Teradata has now announced 4 of the 5 members of its “Petabyte Power Players” club.  These are enterprises with 1+ petabyte of data on Teradata equipment.  As is commonly the case when Teradata discusses such figures, there’s some confusion as to how they’re actually counting.  But as best I can tell, Teradata is counting: Read more

August 25, 2008

Greenplum’s single biggest customer

Greenplum offered a bit of clarification regarding the usage figures I posted last night. Everything on the list is in production, except that:

August 25, 2008

Greenplum is in the big leagues

After a March, 2007 call, I didn’t talk again with Greenplum until earlier this month. That changed fast. I flew out to see Greenplum last week and spent over a day with president/co-founder Scott Yara, CTO/co-founder Luke Lonergan, marketing VP Paul Salazar, and product management/marketing director Ben Werther. Highlights – besides some really great sushi at Sakae in Burlingame – start with an eye-opening set of customer proof points, such as: Read more

May 29, 2008

Yahoo scales its web analytics database to petabyte range

Information Week has an article with details on what sounds like Yahoo’s core web analytics database. Highlights include:

January 10, 2008

Netezza targets 1 petabyte

Netezza is promising petabyte-scale appliances later this year, up from 100 terabytes. That’s user data (I checked), and assumes 2-3X compression, or a little less than they think is actually likely. I.e., they’re describing their capacity in the same kinds of terms other responsible vendors do. They haven’t actually built and tested any 1 petabyte systems internally yet, but they’ve gone over 100 terabytes.

Basically, this leaves Netezza’s high-end capability about 10X below Teradata’s. On the other hand, it should leave them capable of handling pretty much every Teradata database in existence. Read more

October 9, 2007

Marketing versus reality on the one-petabyte barrier

Usually, I don’t engage in the kind of high-speed quick-response blogging I have over the past couple of days from the Teradata Partners conference (and more generally have for the past week or so). And I’m not sure it’s working out so well.

For example, the claim that Teradata has surpassd the one-petabyte mark comes as quite a surprise to variety of Teradata folks, not to mention at least one reliable outside anonymous correspondent. That claim may indeed be true about raw disk space on systems sold. But the real current upper limit, according to CTO Todd Walter,* is 5-700 terabytes of user data. He thinks half a dozen or so customers are in that range. I’d guess quite strongly that three of those are Wal-Mart, eBay, and an unspecified US intelligence agency.

*Teradata seems to have quite a few CTOs. But I’ve seen things much sillier than that in the titles department, and accordingly shan’t scoff further — at least on that particular subject. 😉

On the other hand, if anybody did want to buy a 10 petabyte system, Teradata could ship them one. And by the way, the Teradata people insist Sybase’s claims in the petabyte area are quite bogus. Teradata claims to have had bigger internal systems tested earlier than the one Sybase writes about.

October 9, 2007

Yet more on petabyte-scale Teradata databases

I managed to buttonhole Teradata’s Darryl MacDonald again, to follow up on yesterday’s brief chat. He confirmed that there are more than one petabyte+ Teradata databases out there, of which at least one is commercial rather than government/classified. Without saying who any of them were, he dropped a hint suggestive of Wal-Mart. That makes sense, given that a 423 terabyte figure for Wal-Mart is now three years old, and Wal-Mart is in the news for its 4 petabyte futures. Yes, that news has tended to mention HP NeoView recently more than Teradata. But it seems very implausible that a NeoView replacement of Teradata has already happened, if if such a thing is a possibility for the future. So right now however much data Wal-Mart has on its path from 423 terabytes to 4 petabytes and beyond is probably collected mainly on Teradata machines.

← Previous Page

Feed: DBMS (database management system), DW (data warehousing), BI (business intelligence), and analytics technology Subscribe to the Monash Research feed via RSS or email:


Search our blogs and white papers

Monash Research blogs

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.