April 30, 2009

eBay’s two enormous data warehouses

A few weeks ago, I had the chance to visit eBay, meet briefly with Oliver Ratzesberger and his team, and then catch up later with Oliver for dinner. I’ve already alluded to those discussions in a couple of posts, specifically on MapReduce (which eBay doesn’t like) and the astonishingly great difference between high- and low-end disk drives (to which eBay clued me in). Now I’m finally getting around to writing about the core of what we discussed, which is two of the very largest data warehouses in the world.

Metrics on eBay’s main Teradata data warehouse include:

Metrics on eBay’s Greenplum data warehouse (or, if you like, data mart) include:

eBay’s Teradata installation is a full enterprise data warehouse. Besides size and scope, it is most notable for its implementation of Oliver’s misleadingly named analytics-as-a-service vision. In essence, eBay spins out dozens of virtual data marts, which:

The whole scheme relies heavily on Teradata’s workload management software to deliver with assurance on many SLAs (Service-Level Agreements) at once. Resource partitions are a key concept in all this.
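To make the resource-partition idea concrete, here is a toy sketch of how weighted shares let many SLAs coexist on one system. The partition names and weights are made up for illustration; Teradata's actual workload management is far more elaborate.

```python
# Toy illustration of resource partitioning: each virtual data mart gets
# a weighted share of total capacity, recomputed over whichever
# partitions are currently active. Names and weights are hypothetical.
partitions = {"mart_search": 40, "mart_finance": 35, "mart_adhoc": 25}

def share(partition, active):
    """Fraction of capacity a partition gets among currently active ones."""
    total = sum(partitions[p] for p in active)
    return partitions[partition] / total

# With only two partitions active, each one's share grows proportionally.
print(share("mart_search", ["mart_search", "mart_finance"]))  # 40/75, ~0.533
```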

So far as I can tell, eBay uses Greenplum to manage one kind of data — web and network event logs. These seem to be managed primarily at two levels of detail — Oliver said that the 17 trillion event detail records reduce to 1 trillion real event records. When I asked where the 17:1 ratio comes from, Oliver explained that a single web page click — which is what is memorialized in an event record — resulted in 50-150 details. That leaves a missing factor of 3-8X, but perhaps other less complex kinds of events are also mixed in.
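A rough reconciliation of those figures, assuming (my assumption, not Oliver's) that the simpler, non-click events carry only about one detail apiece and that a typical click averages ~100 details:

```python
# Sanity check on the detail-to-event ratio described above.
# Figures from the article; the per-click detail count and the
# one-detail-per-simple-event assumption are illustrative.
detail_records = 17e12   # 17 trillion detail records
event_records = 1e12     # 1 trillion event records

blended_ratio = detail_records / event_records
print(blended_ratio)     # 17 details per event, blended across all types

# If full web clicks average ~100 details and other events ~1, solve
# blended = share*100 + (1 - share)*1 for the share of full clicks:
click_details = 100
click_share = (blended_ratio - 1) / (click_details - 1)
print(round(click_share, 3))  # roughly 1 event in 6 is a full click
```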

The Greenplum metrics I quoted above represent over 100 days of data. Ultimately, eBay expects to keep 90-180 days of full detail, and over a year of event data. The 6 1/2 petabyte figure comes from dividing 2 petabytes of compressed data by (100% - 70%). Since that all fits on a 4 1/2 petabyte system, I presume there’s only one level of mirroring (duh), not much temp space, and even less in the way of indexes.
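That compression arithmetic, as a quick check (assuming the compressed figure is 2 petabytes):

```python
# Uncompressed size = compressed size / (1 - compression rate).
compressed_pb = 2.0       # petabytes of compressed data
compression_rate = 0.70   # 70% compression

uncompressed_pb = compressed_pb / (1 - compression_rate)
print(round(uncompressed_pb, 2))  # about 6 1/2 petabytes
```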

Two uses of eBay’s Greenplum database are disclosed — whittling down from detailed to click-level event data, and sessionization. The latter seems to be done in batch runs and takes 30 minutes per day. A couple of other uses are undisclosed. I assume eBay is doing something that requires UDFs (User-Defined Functions), because Oliver remarked that he likes the language choices offered by Greenplum’s Postgres-based UDF capability. But basically eBay’s Greenplum database is used for and evidently does very nicely at:

eBay’s Teradata database handles the rest.
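For readers unfamiliar with the sessionization step mentioned above: it groups click-level events into user sessions, conventionally starting a new session after a 30-minute idle gap. A minimal sketch, with illustrative field names rather than eBay's actual schema:

```python
from datetime import datetime, timedelta

# New session whenever a user is idle longer than this (a common
# convention; eBay's actual threshold is not disclosed here).
SESSION_GAP = timedelta(minutes=30)

def sessionize(events):
    """events: iterable of (user_id, ts) tuples.
    Returns (user_id, ts, session_id) tuples, ordered per user."""
    events = sorted(events, key=lambda e: (e[0], e[1]))
    out = []
    last_user, last_ts, sid = None, None, 0
    for user, ts in events:
        # Open a new session on a user change or an over-threshold gap.
        if user != last_user or (ts - last_ts) > SESSION_GAP:
            sid += 1
        out.append((user, ts, sid))
        last_user, last_ts = user, ts
    return out

clicks = [("u1", datetime(2009, 4, 30, 9, 0)),
          ("u1", datetime(2009, 4, 30, 9, 10)),
          ("u1", datetime(2009, 4, 30, 11, 0))]  # 1h50m idle: new session
print(sessionize(clicks))
```

In a warehouse this would be done in SQL (e.g. with window functions) rather than row-at-a-time Python, but the grouping logic is the same.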

Related links:

Comments

41 Responses to “eBay’s two enormous data warehouses”

  1. Hyperbio » Blog Archive » eBay’s enormous data warehouses on April 30th, 2009 10:39 am

    […] Monash meets with Ebay’s Oliver Ratzesberger and gets us numbers on two of the largest data warehouses in the world. Look at these Ebay […]

  2. Greg Rahn on April 30th, 2009 12:46 pm

    @Curt

    >140 GB/sec of I/O [for 72 nodes], or 2 GB/node/sec

    I’m finding this I/O number very difficult to believe. If eBay has Teradata’s latest nodes, the 5550 series, then each node could have two quad-core Intel® Xeon® 5400 processors (2.33GHz) and the option of quad 4Gb Fibre Channel. Four 4Gb FC ports would be 4 x 400MB/s = 1600MB/s max. By that math, a physical I/O rate of 2GB/s/node is impossible even at max wire speed; the links would allow only 1600MB/s/node. I would also comment that two quad-core Xeon 5400 series processors would not be able to do anything but a SELECT COUNT(*) while ingesting data at 2GB/s (or even 1600MB/s).

  3. Tech infrastructure at eBay and Craigslist « ecpm blog on April 30th, 2009 3:34 pm

    […] eBay’s two enormous data warehouses | DBMS2 — DataBase Management System Services […]

  4. Analytics Team » Blog Archive » Web analytics databases keep getting bigger on April 30th, 2009 10:22 pm

    […] Ebay has a 6.5 petabyte Greenplum warehouse and a 2.5 petabyte Teradata warehouse. This system ingests hundreds of billions of new rows of data every day. Facebook has a 2.5 petabyte Hadoop system. Yahoo has more than 1 petabyte running on their homemade system […]

  5. Luke Lonergan on May 1st, 2009 5:54 am

    Great to see this project out in the open Curt!

    See my blog post about this.

    Key important aspects of this from my perspective are:
    – Full SQL analytics on 5+ PB of data, scaling to tens of petabytes
    – Small datacenter footprint: 1/10 the power consumption and floor space

    WRT IO performance, the 200 MB/s is the rate at which compressed data is read from disk. The speed the SQL user sees is the net effective rate after decompression, which is the compression ratio times the physical rate. Measuring the disk IO rate is only telling us how fast the decompression is running in this case. With uncompressed data, these same systems perform disk IO at 2,000 MB/s while executing queries.

    We now have an improved compression scheme that delivers much higher effective rates, which will also make the disk rate higher, though that’s mostly irrelevant.

    – Luke

  6. Amr Awadallah on May 1st, 2009 9:26 am

    @Curt,

    Nice writeup.

    That said, I think bragging rights for size should not be how much is stored, we know that partitioning/sharding can allow you to inflate that number.

    The bragging rights should be how much data can be processed in a *single query* (especially something like a global distinct, order-by or group-by over a long period of time).

    Cheers,

    — amr

  7. Curt Monash on May 1st, 2009 9:35 am

    Amr,

    The only thing that limits querying the whole database in one go is the complexity of the schema. But in most cases — eBay’s Teradata EDW may be an exception — I’d guess that the largest fact table is well over half of the whole database size listed.

  8. Michael McIntire on May 1st, 2009 2:30 pm

    I’d like to see some appropriate measure between vendors; unfortunately, the TPC benchmarks didn’t work out as long as we’d all hoped.

    As to Single Query speed – I think this only counts if that’s what your SLA needs are.

    I’d much rather see a measure of throughput over the dev/ops people investment.

  9. Luke Lonergan on May 1st, 2009 5:55 pm

    I agree with Amr, and we win hands down.

    :-)

    – Luke

  10. meneame.net on May 4th, 2009 12:14 am

    eBay’s monstrous datacenters (ENG)…

    We’ve all wondered at some point how much bandwidth some of the internet giants, like YouTube, Google and the rest, consume. Well, here we get very concrete data on eBay’s datacenters, which also specifies, …

  11. Random Crap I Learned Recently, Vol. 1 | Devon Biere's Blog on May 6th, 2009 2:18 am

    […] went on to play Match again in Back to the Future Part II. [5]  Per Curt Monash’s research blog entry as referenced by his Slashdot posting. [6]  Wikipedia’s Petabyte entry. [7]   Kevin […]

  12. anonymous on May 6th, 2009 11:04 am

    Greg Rahn notes “Four 4GBFC would be 4 x 400MB/s = 1600MB/s max”.

    I went to the link and looked up the specs on the Teradata 5550 node. The data sheet says there are 3 PCI-X slots. It also says I/O can be 4 GB dual or quad fibre channel. My interpretation is dual or quad PCI-X adapter cards. With 3 PCI-X slots that means Teradata can have up to 12 Fibre Channel links per node for a theoretical bandwidth of 4.8 GB/s. The limiting factor is probably three 133 MHz PCI-X buses, which are 1066 MB/s apiece giving 3 GB/s per node.

    Mr Rahn also says “I would also comment that 2 Quad Core Xeon 5400 series processors would not be able to do anything but a SELECT COUNT(*) and ingest data at 2GB/s (or even 1600MB/s).” But yet eBay says they are doing it – and a lot more.

  13. Yan on May 6th, 2009 5:34 pm

    Great to open the discussion of VLDB.

    Without regard to end-user query performance, the size of a DB is meaningless; at most it measures data storage. Any practical limits on VLDB size are set by the system with both size and performance in mind. It would be nice to have some numbers for this.

    Yan

  14. Greg Rahn on May 8th, 2009 2:27 am

    @anonymous

    I would agree with your interpretation of the dual/quad fibre channel [card]. To be honest, I would have never guessed anyone made a 4 port HBA but I guess LSI does: LSI7404XP-LC and given Teradata resells LSI Engenio storage it is likely they use their HBAs also. Given a 2 port 4GFC PCI-X HBA can deliver 80% of the slot bandwidth, it seems like a bit of a waste to go to 4 ports, at least for performance. For connectivity, perhaps, which is why I believe Teradata may use it – for their cliques.

    The other reason for my comment of a max of 1600MB/s per node is that the LSI Engenio 6998 array only does a max of 1600MB/s (per LSI’s presentation) and the 7900 array is quite new so it would seem doubtful that eBay uses that one. They may have opted for the EMC DMX storage, but I would think that would be an extremely costly solution at 72 nodes.

    I may soften up a bit on the I/O rate also. Curt’s comment “maybe that’s a peak when the workload is scan-heavy” is probably correct. Peak rate vs. sustained, I could see that. Maybe they do some light scans of some large de-normalized table making it peak out at 2GB/s per node. But that number still seems quite high at 250MB/s per CPU core. Doing group bys and aggregation I’m sure that number drops fast.

    I think the interesting, and unmentioned, data point is how many hard drives are in this 72 node config to deliver this I/O number. eBay’s own Michael McIntire reports that Teradata I/O is all random, so the MBPS rate per drive is probably somewhere around 30MB/s (give or take). My guess is that there are somewhere between 4500 and 5000 HDDs (around 64 HDDs per node).

  15. Facebook, Hadoop, and Hive | DBMS2 -- DataBase Management System Services on May 11th, 2009 4:34 am

    […] eBay has a 6 1/2 petabyte database running on Greenplum and a 2 1/2 petabyte enterprise data warehouse running on Teradata. […]

  16. Diverging views on density, and some gimmes « Data Beta on May 14th, 2009 4:33 am

    […] Monash posted that eBay hosts a 6.5 petabyte Greenplum database on 96 […]

  17. blog.rbach.net - Server Sprawl Continues on May 16th, 2009 6:40 pm

    […] million users on Skype, eBay has a massive data center infrastructure. The company houses more than 8.5 petabytes of data in huge data warehouses. We’re not certain what kind of server count this requires, but it’s certainly in the […]

  18. Andrew on May 23rd, 2009 9:34 pm

    Yahoo’s main data warehouse was up to 3 petabytes compressed at the end of 2007.

  19. More on Fox Interactive Media’s use of Greenplum | DBMS2 -- DataBase Management System Services on June 8th, 2009 12:48 am

    […] most important reference is probably its energetic advocate Fox Interactive Media, even ahead of much larger user Greenplum user eBay, and notwithstanding Aster Data’s large presence in Fox subsidiary MySpace. I just ran across […]

  20. BI-Quotient » Blog Archive » Greenplum, MapReduce, and Hadoop on June 18th, 2009 4:10 pm

    […] 6.5 Petabytes of data eBay runs the world’s largest data warehouse on Greenplum. Facebook runs a 2 PB warehouse on […]

  21. SQL is Dead. Long Live SQL. : Dataspora Blog on November 25th, 2009 6:58 am

    […] not that relational databases can’t scale – in fact, they can and do scale to petabytes, as those who know Fortune 500 enterprise computing can attest . The problem is that relational databases don’t scale easily – and require a lot of ETL […]

  22. Big Data Is Less About Size, And More About Freedom | Venture Capital & Angel Investors Lists News and Jobs on March 16th, 2010 10:23 pm

    […] mean for the enterprise, but they have had big data for a long time. eBay manages petabytes in its Teradata and Greenplum data warehouses. Sophisticated startups extracting value from big data is also nothing new—it has […]

  23. Big Data Is Less About Size, And More About Freedom | ebusyet.com on March 17th, 2010 6:51 am

    […] mean for the enterprise, but they have had big data for a long time. eBay manages petabytes in its Teradata and Greenplum data warehouses. Sophisticated startups extracting value from big data is also nothing new—it has […]

  24. Big Data Is Less About Size, And More About Freedom | Experts Developers on March 18th, 2010 12:40 am

    […] mean for the enterprise, but they have had big data for a long time. eBay manages petabytes in its Teradata and Greenplum data warehouses. Sophisticated startups extracting value from big data is also nothing new—it has […]

  25. T. Shoone on April 27th, 2010 3:43 pm

    Does ebay really use Greenplum? I was told they no longer use Greenplum. Can someone comment?

  26. Curt Monash on April 27th, 2010 7:29 pm

    Every time I check, eBay is still using Greenplum.

  27. Big Data Is Less About Size, And More About Freedom on June 20th, 2010 3:19 am

    […] mean for the enterprise, but they have had big data for a long time. eBay manages petabytes in its Teradata and Greenplum data warehouses. Sophisticated startups extracting value from big data is also nothing new—it has […]

  28. Big Data Is Less About Size, And More About Freedom « Gadget Fee on June 29th, 2010 12:54 pm

    […] mean for the enterprise, but they have had big data for a long time. eBay manages petabytes in its Teradata and Greenplum data warehouses. Sophisticated startups extracting value from big data is also nothing new—it has […]

  29. Netezza on concurrency and workload management | DBMS2 -- DataBase Management System Services on August 18th, 2010 2:18 am

    […] Netezza customer has been rapidly spinning out virtual data marts, in a manner somewhat akin to eBay’s virtual data mart/”analytics-as-a-service” strategy* since 2004. However, the whole thing isn’t necessarily as slick as what eBay has going. This […]

  30. Big Data Is Less About Size, And More About Freedom | Big Data Cloud on September 2nd, 2010 11:38 pm

    […] mean for the enterprise, but they have had big data for a long time. eBay manages petabytes in its Teradata and Greenplum data warehouses. Sophisticated startups extracting value from big data is also nothing new—it has been happening […]

  31. eBay followup — Greenplum out, Teradata > 10 petabytes, Hadoop has some value, and more | DBMS 2 : DataBase Management System Services on October 6th, 2010 9:21 am

    […] has thrown out Greenplum. eBay’s 6 ½ petabyte Greenplum database has turned into a >10 petabyte Teradata database, which will grow 2 1/2x further in size […]

  32. Hadoop could help companies keep up in the ‘Big Data’ era « Wikibon Blog on March 29th, 2011 3:23 pm

    […] of data warehouses reaching the petabyte-level thanks to advances in parallel processing (think eBay’s two massive data warehouses, one each run on Teradata’s and Greenplum’s platforms.) But not every enterprise has eBay’s […]

  33. SQL is Dead. Long Live SQL! « Dataspora on April 26th, 2011 11:26 am

    […] not that relational databases can’t scale – in fact, they can scale to petabytes, as those who know Fortune 500 enterprise computing can attest . The problem is that relational databases require lots of ETL cruft to munge fluid blobs of data […]

  34. chris on June 10th, 2011 7:10 am

    I liked the article. In fact, it assisted me with my presentation. Thanks a lot.

  35. Greenplum links and resources « Dirty Cache on September 15th, 2011 10:20 am

    […] – EMC Delivers Hadoop ‘Big Data’ Analytics to the Enterprise eBay’s two enormous data warehouses More on Fox Interactive Media’s use of […]

  36. Big numbers for Big Data 大資料的大數字 « Data Torrent 資料狂潮 on November 9th, 2011 9:44 am

    […] Back in 2009, eBay already had two giant data warehouses, using GreenPlum and Terradata solutions respectively, each managing several petabytes of data. Even more impressive, they sustain I/O rates of hundreds of MB, or more than a GB, per second. Many other large companies already had petabyte-scale data warehouses in 2008. […]

  37. A Sad Story: eBay | synsynack on December 15th, 2011 6:35 am

    […] the 21st most visited website in the world (7th in the US) according to Alexa.com. They store about 8.5 Petabytes of user data across two […]

  38. Blog DBA: Big Data: Greenplum Database on October 5th, 2012 3:11 am

    […] small number of “concurrent users”. Source: greenpum at ebay. Greenplum is available in three […]

  39. todozambiacom on January 4th, 2014 11:30 pm

    This text is worth everyone’s attention. Where can I
    find out more?

  40. Big Data | hzpstat on July 15th, 2014 11:04 am

    […] Monash, Curt (30 April 2009). “eBay’s two enormous data warehouses” […]

  41. Notes from a visit to Teradata | DBMS 2 : DataBase Management System Services on August 31st, 2014 5:17 am

    […] Oliver Ratzesberger runs Teradata’s software development. […]
