Discussion of eBay’s use of database and analytic technology. Related subjects include:
- The use of analytic technologies to study web and network event data
Before I write anything else about the overlapping efforts known as XLDB and SciDB, I probably should explain and disambiguate what they are as best I can. XLDB was organized and still is run by guys who want to solve a scientific problem in eXtremely Large DataBase Management, most especially Jacek Becla of SLAC (the organization previously known as Stanford Linear Accelerator Center). Becla’s original motivation was that he needs a DBMS to manage what will be 55 petabytes of raw image data and 100 petabytes of astronomical data total for LSST (Large Synoptic Survey Telescope).
|Categories: Data models and architecture, Database diversity, eBay, Michael Stonebraker, Open source, Petabyte-scale data management, Scientific research, Theory and architecture||2 Comments|
Greenplum is announcing today a long-term vision, under the name Enterprise Data Cloud (EDC). Key observations around the concept — mixing mine and Greenplum’s together — include:
- Data marts aren’t just for performance (or price/performance). They also exist to give individual analysts or small teams control of their analytic destiny.
- Thus, it would be really cool if business users could have their own analytic “sandboxes” — virtual or physical analytic databases that they can manipulate without breaking anything else.
- In any case, business users want to analyze data when they want to analyze it. It is often unwise to ask business users to postpone analysis until after an enterprise data model can be extended to fully incorporate the new data they want to look at.
- Whether or not you agree with that, it’s an empirical fact that enterprises have many legacy data marts (or even, especially due to M&A, multiple legacy data warehouses). Similarly, it’s an empirical fact that many business users have the clout to order up new data marts as well.
- Consolidating data marts onto one common technological platform has important benefits.
In essence, Greenplum is pitching the story:
- Thesis: Enterprise Data Warehouses (EDWs)
- Antithesis: Data Warehouse Appliances
- Synthesis: Greenplum’s Enterprise Data Cloud vision
When put that starkly, it’s overstated, not least because
Specialized Analytic DBMS != Data Warehouse Appliance
But basically it makes sense, for two main reasons:
- Analysis is performed on all sorts of novel data, from sources far beyond an enterprise’s core transactions. This data neither has to fit nor particularly benefits from being tightly fitted into the core enterprise data model. Requiring it to do so is just an unnecessary and painful bureaucratic delay.
- On the other hand, consolidation can be a good idea even when systems don’t particularly interoperate. Data marts, which commonly do in part interoperate with central data stores, have all the more reason to be consolidated onto a central technology platform/stack.
|Categories: Analytic technologies, Data warehouse appliances, Data warehousing, DATAllegro, EAI, EII, ETL, ELT, ETLT, eBay, Greenplum, Microsoft and SQL*Server, Parallelization, Specific users, Teradata||26 Comments|
A few weeks ago, I had the chance to visit eBay, meet briefly with Oliver Ratzesberger and his team, and then catch up later with Oliver for dinner. I’ve already alluded to those discussions in a couple of posts, specifically on MapReduce (which eBay doesn’t like) and the astonishingly great difference between high- and low-end disk drives (to which eBay clued me in). Now I’m finally getting around to writing about the core of what we discussed, which is two of the very largest data warehouses in the world.
Metrics on eBay’s main Teradata data warehouse include:
- >2 petabytes of user data
- 10s of 1000s of users
- Millions of queries per day
- 72 nodes
- >140 GB/sec of I/O, or 2 GB/node/sec, or maybe that’s a peak when the workload is scan-heavy
- 100s of production databases being fed in
Metrics on eBay’s Greenplum data warehouse (or, if you like, data mart) include:
- 6 1/2 petabytes of user data
- 17 trillion records
- 150 billion new records/day, which seems to suggest an ingest rate well over 50 terabytes/day
- 96 nodes
- 200 MB/node/sec of I/O (that’s the order of magnitude difference that triggered my post on disk drives)
- 4.5 petabytes of storage
- 70% compression
- A small number of concurrent users
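The ingest estimate and the per-node I/O gap can be checked with a little arithmetic. The sketch below uses only figures quoted in the two lists above. It assumes decimal units (1 PB = 10^15 bytes) and that new records are about the same size as the existing average, neither of which eBay stated.

```python
# Back-of-envelope check of the figures quoted above.
# Assumptions (not from eBay): decimal units, and new records
# are about as large as the existing average record.
PB = 1e15
TB = 1e12

user_data_bytes = 6.5 * PB      # Greenplum system: user data
total_records = 17e12           # 17 trillion records
new_records_per_day = 150e9     # 150 billion new records/day

avg_record_bytes = user_data_bytes / total_records
ingest_tb_per_day = new_records_per_day * avg_record_bytes / TB

# Per-node I/O: Teradata (>140 GB/sec over 72 nodes) vs. Greenplum (200 MB/node/sec)
teradata_gb_per_node_sec = 140 / 72
greenplum_gb_per_node_sec = 0.2
io_ratio = teradata_gb_per_node_sec / greenplum_gb_per_node_sec

print(f"average record size: {avg_record_bytes:.0f} bytes")
print(f"implied ingest rate: {ingest_tb_per_day:.1f} TB/day")
print(f"per-node I/O ratio:  {io_ratio:.1f}x")
```

Under those assumptions the average record is a few hundred bytes, the implied ingest rate lands in the high 50s of terabytes per day, and the per-node I/O gap is just under 10X, consistent with the "order of magnitude" description.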
|Categories: Analytic technologies, Data warehouse appliances, Data warehousing, eBay, Greenplum, Petabyte-scale data management, Teradata, Web analytics||41 Comments|
This is a long post, so I’m going to recap the highlights up front. In the opinion of somebody I have high regard for, namely Carson Schmidt of Teradata:
- There’s currently a huge — one order of magnitude — performance difference between cheap and expensive disks for data warehousing workloads.
- New disk generations coming soon will have best-of-both-worlds aspects, combining high-end performance with lower-end cost and power consumption.
- Solid-state drives will likely add one or two orders of magnitude to performance a few years down the road. Echoing the most famous logjam in VC history — namely the 60+ hard disk companies that got venture funding in the 1980s — 20+ companies are vying to cash in.
In other news, Carson likes 10 Gigabit Ethernet, dislikes Infiniband, and is “ecstatic” about Intel’s Nehalem, which will be the basis for Teradata’s next generation of servers.
|Categories: Data warehouse appliances, Data warehousing, eBay, Solid-state memory, Storage, Teradata||16 Comments|
Last week I talked with Oliver Ratzesberger and his team at eBay, whom I already knew to be MapReduce non-fans. This time I came away with more detail.
Oliver believes that, on the whole, MapReduce is 6-8X slower than native functionality in an MPP DBMS, and hence should only be used sporadically. This view is based in part on simulations eBay ran of the Terasort benchmark. On 72 Teradata nodes or 96 lower-powered nodes running another (currently unnamed, as per yet another of my PR fire drills) MPP DBMS, a simulation of Terasort executed in 78 and 120 secs respectively, which is very comparable to the times Google and Yahoo got on 1000 nodes or more.
And by the way, if you use many fewer nodes, you also consume much less floor space or electric power.
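One rough way to read that comparison is in node-seconds, i.e. nodes multiplied by elapsed time. The sketch below uses the node counts and times quoted above. The post does not give the Google or Yahoo run times, so the 1000-node figure is a purely hypothetical input that assumes the large cluster simply matches the slower MPP time.

```python
def node_seconds(nodes, elapsed_secs):
    """Crude resource measure: node count times wall-clock seconds."""
    return nodes * elapsed_secs

# Figures quoted in the post.
teradata = node_seconds(72, 78)     # 72 Teradata nodes, 78 secs
other_mpp = node_seconds(96, 120)   # 96 lower-powered nodes, 120 secs

# Hypothetical: a 1000-node cluster finishing in the same 120 secs.
# The elapsed time is an assumption, not a reported result.
hypothetical_cluster = node_seconds(1000, 120)

print(teradata, other_mpp, hypothetical_cluster)
print(f"vs. Teradata:  {hypothetical_cluster / teradata:.1f}x the node-seconds")
print(f"vs. other MPP: {hypothetical_cluster / other_mpp:.1f}x the node-seconds")
```

Node-seconds ignore differences in hardware per node, so this is only an order-of-magnitude framing rather than a benchmark result.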
Neither Greenplum nor eBay will say for the record that eBay is a Greenplum customer. Indeed, saying that is quite verboten. On the other hand, Greenplum’s press release boilerplate says that Skype is a Greenplum customer, and Skype is of course a subsidiary of eBay. (Edit: Speaking of silliness, fixed a typo there.)
The point of such distinctions is sometimes lost on me.
In related news, of Greenplum’s two customers who back in August were supposedly heading into production soon with petabyte-plus databases, one hasn’t yet made it to that size. (“As we speak” turned out to be a longer conversation than I might have anticipated ….) The other (of course unnamed) customer has, Greenplum assures me, made it that high. But upon checking with that (unnamed, in case I forgot to mention the point) customer, I don’t detect a whole lot of enthusiasm about Greenplum.
I’ve talked with a whole lot of vendors recently, some here at TDWI, as well as users, fellow analysts, and so on. Repeated themes include: Read more
|Categories: Analytic technologies, Application areas, Data mart outsourcing, Data warehousing, eBay, Microsoft and SQL*Server, MySQL, Oracle, Teradata||Leave a Comment|
I chatted yesterday at TDWI with Yves de Montcheuil of Talend, as a follow-up to some chats at Teradata Partners in October. This time around I got more metrics, including:
- Talend revenue grew 6-fold in 2008.
- Talend revenue is expected to grow 3-fold in 2009.
- Talend had >400 paying customers at the end of 2008.
- Talend estimates it has >200,000 active users. This is based on who gets automated updates, looks at documentation, etc.
- ~1/3 of Talend’s revenue is from large customers. 2/3 is from the mid-market.
- Talend has had ~700,000 downloads of its core product, and >3.3 million downloads in all (including documentation, upgrades, etc.)
It seems that Talend’s revenue was somewhat shy of $10 million in 2008.
Specific large paying customers Yves mentioned include: Read more
|Categories: Analytic technologies, Data integration and middleware, EAI, EII, ETL, ELT, ETLT, eBay, Market share and customer counts, Specific users, Talend||5 Comments|
The first time I ever heard from Oliver Ratzesberger of eBay, the subject line of his email mentioned MapReduce. That was early this year. Subsequently, however, eBay seems to have become a MapReduce non-fan. The reason is simple: eBay’s parallel efficiency tests show that MapReduce leaves most processors idle most of the time. The specific figure they mentioned was parallel efficiency of 18%.
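For readers unfamiliar with the term, here is a minimal sketch of the textbook definition of parallel efficiency: speedup over a single worker, divided by the number of workers. The post does not say how eBay measured its 18% figure, so the definition used and the example timings below are assumptions for illustration only.

```python
def parallel_efficiency(serial_secs, parallel_secs, workers):
    """Textbook definition: (serial time / parallel time) / workers."""
    speedup = serial_secs / parallel_secs
    return speedup / workers

def effective_workers(efficiency, workers):
    """How many fully busy workers the cluster is equivalent to."""
    return efficiency * workers

# Hypothetical timings chosen to yield 18%: a job that would take
# 18,000 secs on one worker finishes in 100 secs on 1,000 workers.
eff = parallel_efficiency(18_000, 100, 1_000)
print(f"efficiency: {eff:.0%}")
print(f"equivalent busy workers: {effective_workers(eff, 1_000):.0f}")
```

On that reading, an 18% figure would mean that a 1,000-worker cluster does the work of roughly 180 fully utilized workers.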
As previously hinted, Teradata has now announced 4 of the 5 members of its “Petabyte Power Players” club. These are enterprises with 1+ petabyte of data on Teradata equipment. As is commonly the case when Teradata discusses such figures, there’s some confusion as to how they’re actually counting. But as best I can tell, Teradata is counting: Read more
|Categories: Data warehousing, eBay, Market share and customer counts, Petabyte-scale data management, Specific users, Teradata||11 Comments|