Discussion of Yahoo’s use of database and analytic technology. Related subjects include:
- The use of analytic technologies to study web and network event data
- (in Text Technologies) Analysis of Yahoo’s efforts as a provider of search and other online services
My clients at Cloudera have been around for a while, in effect positioned as “the Hadoop company.” Their business, in a nutshell, consists of:
- Packaging up a Cloudera distribution of Apache Hadoop. This distribution doesn’t have proprietary code; it’s just packaged by Cloudera from Apache projects (with a decent minority of the code happening to have been contributed by Cloudera engineers).
- Paid subscription support for Apache Hadoop and, in connection with that …
- … proprietary software that all support customers automatically get. There are two points to this proprietary software:
- It adds value for the customer.
- It makes Cloudera’s support job easier.
- Professional services around Hadoop.
- Training and conferences around Hadoop, which probably don’t generate all that much money, but are great marketing in terms of visibility, thought leadership, and lead generation.
Hortonworks spun out of Yahoo last week, with parts of the Cloudera business model, namely Hadoop support, training, and I guess conferences. Hortonworks emphatically rules out professional services, and says that it will contribute all code back to Apache Hadoop. Hortonworks does grudgingly admit that it might get into the proprietary software business at some point — but evidently hopes that day will never actually come.
I recently learned that there are 7 Vertica clusters with a petabyte (or more) each of user data. So I asked around about other petabyte-scale clusters. It turns out that there are several dozen such clusters (at least) running Hadoop.
Cloudera can identify 22 CDH (Cloudera Distribution [of] Hadoop) clusters holding one petabyte or more of user data each, at 16 different organizations. This does not count Facebook or Yahoo, who are huge Hadoop users but not, I gather, running CDH. Meanwhile, Eric Baldeschwieler of Hortonworks tells me that Yahoo’s latest stated figures are:
- 42,000 Hadoop nodes …
- … holding 180-200 petabytes of data.
|Categories: Cloudera, Facebook, Hadoop, Investment research and trading, Log analysis, MapReduce, Market share and customer counts, Petabyte-scale data management, Scientific research, Web analytics, Yahoo||11 Comments|
After suggesting that there’s little point to Hadoop appliances, it occurred to me to look into what kinds of hardware actually are used with Hadoop. So far as I can tell:
- Hadoop nodes today tend to run on fairly standard boxes.
- Hadoop nodes in the past have tended to run on boxes that were light with respect to RAM.
- The number of spindles per core on Hadoop node boxes is going up even as disks get bigger.
My old client Mark Tsimelzon moved over to Yahoo after Coral8 was acquired, and I caught up with him last month. He turns out to be running development for a significant portion of Yahoo’s Hadoop effort — everything other than HDFS (Hadoop Distributed File System). Yahoo evidently plans to, within a year or so, get Hadoop to the point that it is managing 10s of petabytes of data for Yahoo, with reasonable data warehousing functionality.
Highlights of our visit included:
- There are dozens of people at Yahoo doing Hadoop development that will wind up getting open sourced. (Full-time or close to it.) In particular, everything Mark’s team does goes to open source.
- Yahoo is moving as much of its analytics to Hadoop as possible. Much of this is being moved away from Oracle and from Yahoo’s own Everest.
- A column store is being put on top of HDFS, based on Yahoo technology. Columns will be striped across nodes. Perhaps that’s why the effort is called Project Zebra.
- Mark believes that in a year Hadoop will be much further along in meeting traditional data warehousing requirements, in areas such as:
- SLAs/high availability/other workload management
- Data retention policies
- Yahoo views the time-to-market benefits of Hadoop as being more important than TCO.
|Categories: Analytic technologies, Data warehousing, Hadoop, MapReduce, Open source, Oracle, Petabyte-scale data management, Web analytics, Yahoo||6 Comments|
In its recent quarterly conference call, Oracle said (as per the Seeking Alpha transcript):
AC Neilsen, for instance, we deployed a 45-terabyte data [mart], they called it; Adidas, 13 terabytes; Australian Bureau of Statistics, 250 terabytes; and of course, some of our high-end ones that you have probably heard of in the past, AT&T, 250 terabytes; Yahoo!, 700 terabytes — just gives you an idea of the size of the databases that are out there and how they are growing, and that’s driving the need for greater throughput.
I don’t know what’s being counted there, but I wouldn’t be surprised if those were legit user-data figures.
Some other notes:
- The Yahoo database is of course Yahoo’s first-generation data warehouse, which has been largely superseded by an internal system more than 10X that size. (Edit: Actually, Greg Rahn of Oracle says below that it’s a different database.)
- I’m keynoting the Netezza road show this month, and Nielsen is up there on stage touting Netezza. (Edit: Nielsen indeed does the overwhelming majority of its data warehousing on Netezza.)
- I’d be surprised if AT&T’s largest data warehouse were “only” 250 terabytes in size. (Edit: Actually, I am told the database in question is 310 TB of user data and growing. More later, hopefully.)
- Oracle didn’t exactly say that those were Exadata installations.
|Categories: Analytic technologies, Data warehousing, Exadata, Netezza, Oracle, Specific users, Telecommunications, Web analytics, Yahoo||10 Comments|
According to somebody (I forget who) who attended Yahoo’s SIGMOD presentation last week, the big Yahoo database is now up to 10 petabytes in size, in line with Yahoo’s predictions last year. Apparently, Yahoo also gave more details of how the technology works.
I few weeks ago, I posted about a conversation I had with Jeff Hammerbacher of Cloudera, in which he discussed a Hadoop-based effort at Facebook he previously directed. Subsequently, Ashish Thusoo and Joydeep Sarma of Facebook contacted me to expand upon and in a couple of instances correct what Jeff had said. They also filled me in on Hive, a data-manipulation add-on to Hadoop that they developed and subsequently open-sourced.
Updating the metrics in my Cloudera post,
- Facebook has 400 terabytes of disk managed by Hadoop/Hive, with a slightly better than 6:1 overall compression ratio. So the 2 1/2 petabytes figure for user data is reasonable.
- Facebook’s Hadoop/Hive system ingests 15 terabytes of new data per day now, not 10.
- Hadoop/Hive cycle times aren’t as fast as I thought I heard from Jeff. Ad targeting queries are the most frequent, and they’re run hourly. Dashboards are repopulated daily.
Nothing else in my Cloudera post was called out as being wrong.
In a new-to-me metric, Facebook has 610 Hadoop nodes, running in a single cluster, due to be increased to 1000 soon. Facebook thinks this is the second-largest* Hadoop installation, or else close to it. What’s more, Facebook believes it is unusual in spreading all its apps across a single huge cluster, rather than doing different kinds of work on different, smaller sub-clusters. Read more
|Categories: Data warehousing, EAI, EII, ETL, ELT, ETLT, Facebook, Hadoop, MapReduce, Parallelization, Petabyte-scale data management, Specific users, Web analytics, Yahoo||47 Comments|
Googling around, I came across an Oracle presentation – given some time this year – that lists some of Oracle’s largest data warehouses. 10 databases total are listed with >16 TB, which is fairly consistent with Larry Ellison’s confession during the Exadata announcement that Oracle has trouble over 10 TB (which is something I’ve gotten a lot of flack from a few Oracle partisans for pointing out … ).
However, what’s being measured is probably not the same in all cases. For example, I think the Amazon 70 TB figure is obviously for spinning disk (elsewhere in the presentation it’s stated that Amazon has 71 TB of disk). But the 16 TB British Telecom figure probably is user data — indeed, it’s the same figure Computergram cited for BT user data way back in 2001.
The list is: Read more
Information Week has an article with details on what sounds like Yahoo’s core web analytics database. Highlights include:
- The Yahoo web analytics database is over 1 petabyte. They claim it will be in the 10s of petabytes by 2009.
- The Yahoo web analytics database is based on PostgreSQL. So much for MySQL fanboys’ claims of Yahoo validation for their beloved toy … uh, let me rephrase that. The highly-regarded MySQL, although doing a great job for some demanding and impressive applications at Yahoo, evidently wasn’t selected for this one in particular. OK. That’s much better now.
- But the Yahoo web analytics database doesn’t actually use PostgreSQL’s storage engine. Rather, Yahoo wrote something custom and columnar.
- Yahoo is processing 24 billion “events” per day. The article doesn’t clarify whether these are sent straight to the analytics store, or whether there’s an intermediate storage engine. Most likely the system fills blocks in RAM and then just appends them to the single persistent store. If commodity boxes occasionally crash and lose a few megs of data — well, in this application, that’s not a big deal at all.
- Yahoo thinks commercial column stores aren’t ready yet for more than 100 terabytes of data.
- Yahoo says it got great performance advantages from a custom system by optimizing for its specific application. I don’t know exactly what that would be, but I do know that database architectures for high-volume web analytics are still in pretty bad shape. In particular, there’s no good way yet to analyze the specific, variable-length paths users take through websites.
|Categories: Analytic technologies, Columnar database management, Data warehousing, MySQL, Petabyte-scale data management, PostgreSQL, Specific users, Theory and architecture, Yahoo||13 Comments|