I few weeks ago, I posted about a conversation I had with Jeff Hammerbacher of Cloudera, in which he discussed a Hadoop-based effort at Facebook he previously directed. Subsequently, Ashish Thusoo and Joydeep Sarma of Facebook contacted me to expand upon and in a couple of instances correct what Jeff had said. They also filled me in on Hive, a data-manipulation add-on to Hadoop that they developed and subsequently open-sourced.
Updating the metrics in my Cloudera post,
- Facebook has 400 terabytes of disk managed by Hadoop/Hive, with a slightly better than 6:1 overall compression ratio. So the 2 1/2 petabytes figure for user data is reasonable.
- Facebook’s Hadoop/Hive system ingests 15 terabytes of new data per day now, not 10.
- Hadoop/Hive cycle times aren’t as fast as I thought I heard from Jeff. Ad targeting queries are the most frequent, and they’re run hourly. Dashboards are repopulated daily.
Nothing else in my Cloudera post was called out as being wrong.
In a new-to-me metric, Facebook has 610 Hadoop nodes, running in a single cluster, due to be increased to 1000 soon. Facebook thinks this is the second-largest* Hadoop installation, or else close to it. What’s more, Facebook believes it is unusual in spreading all its apps across a single huge cluster, rather than doing different kinds of work on different, smaller sub-clusters.
*Apparently, Yahoo is at 2000 nodes (and headed for 4000), 1000 or so of which are operated as a single cluster for a single app.
Facebook decided in 2007 to move what was then a 15 terabyte big-DBMS-vendor data warehouse to Hadoop — augmented by Hive — rather than to an MPP data warehouse DBMS. Major drivers of the choice included:
- License/maintenance costs. Free is a good price.
- Open source flexibility. Facebook is one of the few users I’ve ever spoken with that actually cares about modifying open source code.
- Ability to run on cheap hardware. Facebook runs real-time MySQL instances on boxes that cost $10K or so, and would expect to pay at least as much for an MPP DBMS node. But Hadoop nodes run on boxes that cost no more than $4K, and sometimes (depending e.g. on whether they have any disk at all) as little as $2K. These are “true” commodity boxes; they don’t even use RAID.
- Ability to scale out to lots of nodes. Few of the new low-cost MPP DBMS vendors have production systems even today of >100 nodes. (Actually, I’m not certain that any except Netezza do, although Kognitio in a prior release of its technology once built a 900ish node production system.)
- Inherently better performance. Correctly or otherwise, the Facebook guys thought that Hadoop had performance advantages over DBMS, due to the lack of overhead associated with transactions and so on.
One option Facebook didn’t seriously consider was sticking with the incumbent, which Facebook folks regarded as “horrible” and a “lost cause.” The daily pipeline took more than 24 hours to process. Although aware that its big-DBMS-vendor warehouse could probably be tuned much better, Facebook didn’t see that as a path to growing its warehouse more than 100-fold. (But based on my discussion with Cloudera, I gather that vendor’s DBMS is indeed used to run some reporting today.)
Reliability of Facebook’s Hadoop/Hive system seems to be so-so. It’s designed for a few nodes at a time to fail; that’s no biggie. There’s a head node that’s a single-point of failure; while there’s a backup node, I gather failover takes 15 minutes or so, a figure the Facebook guys think they could reduce substantially if they put their minds to it. But users submitting long-running queries don’t seem to mind delays of up to an hour, as long as they don’t have to resubmit their queries. Keeping ETL up is a higher priority than keeping query execution going. Data loss would indeed be intolerable, but at that level Hadoop/Hive seems to be quite trustworthy.
There also are occasional longer partial(?) outages, when an upgrade introduces a bug or something, but those don’t seem to be a major concern.
Facebook’s variability in node hardware raises an obvious question — how does Hadoop deal with heterogeneous hardware among its nodes? Apparently a fair scheduling capability has been built for Hadoop, with Facebook as the first major user and Yahoo apparently moving in that direction as well. As for inputs to the scheduler (or any more primitive workload allocator) — well, that depends on the kind of heterogeneity.
- Disk heterogeneity — a distributed file system reports back about disk.
- CPU heterogeneity — different nodes can be configured to run different numbers of concurrent tasks each.
- RAM heterogeneity — Hadoop does not understand the memory requirements of each task, and does not do a good job of matching tasks to boxes accordingly. But the Hadoop community is working to fix this.
Further notes on Hive
Without Hive, some basic Hadoop data manipulations can be a pain in the butt. A GROUP BY or the equivalent could take >100 lines of Java or Python code, and unless the person writing it knew something about database technologically, it could use some pretty sub-optimal algorithms even then. Enter Hive.
Hive sets out to fix this problem. Originally developed at Facebook (in Java, like Hadoop is), Hive was open-sourced last summer, by which time its SQL interface was in place, and now has 6 main developers. The essence of Hive seems to be:
- An interface that implements a subset of SQL
- Compilation of that SQL into a MapReduce configuration file.
- An execution engine to run same.
The SQL implemented so far seems to, unsurprisingly be, what is most needed to analyze Facebook’s log files. I.e., it’s some basic stuff, plus some timestamp functionality. There also is an extensibility framework, and some ELT functionality.
Known users of Hive include Facebook (definitely in production) and hi5 (apparently in production as well). Also, there’s a Hive code committer from Last.fm.
Other links about huge data warehouses:
- eBay has a 6 1/2 petabyte database running on Greenplum and a 2 1/2 petabyte enterprise data warehouse running on Teradata.
- Wal-Mart, Bank of America, another financial services company, and Dell also have very large Teradata databases.
- Yahoo’s web/network events database, running on proprietary software, sounded about 1/6th the size of eBay’s Greenplum system when it was described about a year ago.
- Fox Interactive Media/MySpace has multi-hundred terabyte databases running on each of Greenplum and Aster Data nCluster.
- TEOCO has 100s of terabytes running on DATAllegro.
- To a probably lesser extent, the same is now also true of Dell.
- Vertica has a couple of unnamed customers with databases in the 200 terabyte range.