May 11, 2009

Facebook, Hadoop, and Hive

A few weeks ago, I posted about a conversation I had with Jeff Hammerbacher of Cloudera, in which he discussed a Hadoop-based effort at Facebook that he previously directed. Subsequently, Ashish Thusoo and Joydeep Sarma of Facebook contacted me to expand upon, and in a couple of instances correct, what Jeff had said. They also filled me in on Hive, a data-manipulation add-on to Hadoop that they developed and subsequently open-sourced.

Updating the metrics in my Cloudera post:

Nothing else in my Cloudera post was called out as being wrong.

In a new-to-me metric, Facebook has 610 Hadoop nodes, running in a single cluster, due to be increased to 1000 soon. Facebook thinks this is the second-largest* Hadoop installation, or else close to it. What’s more, Facebook believes it is unusual in spreading all its apps across a single huge cluster, rather than doing different kinds of work on different, smaller sub-clusters.

*Apparently, Yahoo is at 2000 nodes (and headed for 4000), 1000 or so of which are operated as a single cluster for a single app.

Facebook decided in 2007 to move what was then a 15 terabyte big-DBMS-vendor data warehouse to Hadoop — augmented by Hive — rather than to an MPP data warehouse DBMS. Major drivers of the choice included:

One option Facebook didn’t seriously consider was sticking with the incumbent, which Facebook folks regarded as “horrible” and a “lost cause.” The daily pipeline took more than 24 hours to run. Although aware that its big-DBMS-vendor warehouse could probably be tuned much better, Facebook didn’t see that as a path to growing its warehouse more than 100-fold. (But based on my discussion with Cloudera, I gather that vendor’s DBMS is indeed used to run some reporting today.)

Reliability of Facebook’s Hadoop/Hive system seems to be so-so. It’s designed for a few nodes at a time to fail; that’s no biggie. There’s a head node that’s a single point of failure; while there’s a backup node, I gather failover takes 15 minutes or so, a figure the Facebook guys think they could reduce substantially if they put their minds to it. But users submitting long-running queries don’t seem to mind delays of up to an hour, as long as they don’t have to resubmit their queries. Keeping ETL up is a higher priority than keeping query execution going. Data loss would indeed be intolerable, but at that level Hadoop/Hive seems to be quite trustworthy.
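For context on why losing a few nodes at a time is “no biggie”: HDFS keeps multiple copies of every data block, and the replication factor is an ordinary configuration setting. A minimal sketch, assuming a Hadoop-0.19/0.20-era hdfs-site.xml; the value shown is Hadoop’s default, not necessarily what Facebook runs:

    <!-- hdfs-site.xml (sketch): with three replicas per block, losing a
         couple of nodes costs some throughput, not data -->
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>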

There also are occasional longer partial(?) outages, when an upgrade introduces a bug or something, but those don’t seem to be a major concern.

Facebook’s variability in node hardware raises an obvious question — how does Hadoop deal with heterogeneous hardware among its nodes? Apparently a fair scheduling capability has been built for Hadoop, with Facebook as the first major user and Yahoo apparently moving in that direction as well. As for inputs to the scheduler (or any more primitive workload allocator) — well, that depends on the kind of heterogeneity.
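For readers curious what “fair scheduling” amounts to in practice: the Hadoop fair scheduler is enabled by pointing the JobTracker at the scheduler class and describing named pools in an allocations file. A minimal sketch, assuming the contrib fair scheduler of that era; the pool names and numbers are invented for illustration and are not Facebook’s actual settings:

    <!-- mapred-site.xml (sketch): swap in the fair scheduler -->
    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.FairScheduler</value>
    </property>
    <property>
      <name>mapred.fairscheduler.allocation.file</name>
      <value>/etc/hadoop/fair-scheduler.xml</value>
    </property>

    <!-- fair-scheduler.xml (sketch): one pool per kind of workload, so
         ad hoc queries can't starve the log-loading pipeline -->
    <allocations>
      <pool name="etl">
        <minMaps>200</minMaps>
        <minReduces>100</minReduces>
        <weight>2.0</weight>
      </pool>
      <pool name="adhoc">
        <minMaps>50</minMaps>
        <minReduces>25</minReduces>
        <weight>1.0</weight>
      </pool>
    </allocations>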

Further notes on Hive

Without Hive, some basic Hadoop data manipulations can be a pain in the butt. A GROUP BY or the equivalent could take >100 lines of Java or Python code, and unless the person writing it knew something about database technology, it could use some pretty sub-optimal algorithms even then. Enter Hive.
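To make the contrast concrete, here is what that >100-line job collapses to in Hive’s SQL dialect. This is only a sketch against a hypothetical page_views table; the table and column names are invented for illustration:

    -- Daily page views per URL; in raw MapReduce this would be a custom
    -- mapper, a shuffle keyed on url, and a summing reducer.
    SELECT url, COUNT(1) AS views
    FROM page_views
    WHERE dt = '2009-05-11'
    GROUP BY url;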

Hive sets out to fix this problem. Originally developed at Facebook (in Java, like Hadoop is), Hive was open-sourced last summer, by which time its SQL interface was in place, and now has 6 main developers. The essence of Hive seems to be:

The SQL implemented so far seems, unsurprisingly, to be what is most needed to analyze Facebook’s log files. I.e., it’s some basic stuff, plus some timestamp functionality. There also is an extensibility framework, and some ELT functionality.
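Two short sketches of the kinds of statements described above. The timestamp built-ins (from_unixtime, to_date) and the TRANSFORM ... USING extensibility hook are documented Hive features; the table, column, and script names here are hypothetical:

    -- Basic aggregation plus timestamp functionality: roll up a
    -- hypothetical event_log table by calendar day.
    SELECT to_date(from_unixtime(event_time)) AS event_day, COUNT(1) AS events
    FROM event_log
    GROUP BY to_date(from_unixtime(event_time));

    -- The extensibility side: stream rows through a user-supplied script
    -- (a hypothetical Python normalizer) from inside the query.
    ADD FILE url_normalizer.py;
    SELECT TRANSFORM (url, referrer)
           USING 'python url_normalizer.py'
           AS (normalized_url, referrer)
    FROM event_log;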

Known users of Hive include Facebook (definitely in production) and hi5 (apparently in production as well). Also, there’s a Hive code committer from Last.fm.

Other links about huge data warehouses:

Comments

48 Responses to “Facebook, Hadoop, and Hive”

  1. Jeff on May 11th, 2009 5:43 am

    Hey Curt,

    I’m glad you got a chance to speak with Ashish and Joy! They, along with the rest of the Facebook Data team, deserve the credit for designing, building, and operating Facebook’s innovative data infrastructure.

    I just wanted to add that the fair scheduler mentioned in your blog post is largely the work of Matei Zaharia, a PhD student at Berkeley’s RAD Lab: http://www.cs.berkeley.edu/~matei/.

    Matei’s work is a great example of the power of open source and the Hadoop community. He’s been able to continue his work on the fair scheduler even after he returned to school, and as mentioned in the article, it seems likely that his work is going to be moving into Yahoo!’s clusters as well.

    Later,
    Jeff

  2. dave on May 11th, 2009 11:07 am

    any idea if there are use-case profiles within government monster stores like lawrence livermore, nasa, argonnes projects or elsewhere?

  3. Josh Patterson on May 11th, 2009 12:04 pm

@dave: There is the installation at the University of Nebraska at Lincoln that processes high energy physics data (writeup at cloudera.com: http://www.cloudera.com/blog/2009/05/01/high-energy-hadoop/ ). I don’t know what your definition of monster is, but it appears they house around 200TB per cluster.

    Also, I run the hadoop cluster for storage of smartgrid data in service of efforts for the NERC at TVA (data off the US powergrid for transmission and generation). We’re in prototype / build-out mode, but we’re heading towards a > 100TB hadoop system in the near future.

Hadoop is a tremendous product and we believe that its price, ecosystem, and general flexibility are great for big data.

  4. Ivan Novick on May 11th, 2009 1:26 pm

    One thing that I am suspicious about is the resource utilization of a 610 node hadoop cluster running hourly jobs.

    I would make a wild guess that facebook is getting less than 10% average CPU utilization on that cluster over a 24 hour period … please correct me if I am wrong.

    Also what data throughput can they achieve per node.

In other words, it’s great to have a 610 node cluster, but if you can do a similar job with 100 nodes that would be better 🙂

  5. Josh Patterson on May 11th, 2009 2:58 pm

@Ivan: Typically Hadoop-type jobs are disk-bound as opposed to CPU-bound, so they want more disk access “surface area,” so to speak. More nodes = more places to read data into mappers in a massively parallel sense.

I keep seeing articles comparing RDBMSs to Hadoop (and MapReduce), and it’s really apples to oranges. You just don’t generally apply each of those to the same problem, although there is some cross-over.

  6. Ashish Thusoo on May 11th, 2009 3:36 pm

    @Curt
Just wanted to add that even though there is a single point of failure, software bugs have not been a reliability issue and the DFS NameNode has been very stable. The JobTracker crashes that we have seen are due to errant jobs – job isolation is not yet that great in Hadoop, and a bad query from a user can bring down the tracker (though the recovery time for the tracker is literally a few minutes). There is some good work happening in the community, though, to address those issues.

    @Ivan
Though the average CPU usage is low, there are many times during the day when we are maxed out on CPU, so we do end up using the entire compute capacity of the cluster during peak hours.

  7. UnHolyGuy on May 11th, 2009 4:15 pm

    2 recs on Dice currently at Ft Meade requiring Hadoop and security clearances, one at LLNL.

  8. Owen O'Malley on May 11th, 2009 10:56 pm

    Actually, Yahoo has 24,000 Hadoop computers at this point in clusters of up to 3000 computers.

  9. Owen O'Malley on May 11th, 2009 10:57 pm

    I also believe that Quantcast is the second largest installation. Four months ago they were at 1000 nodes…

  10. Mark Callaghan on May 12th, 2009 9:14 am

    For the Hive people — Can Hive survive the failure of query processing nodes during a long running query? Is this handled by restarting the query, or just the step running on the failed nodes?

For Curt — How do the commercial MPP vendors handle this? I will guess that AsterData and NeoView don’t have to restart the query, but others may.

  11. Curt Monash on May 12th, 2009 10:32 am

    Mark,

    A core idea of MapReduce is that if a node fails, you automagically send its work to another node or nodes and keep going. How much of a slowdown this amounts to would, I guess, depend on the nature and design of the query or task being done, and how you manage your spare copy of the data.

    Offhand, I can’t think of a reason why the answer would be any different for Aster or Greenplum (did you mean Greenplum when you wrote Neoview?) than for Hadoop, when doing MapReduce.

    Or if your question wasn’t about MapReduce, but rather what happens to a query when a node fails in an MPP SQL DBMS — hmmm. I imagine the step in the query plan that failed would have to be redone, on the mirrored copy of the data. The natural follow-up questions are to ask how badly performance degrades in case of a node failure, and how long the bad degradation lasts, and how much degradation remains after data is redistributed.

The answers to that turn out to vary a lot on a case-by-case basis. For example, on Vertica the mirrored copy of a column is likely to be sorted differently than the copy you were using for the query, so the performance hit varies a lot on a query-by-query basis. Row-based systems can usually be tuned for your choice of how many different nodes a single node’s data is mirrored across. And they also aren’t uniform in having primary data on the fast outer tracks and mirrored data on the slow inner tracks. (I forget at the moment why anybody would ever do it any other way, but at least one vendor told me there was a reason.)

    CAM

  12. links for 2009-05-12 | IT Management and Cloud Blog on May 12th, 2009 10:35 am

    […] Facebook, Hadoop, and Hive | DBMS2 — DataBase Management System Services Ashish Thusoo and Joydeep Sarma of Facebook contacted me to expand upon and in a couple of instances correct what Jeff had said. They also filled me in on Hive, a data-manipulation add-on to Hadoop that they developed and […]

  13. Joydeep Sen Sarma on May 12th, 2009 11:49 am

    thanks for the post Curt!

@Ivan – regarding utilization – Facebook’s average CPU utilization for the cluster is more than 50%. This is partly the reason why the cluster is being expanded. Data throughput per node depends a lot on the number of concurrent tasks, whether the data is local or not, and the nature of the computation being performed. Isolated controlled runs look very different from real-life environments with heterogeneous task mixes. At this point Yahoo’s TeraSort runs on Hadoop are the best reference point on performance.

    @Mark – at this point Hive queries don’t restart from the last completed map-reduce step on failure. I have opened a Jira: https://issues.apache.org/jira/browse/HIVE-480 to track this.

    To add to the post:
    a) Open source really rocks. In addition to the work on scheduling by Matei – there have been many instances where we have been able to modify the core hadoop code base for our environment:
    – the ongoing work by Dhruba to support archival storage (with limited bandwidth) transparently.
    – critical patches to not allow runaway memory consumption by job-tracker
    – being able to try and fix problems with the bzip2 codec in Hadoop

    to just name a few.

b) regarding availability. One of the things we love about Hadoop is how it keeps chugging along even when half the system is down. As I mentioned to Curt – ingesting new log data rapidly is critical for Facebook – and in that sense it’s great that the Hadoop file system can keep accepting new data at a time when it may otherwise be incomplete/corrupt because of missing nodes.

c) In addition to the SQL compiler and execution engine – Hive also provides a metastore that tracks the various data units (tables/partitions, their schemas/formats, and other properties), and this is invaluable even for pure map-reduce programs in terms of data addressability.

    d) One of the more novel aspects about the SQL that Hive provides is the capability to inject map and reduce transforms (via scripts) in the query (in addition to standard SQL clauses). This is one of the reasons why it has become so popular amongst data mining engineers at Facebook (since it gives them the ability to write custom logic in the language of their choice).

Both Hadoop and Hive keep improving rapidly in the areas of reliability, scalability and performance (yay open source community!) and this is one of the other things that makes Hadoop/Hive relatively future-proof (and helps companies like Facebook decide to use them).

    Finally – the successful deployment of Hive and Hadoop at Facebook would not have been possible without all the wonderful engineers there – who have guided Hive feature development as well as incurred the pain of living on the bleeding edge.

  14. Curt Monash on May 12th, 2009 12:17 pm

    Joydeep,

    Thanks for your help, including all the additional info you just posted!

    I must admit to confusion on the re-starting question. Are you saying that if a node goes down, the whole task has to restart? If so, is this common in MapReduce tasks, or is the issue Hive-specific?

    Thanks!

    CAM

  15. Joydeep Sen Sarma on May 12th, 2009 12:51 pm

    @Curt – for a single map-reduce job – Hadoop takes care of retrying individual tasks (in case they fail). This covers the case of node failures for a single map-reduce job. However – the entire cluster may go down (since as we noted – job tracker and namenodes are single points of failure and today there’s no transparent failover for these components). All currently running map-reduce jobs will fail in that case.

    A single Hive query may require multiple map-reduce jobs. Today – if one of the map-reduce jobs fails (for instance if the cluster goes down – or there are connectivity issues with the cluster) – then the entire query aborts. This is bad because some of the expensive map-reduce jobs may already have been completed and we should only re-submit the ones that have not been (hence the JIRA). Obviously – this doesn’t happen that often.

Having said that – there’s also some ongoing work in the Hadoop codebase on making map-reduce jobs survive cluster failures that may be relevant (and that I am not fully abreast of).

  16. Mark Callaghan on May 12th, 2009 1:22 pm

    Curt – My question is about how commercial MPP RDBMS vendors recover from single or a small number of node failures during a long running SQL query. Do any of them save enough state to avoid starting the query over?

  17. Curt Monash on May 12th, 2009 1:24 pm

    Mark,

    Let’s find out!

  18. How much state is saved when an MPP DBMS node fails? | DBMS2 -- DataBase Management System Services on May 12th, 2009 1:27 pm

    […] Callaghan raised an interesting question in the comment thread to my recent Facebook/Hadoop/Hive post: My question is about how commercial MPP RDBMS vendors recover from single or a small number of […]

  19. Curt Monash reports on Hadoop/Hive @ Facebook « Joydeep Sen Sarma’s blog on May 12th, 2009 1:57 pm

    […] Monash posted a blog post on our (myself and Ashish Thusoo’s) conversation with him regarding Hadoop and Hive and their […]

  20. Steve Wooledge on May 12th, 2009 2:32 pm

    @Mark

    The Aster nCluster MPP RDBMS is founded on recovery-oriented computing principles and the user will not have to restart any read-only queries in the event of a failure. More on our blog:
    http://tinyurl.com/6fcuj4 or datasheet: http://www.asterdata.com/resources/downloads/datasheets/Aster_nCluster_Availability.pdf

  21. Steve Wooledge on May 12th, 2009 2:41 pm

    > Few of the new low-cost MPP DBMS vendors have production systems even today of >100 nodes.

    MySpace has >200 nodes in production with Aster nCluster. Video about it here: http://tinyurl.com/oz39bf

  22. Neil Conway on May 12th, 2009 3:26 pm

Part of the reason Hadoop is better at fine-grained failure recovery than a typical parallel DBMS is that Hadoop materializes the results of each operator (e.g. map output is materialized on disk before being sent to the reducers). A parallel DB using pipelined query execution would likely get better performance, at the cost of making fine-grained error recovery more complicated.

  23. Ashish Thusoo on May 12th, 2009 3:46 pm

    @Steve

I think the basic principles described in the Aster datasheet are similar to Hadoop’s. Have a replica (or two, in the case of Hadoop, in order to withstand rack failures) and then fail over the query fragment to the replica if anything happens to one of the compute nodes. Hadoop does that very well, as Joydeep mentioned in his comments, and in fact, the way Hive is structured, Hadoop and Hive are able to withstand such failures for DML as well and not just read-only queries. Steve, is that true for Aster as well?

Node failures and transient failures are not a problem in Hadoop/Hive, and Hadoop solves that systemically. Where things become a bit weak are the central nodes that do the job submission (JobTracker) or the central node that maintains file system metadata (NameNode). By having a backup of these nodes, it is entirely possible to protect the cluster against hardware failures of these nodes as well.

The lack of query isolation in Hadoop/Hive, however, does mean that a bad query (e.g. a query whose reduce tasks each use, say, 16GB of memory on nodes that have 8GB, running across the entire cluster) can make the compute nodes and the file system nodes swap and thus bring down the cluster. Most of the availability problems that we have seen have been due to really bad queries having unintended side effects – especially on the memory usage side. We have not had problems withstanding hardware failures or compute node failures.

  24. BotchagalupeMarks for May 11th - 13:34 | IT Management and Cloud Blog on May 12th, 2009 4:10 pm

    […] Facebook, Hadoop, and Hive | DBMS2 — DataBase Management System Services – Ashish Thusoo and Joydeep Sarma of Facebook contacted me to expand upon and in a couple of instances correct what Jeff had said. They also filled me in on Hive, a data-manipulation add-on to Hadoop that they developed and […]

  25. Facebook’s experiences with compression | DBMS2 -- DataBase Management System Services on May 14th, 2009 3:00 am

    […] little topic didn’t make it into my long post on Facebook’s Hadoop/Hive-based data warehouse: Compression. The story seems to […]

  26. Cloudera Hadoop & Big Data Blog » Blog Archive » 5 Common Questions About Hadoop on May 14th, 2009 10:10 am

    […] There’s been a lot of buzz about Hadoop lately. Just the other day, some of our friends at Yahoo! reclaimed the terasort record from Google using Hadoop, and the folks at Facebook let on that they ingest 15 terabytes a day into their 2.5 petabyte Hadoop-powered data warehouse. […]

  27. Ben Werther on May 14th, 2009 1:21 pm

    There’s a very interesting continuum of approaches here – as best described in a recent blog post by Joe Hellerstein.
    http://databeta.wordpress.com/2009/05/14/bigdata-node-density/

    A few choice extracts:

    – Hadoop (as per the Google MapReduce paper) is wildly pessimistic, checkpointing the output of every single Map or Reduce stage to disks, before reading it right back in. (I describe this to my undergrads as the “regurgitation approach” to fault tolerance.) By contrast, classic MPP database approaches (like Graefe’s famous Exchange operator) are wildly optimistic and pipeline everything, requiring restarts of deep dataflow pipelines in the case of even a single fault.

    – The Google MapReduce pessimistic fault model requires way more machines, but the more machines you have, the more likely you are to see a fault, which will make you pessimistic….

    – It sounds wise to only play the Google regurgitation game when the cost of staging to disk is worth the expected benefit of enabling restart. Can’t this be predicted reasonably well, so that the choice of pipelining or snapshotting is done judiciously?

That last point hits the nail on the head. If a query would run for 1 minute without ‘regurgitation’ and 40 minutes with it (or require 40x the hardware), you’d probably be better off just running it straight and allowing the query to automatically restart if it fails. For longer-running queries, a very selective amount of mid-query checkpointing (i.e. not full regurgitation at every step) could start to make sense, but finding the right balance is really an optimization problem based on the expected runtime and characteristics of the query. If only we had a smart query optimizer that we could use to make those kinds of decisions… 🙂

  28. Required Reading | Sam Hamilton on May 22nd, 2009 11:39 am

    […] Facebook, Hadoop, and Hive […]

  29. Facebook, Hadoop and Hive | DeveloperZen on June 16th, 2009 12:38 am

    […] Facebook, Hadoop, and Hive on DBMS2 by Curt Monash discusses Facebook’s architecture and motivation. Facebook decided in 2007 to move what was then a 15 terabyte big-DBMS-vendor data warehouse to Hadoop — augmented by Hive — rather than to an MPP data warehouse DBMS… […]

  30. NoSQL? | DBMS2 -- DataBase Management System Services on July 1st, 2009 3:35 am

    […] are a better fit for cloud computing and extreme scale-out on failure-prone commodity hardware. Facebook made that case to me. However, I have trouble thinking of very many enterprise scenarios where it […]

  31. Big Data and Real-time Web: A Confluence of Streams on July 12th, 2009 9:47 am

    […] as highly compressed columnar storage and smart caching schemes), the basic approach has been to rely on increasing amounts of hardware to solve ever-bigger problems. Such systems have not addressed the fundamental mismatch between batch-oriented processing and the […]

  32. Why Big Data & Real-Time Web Are Made For Each Other | Tech Newz on July 12th, 2009 1:48 pm

    […] as highly compressed columnar storage and smart caching schemes), the basic approach has been to rely on increasing amounts of hardware to solve ever-bigger problems. Such systems have not addressed the fundamental mismatch between batch-oriented processing and the […]

  33. Why Big Data & Real-Time Web Are Made For Each Other | ScooperNews.com on July 12th, 2009 6:31 pm

    […] as highly compressed columnar storage and smart caching schemes), the basic approach has been to rely on increasing amounts of hardware to solve ever-bigger problems. Such systems have not addressed the fundamental mismatch between batch-oriented processing and the […]

  34. Announcing release of HadoopDB (longer version) | Bookmarks on July 23rd, 2009 4:35 pm

    […] their analytical database system and claiming data warehouses of size more than a petabyte (see the end of this write-up for some links to large data warehouses).The second trend is what I talked about in my last blog […]

  35. Confluence: Edmunds Central on September 10th, 2009 3:26 pm

    Hadoop POC Environment…

    Overview The Software Development team is currently researching a technology solution for addressing our ingestion and processing of weblog data that is reaching the limits of the current system based upon Informatica and Oracle RDBMS…….

  36. Cloudera presents the MapReduce bull case | DBMS2 -- DataBase Management System Services on September 12th, 2009 7:31 pm

    […] 10 terabytes/day of data ingested via Hadoop (Edit: Some of these metrics have been updated in a subsequent post about Facebook.) […]

  37. Fault-tolerant queries | DBMS2 -- DataBase Management System Services on September 13th, 2009 12:07 pm

    […] we discussed this subject a few months ago in a couple of comment threads, it seemed to be the case […]

  38. How 30+ enterprises are using Hadoop | DBMS2 -- DataBase Management System Services on October 16th, 2009 7:36 am

    […] to a relational DBMS; many others just leave it in HDFS (Hadoop Distributed File System), e.g. with Hive as the query language, or in exactly one case […]

  39. Yahoo wants to do decapetabyte-scale data warehousing in Hadoop | DBMS2 -- DataBase Management System Services on December 12th, 2009 1:59 am

    […] Language) for Hadoop, which is however getting a SQL interface. And we talked about Pig vs. Hive. But I recently heard a rumor all that is in flux, so I won’t write it up […]

  40. Enterprise Data Base Building In The Social Era | Digital Savant on December 15th, 2009 7:48 pm

    […] utilizes the open source Hadoop/Hive system, which “ingests 15 terabytes of new data per day”. The amount of data coming in, telling details like favorite music, locations, books, thoughts and […]

  41. Boston Big Data Summit keynote outline | DBMS2 -- DataBase Management System Services on January 28th, 2010 7:24 am

    […] Big ones — retail-oriented ones (eBay, Amazon) partially excepted — rolled their own technology stacks […]

  42. Parag Arora on February 22nd, 2010 8:55 am

How is load balancing across the different nodes assured? Does Hadoop handle it internally, or do we need to define a distribution mechanism?

  43. Curt Monash on February 22nd, 2010 4:56 pm

    Hadoop has some level of load balancing. That’s essential to any implementation of MapReduce. If it’s not good enough for you — well, Hadoop is open source, so dive in and change it.

  44. Rico on August 17th, 2011 9:57 am

Sorry, hit enter too soon. Has anyone tried Datameer’s Excel frontend to Hadoop? It makes life easier for IT and the end users. Free copy on the website. Give it a chance. You won’t be disappointed.

  45. NoSQL bei Windows Server, Azure & SQL Server mit Apache Hadoop | Code-Inside Blog on October 12th, 2011 7:04 pm

[…] presumably has the largest Hadoop cluster – this blog post mentions a few numbers and facts. Impressive in any […]

  46. NoSQL for Windows Server, Azure & SQL Server with Apache Hadoop | Code-Inside Blog International on November 19th, 2011 2:31 pm

[…] Facebook has the main Hadoop Cluster – in this blog post you will find some numbers and facts. […]

  47. What does database facebook, google, yahoo, twitter, linkin use | Kevin on December 9th, 2013 11:16 pm
  48. What databases do the World Wide Web’s biggest sites run on? [closed] | ASK AND ANSWER on January 13th, 2016 4:10 pm
