October 10, 2009

How 30+ enterprises are using Hadoop

MapReduce is definitely gaining traction, especially but by no means only in the form of Hadoop. In the aftermath of Hadoop World, Jeff Hammerbacher of Cloudera walked me quickly through 25 customers he pulled from Cloudera’s files. Facts and metrics ranged widely, of course:

Some are in heavy production with Hadoop, and closely engaged with Cloudera. Others are active Hadoop users but are very secretive. Yet others signed up for initial Hadoop training last week.
Some have Hadoop clusters in the thousands of nodes. Many have Hadoop clusters in the 50-100 node range. Others are just prototyping Hadoop use. And one seems to be “OEMing” a small Hadoop cluster in each piece of equipment sold.
Many export data from Hadoop to a relational DBMS; many others just leave it in HDFS (Hadoop Distributed File System), e.g. with Hive as the query language, or in exactly one case Jaql.
Some are household names, in web businesses or otherwise. Others seem to be pretty obscure.
Industries include financial services, telecom (Asia only, and quite new), bioinformatics (and other research), intelligence, and lots of web and/or advertising/media.
Application areas mentioned — and these overlap in some cases — include:
- Log and/or clickstream analysis of various kinds
- Marketing analytics
- Machine learning and/or sophisticated data mining
- Image processing
- Processing of XML messages
- Web crawling and/or text processing
- General archiving, including of relational/tabular data, e.g. for compliance

We went over this list so quickly that we didn’t go into much detail on any one user. But one example that stood out was of an ad serving firm that had an “aggregation pipeline” consisting of 70-80 MapReduce jobs.

I also talked yesterday again w/ Omer Trajman of Vertica, who surprised me by indicating a high single-digit number of Vertica’s customers were in production with Hadoop — i.e., over 10% of Vertica’s production customers. (Vertica recently made its 100th sale, and of course not all those buyers are in production yet.) Vertica/Hadoop usage seems to have started in Vertica’s financial services stronghold — specifically in financial trading — with web analytics and the like coming on afterwards. Based on current prototyping efforts, Omer expects bioinformatics to be the third production market for Vertica/Hadoop, with telecommunications coming in fourth.

Unsurprisingly, the general Vertica/Hadoop usage model seems to be:

Do something to the data in Hadoop
Dump it into Vertica to be queried

What I did find surprising is that the data often isn’t reduced by this analysis, but rather exploded in size. E.g., a complete store of mortgage trading data might be a few terabytes in size, but Hadoop-based post processing can increase that by 1 or 2 orders of magnitude. (Analogies to the importance and magnitude of “cooked” data in scientific data processing come to mind.)

And finally, I talked to Aster a few days ago about the usage of its nCluster/Hadoop connector. Aster characterized Aster/Hadoop users’ Hadoop usage as being of the batch/ETL variety, which is the classic use case one concedes to Hadoop even if one believes that MapReduce should commonly be done right in the DBMS.

Related link

An August, 2008 round-up of MapReduce applications.

Categories: Application areas, Aster Data, Cloudera, Data types, Data warehousing, Database diversity, EAI, EII, ETL, ELT, ETLT, Hadoop, Investment research and trading, Log analysis, MapReduce, Open source, Parallelization, Predictive modeling and advanced analytics, Scientific research, Structured documents, Telecommunications, Text, Vertica Systems, Web analytics

Subscribe to our complete feed!

Comments

9 Responses to “How 30+ enterprises are using Hadoop”

Vlad on October 11th, 2009 3:34 am

I have made some calculations based on the data publicly available on the Internet. The famous Yahoo Terasort record – sorting 1 TB of data (actually 10 billion 100 bytes record)on a Hadoop ~ 3400+ server cluster in 60 seconds. I will omit the calculation details but the average CPU , disk I/O and network I/O utilization during the run were:

1%, 5-6% and 30% respectively. These are not exact numbers of course, but the estimates based on sorting algorithm used, the cluster’s configuration, server CPUs power, max NIC throughput (1Gb) and 4 SATA disk array I/O capability.

So, the bottleneck definitely is network (I think it is not only for sorting but for many others problems). But it seems that either Yahoo cluster is suboptimal from the point of view of max sustained throughput or Hadoop can not saturate 1Gb link. OK, lets imagine we do not use commodity hardware but more optimized servers and network configurations.

How about 2 10Gb port NIC per server and 128 – port 10GB switch. Just one. By increasing network throughput from 30MB/s to 2GB/s (2 10Gb port NIC per server) sec we can reduce the number of servers in a cluster by factor of 70 (~ 50 servers) and still keep the same 60 sec run. Is it possible to sort 2GB per second (20 million 100 bytes records ) on one server. Sure it is.

Yahoo cluster costs approx 7 million. I can build my cluster for less than 1 million and we are not talking about power consumption and other associated costs.

MapReduce and commodity hardware won’t save you money. Do not buy cheap.
Curt Monash on October 11th, 2009 9:04 am

@Vlad,

http://www.dbms2.com/2008/10/15/ebay-doesnt-love-mapreduce/ may be relevant. 🙂
Jerome Pineau on October 11th, 2009 10:49 am

Curt, do you know how many of these V customers are “in the cloud” (ie: they’re running on V AMIs in EC2) and how many of those are in that 10% or so you mention?
Vlad on October 11th, 2009 10:40 pm

MapReduce is heavily promoted, for some reason, by Yahoo and Facebook but not by Google. Google (and Microsoft) have developed already next generation “Hadoops” (Pregel and Dryad) but they are still not available for general public and not open-sourced. Even information on Pregel is limited.

To me the situation reminds Soviet Union in middle-late 80s. Not being able to create its own supercomputers, Soviets tried to reverse engineer American ones (Cray etc). You can reproduce what has been done already but you always be behind.

UPD. Dryad can be downloaded from MS site but only for academic research.
RC on October 12th, 2009 3:46 am

@Vlad

Is Dryad much better than Hadoop? If so, what are the improvements?
Vlad on October 12th, 2009 3:53 pm

@RC
From Dryad whitepaper:
“The fundamental difference between the two systems (Dryad and MapReduce) is that a Dryad application may specify an arbitrary communication DAG rather than requiring
a sequence of map/distribute/sort/reduce operations. In particular, graph vertices may consume multiple inputs, and generate multiple outputs, of different types. For many applications this simplifies the mapping from algorithm to implementation, lets us
build on a greater library of basic subroutines, and, together with the ability to exploit TCP pipes and shared-memory for data edges, can bring substantial performance gains. At the same time, our implementation is general enough to support all the features described in the MapReduce paper.”
Andrew S on October 19th, 2009 7:54 pm

Vlad, the difference is that the Soviets didn’t have open source behind them. A more common pattern in recent history has been:

1. Proprietary software solution comes out
2. A good open source solution with similar capabilities comes out later.
3. Open source solution gains large backers, top developers, cutting-edge tech companies, leading academics
4. Open source solution eclipses proprietary solution in usage because of easy availability and documentation
5. Proprietary solution dies out because it becomes profitable to switch to open source solution.

Hadoop is somewhere in (3) and partially in (4).
Cloudera Enterprise and Hadoop evolution | DBMS 2 : DataBase Management System Services on July 13th, 2011 8:04 pm

[…] None of this is inconsistent with previous surveys of Hadoop use cases. […]
Will Hadoop Vendors Profit from Banks’ Big Data Woes? — Gigaom Research on October 23rd, 2013 6:21 am

[…] be Bank of America’s managing director for big data and analytics. A year ago, already, Vertica indicated that roughly 10 percent of its customers were in production with Hadoop — a trend spearheaded by its financial services customers. On the […]

Leave a Reply

Search our blogs and white papers

Monash Research blogs

DBMS 2 covers database management, analytics, and related technologies.
Text Technologies covers text mining, search, and social software.
Strategic Messaging analyzes marketing and messaging strategy.
The Monash Report examines technology and public policy issues.
Software Memories recounts the history of the software industry.

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.

Links
- Monash Research
- White Papers
Admin
- Log in

How 30+ enterprises are using Hadoop

Comments

Search our blogs and white papers

Monash Research blogs

User consulting

Vendor advisory

Monash Research highlights

Recent posts

Categories

Date archives

Admin