Comments on: How 30+ enterprises are using Hadoop

By: Will Hadoop Vendors Profit from Banks’ Big Data Woes? — Gigaom Research

Wed, 23 Oct 2013 10:21:34 +0000

[…] be Bank of America’s managing director for big data and analytics. A year ago, already, Vertica indicated that roughly 10 percent of its customers were in production with Hadoop — a trend spearheaded by its financial services customers. On the […]

By: Cloudera Enterprise and Hadoop evolution | DBMS 2 : DataBase Management System Services

Thu, 14 Jul 2011 01:04:35 +0000

[…] None of this is inconsistent with previous surveys of Hadoop use cases. […]

By: Andrew S

Andrew S — Mon, 19 Oct 2009 23:54:52 +0000

Vlad, the difference is that the Soviets didn’t have open source behind them. A more common pattern in recent history has been:

1. A proprietary software solution comes out.
2. A good open source solution with similar capabilities comes out later.
3. The open source solution gains large backers, top developers, cutting-edge tech companies, and leading academics.
4. The open source solution eclipses the proprietary solution in usage because of easy availability and documentation.
5. The proprietary solution dies out because it becomes profitable to switch to the open source solution.

Hadoop is somewhere in (3) and partially in (4).

By: Vlad

Vlad — Mon, 12 Oct 2009 19:53:54 +0000

@RC
From Dryad whitepaper:
“The fundamental difference between the two systems (Dryad and MapReduce) is that a Dryad application may specify an arbitrary communication DAG rather than requiring a sequence of map/distribute/sort/reduce operations. In particular, graph vertices may consume multiple inputs, and generate multiple outputs, of different types. For many applications this simplifies the mapping from algorithm to implementation, lets us build on a greater library of basic subroutines, and, together with the ability to exploit TCP pipes and shared-memory for data edges, can bring substantial performance gains. At the same time, our implementation is general enough to support all the features described in the MapReduce paper.”
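
To make the quoted distinction concrete, here is a toy sketch in Python of a job expressed as an arbitrary DAG, where one vertex consumes two inputs and its output feeds two downstream vertices. This is not Dryad's API; `run_dag`, the vertex names, and the sample data are all made up for illustration, and everything runs in a single process.

```python
from collections import Counter
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(vertices, edges, sources):
    """Run each vertex once all of its upstream vertices have produced output.

    vertices: name -> function taking one argument per upstream vertex
    edges:    name -> list of upstream names (in argument order)
    sources:  name -> precomputed input value
    """
    results = dict(sources)
    graph = {name: edges.get(name, []) for name in vertices}
    for name in TopologicalSorter(graph).static_order():
        if name in results:
            continue  # source data, nothing to compute
        inputs = [results[up] for up in edges.get(name, [])]
        results[name] = vertices[name](*inputs)
    return results


# Hypothetical job: count words in two inputs, merge, then fan out.
vertices = {
    "count_a": lambda words: Counter(words),
    "count_b": lambda words: Counter(words),
    "merge": lambda a, b: a + b,                      # two inputs
    "top": lambda merged: merged.most_common(1),      # first consumer of merge
    "total": lambda merged: sum(merged.values()),     # second consumer of merge
}
edges = {
    "count_a": ["words_a"],
    "count_b": ["words_b"],
    "merge": ["count_a", "count_b"],
    "top": ["merge"],
    "total": ["merge"],
}
sources = {"words_a": ["x", "y", "x"], "words_b": ["y", "x"]}

results = run_dag(vertices, edges, sources)
print(results["top"], results["total"])
```

A plain map/reduce job would have to express the merge and the two downstream steps as separate chained jobs; in the DAG form they are just more vertices and edges.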

By: RC

RC — Mon, 12 Oct 2009 07:46:12 +0000

@Vlad

Is Dryad much better than Hadoop? If so, what are the improvements?

By: Vlad

Vlad — Mon, 12 Oct 2009 02:40:40 +0000

MapReduce is heavily promoted, for some reason, by Yahoo and Facebook but not by Google. Google and Microsoft have already developed next-generation “Hadoops” (Pregel and Dryad), but these are still not available to the general public and are not open-sourced. Even information on Pregel is limited.

To me, the situation is reminiscent of the Soviet Union in the mid-to-late 80s. Unable to create its own supercomputers, the Soviets tried to reverse engineer American ones (Cray, etc.). You can reproduce what has already been done, but you will always be behind.

UPD. Dryad can be downloaded from the MS site, but only for academic research.

By: Jerome Pineau

Jerome Pineau — Sun, 11 Oct 2009 14:49:57 +0000

Curt, do you know how many of these V customers are “in the cloud” (i.e., they’re running on V AMIs in EC2) and how many of those are in that 10% or so you mention?

By: Curt Monash

Curt Monash — Sun, 11 Oct 2009 13:04:24 +0000

@Vlad,

http://www.dbms2.com/2008/10/15/ebay-doesnt-love-mapreduce/ may be relevant. 🙂

By: Vlad

Vlad — Sun, 11 Oct 2009 07:34:45 +0000

I have made some calculations based on data publicly available on the Internet. Take the famous Yahoo Terasort record: sorting 1 TB of data (actually 10 billion 100-byte records) on a Hadoop cluster of ~3400+ servers in 60 seconds. I will omit the calculation details, but the average CPU, disk I/O, and network I/O utilization during the run was:

1%, 5-6%, and 30%, respectively. These are not exact numbers, of course, but estimates based on the sorting algorithm used, the cluster’s configuration, server CPU power, max NIC throughput (1 Gb), and the I/O capability of a 4-disk SATA array.

So the bottleneck is definitely the network (I think this holds not only for sorting but for many other problems). But it seems that either the Yahoo cluster is suboptimal in terms of maximum sustained throughput, or Hadoop cannot saturate a 1 Gb link. OK, let’s imagine we use not commodity hardware but more optimized servers and network configurations.

How about a NIC with two 10 Gb ports per server and a 128-port 10 Gb switch? Just one. By increasing network throughput from 30 MB/s to 2 GB/s (two 10 Gb ports per server), we can reduce the number of servers in the cluster by a factor of 70 (to ~50 servers) and still keep the same 60-second run. Is it possible to sort 2 GB per second (20 million 100-byte records) on one server? Sure it is.
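
For anyone who wants to check the arithmetic, here is a rough sketch in Python. The inputs are just the estimates quoted in this comment, not measurements, and the variable names are made up for the example.

```python
# Back-of-the-envelope check of the scaling argument in this comment.
# Every input is an estimate taken from the comment, not a measured value.

RECORD_BYTES = 100            # size of one sort record
YAHOO_SERVERS = 3400          # "~3400+" servers in the record run
OLD_NET_BYTES_PER_S = 30e6    # ~30 MB/s per server, roughly 30% of a 1 Gb link
NEW_NET_BYTES_PER_S = 2e9     # ~2 GB/s per server, assumed for two 10 Gb ports

# How much more data each server can move per second.
speedup = NEW_NET_BYTES_PER_S / OLD_NET_BYTES_PER_S

# Servers needed to keep the same run time if the network is the only limit.
servers_needed = YAHOO_SERVERS / speedup

# Records each server must sort per second at the new network rate.
records_per_server_per_s = NEW_NET_BYTES_PER_S / RECORD_BYTES

print(f"speedup per server: about {speedup:.0f}x")
print(f"servers needed: about {servers_needed:.0f}")
print(f"records per server per second: {records_per_server_per_s:,.0f}")
```

This gives roughly 67x and 51 servers, which is where the rounded "factor of 70" and "~50 servers" come from. It assumes the network stays the only bottleneck after the upgrade, which is the whole premise of the argument.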

The Yahoo cluster costs approximately 7 million. I can build my cluster for less than 1 million, and that is before counting power consumption and other associated costs.

MapReduce and commodity hardware won’t save you money. Do not buy cheap.