<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: How 30+ enterprises are using Hadoop</title>
	<atom:link href="http://www.dbms2.com/2009/10/10/enterprises-using-hadoo/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com/2009/10/10/enterprises-using-hadoo/</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Mon, 25 Jan 2010 14:39:21 -0500</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Andrew S</title>
		<link>http://www.dbms2.com/2009/10/10/enterprises-using-hadoo/comment-page-1/#comment-144612</link>
		<dc:creator>Andrew S</dc:creator>
		<pubDate>Mon, 19 Oct 2009 23:54:52 +0000</pubDate>
		<guid isPermaLink="false">http://www.dbms2.com/?p=1073#comment-144612</guid>
		<description>Vlad, the difference is that the Soviets didn&#039;t have open source behind them. A more common pattern in recent history has been:

1. Proprietary software solution comes out
2. A good open source solution with similar capabilities comes out later.
3. Open source solution gains large backers, top developers, cutting-edge tech companies, leading academics
4. Open source solution eclipses proprietary solution in usage because of easy availability and documentation
5. Proprietary solution dies out because it becomes profitable to switch to open source solution.

Hadoop is somewhere in (3) and partially in (4).</description>
		<content:encoded><![CDATA[<p>Vlad, the difference is that the Soviets didn&#8217;t have open source behind them. A more common pattern in recent history has been:</p>
<p>1. Proprietary software solution comes out<br />
2. A good open source solution with similar capabilities comes out later.<br />
3. Open source solution gains large backers, top developers, cutting-edge tech companies, leading academics<br />
4. Open source solution eclipses proprietary solution in usage because of easy availability and documentation<br />
5. Proprietary solution dies out because it becomes profitable to switch to open source solution.</p>
<p>Hadoop is somewhere in (3) and partially in (4).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Vlad</title>
		<link>http://www.dbms2.com/2009/10/10/enterprises-using-hadoo/comment-page-1/#comment-143238</link>
		<dc:creator>Vlad</dc:creator>
		<pubDate>Mon, 12 Oct 2009 19:53:54 +0000</pubDate>
		<guid isPermaLink="false">http://www.dbms2.com/?p=1073#comment-143238</guid>
		<description>@RC
From the Dryad whitepaper:
&quot;The fundamental difference between the two systems (Dryad and MapReduce) is that a Dryad application may specify an arbitrary communication DAG rather than requiring a sequence of map/distribute/sort/reduce operations. In particular, graph vertices may consume multiple inputs, and generate multiple outputs, of different types. For many applications this simplifies the mapping from algorithm to implementation, lets us build on a greater library of basic subroutines, and, together with the ability to exploit TCP pipes and shared-memory for data edges, can bring substantial performance gains. At the same time, our implementation is general enough to support all the features described in the MapReduce paper.&quot;</description>
		<content:encoded><![CDATA[<p>@RC<br />
From the Dryad whitepaper:<br />
&#8220;The fundamental difference between the two systems (Dryad and MapReduce) is that a Dryad application may specify an arbitrary communication DAG rather than requiring a sequence of map/distribute/sort/reduce operations. In particular, graph vertices may consume multiple inputs, and generate multiple outputs, of different types. For many applications this simplifies the mapping from algorithm to implementation, lets us build on a greater library of basic subroutines, and, together with the ability to exploit TCP pipes and shared-memory for data edges, can bring substantial performance gains. At the same time, our implementation is general enough to support all the features described in the MapReduce paper.&#8221;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: RC</title>
		<link>http://www.dbms2.com/2009/10/10/enterprises-using-hadoo/comment-page-1/#comment-143176</link>
		<dc:creator>RC</dc:creator>
		<pubDate>Mon, 12 Oct 2009 07:46:12 +0000</pubDate>
		<guid isPermaLink="false">http://www.dbms2.com/?p=1073#comment-143176</guid>
		<description>@Vlad

Is Dryad much better than Hadoop? If so, what are the improvements?</description>
		<content:encoded><![CDATA[<p>@Vlad</p>
<p>Is Dryad much better than Hadoop? If so, what are the improvements?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Vlad</title>
		<link>http://www.dbms2.com/2009/10/10/enterprises-using-hadoo/comment-page-1/#comment-143159</link>
		<dc:creator>Vlad</dc:creator>
		<pubDate>Mon, 12 Oct 2009 02:40:40 +0000</pubDate>
		<guid isPermaLink="false">http://www.dbms2.com/?p=1073#comment-143159</guid>
		<description>MapReduce is heavily promoted, for some reason, by Yahoo and Facebook but not by Google. Google (and Microsoft) have already developed next-generation &quot;Hadoops&quot; (Pregel and Dryad), but these are still not available to the general public and are not open source. Even information on Pregel is limited.

To me, the situation is reminiscent of the Soviet Union in the mid-to-late 1980s. Unable to build its own supercomputers, the Soviets tried to reverse-engineer American ones (Cray etc.). You can reproduce what has already been done, but you will always be behind.

UPD. Dryad can be downloaded from the Microsoft site, but only for academic research.</description>
		<content:encoded><![CDATA[<p>MapReduce is heavily promoted, for some reason, by Yahoo and Facebook but not by Google. Google (and Microsoft) have already developed next-generation &#8220;Hadoops&#8221; (Pregel and Dryad), but these are still not available to the general public and are not open source. Even information on Pregel is limited.</p>
<p>To me, the situation is reminiscent of the Soviet Union in the mid-to-late 1980s. Unable to build its own supercomputers, the Soviets tried to reverse-engineer American ones (Cray etc.). You can reproduce what has already been done, but you will always be behind.</p>
<p>UPD. Dryad can be downloaded from the Microsoft site, but only for academic research.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jerome Pineau</title>
		<link>http://www.dbms2.com/2009/10/10/enterprises-using-hadoo/comment-page-1/#comment-143131</link>
		<dc:creator>Jerome Pineau</dc:creator>
		<pubDate>Sun, 11 Oct 2009 14:49:57 +0000</pubDate>
		<guid isPermaLink="false">http://www.dbms2.com/?p=1073#comment-143131</guid>
		<description>Curt, do you know how many of these V customers are &quot;in the cloud&quot; (i.e., they&#039;re running on V AMIs in EC2), and how many of those are in that 10% or so you mention?</description>
		<content:encoded><![CDATA[<p>Curt, do you know how many of these V customers are &#8220;in the cloud&#8221; (i.e., they&#8217;re running on V AMIs in EC2), and how many of those are in that 10% or so you mention?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Curt Monash</title>
		<link>http://www.dbms2.com/2009/10/10/enterprises-using-hadoo/comment-page-1/#comment-143128</link>
		<dc:creator>Curt Monash</dc:creator>
		<pubDate>Sun, 11 Oct 2009 13:04:24 +0000</pubDate>
		<guid isPermaLink="false">http://www.dbms2.com/?p=1073#comment-143128</guid>
		<description>@Vlad,

http://www.dbms2.com/2008/10/15/ebay-doesnt-love-mapreduce/ may be relevant. :)</description>
		<content:encoded><![CDATA[<p>@Vlad,</p>
<p><a href="http://www.dbms2.com/2008/10/15/ebay-doesnt-love-mapreduce/"  rel="nofollow">http://www.dbms2.com/2008/10/15/ebay-doesnt-love-mapreduce/</a> may be relevant. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Vlad</title>
		<link>http://www.dbms2.com/2009/10/10/enterprises-using-hadoo/comment-page-1/#comment-143111</link>
		<dc:creator>Vlad</dc:creator>
		<pubDate>Sun, 11 Oct 2009 07:34:45 +0000</pubDate>
		<guid isPermaLink="false">http://www.dbms2.com/?p=1073#comment-143111</guid>
		<description>I have made some calculations based on data publicly available on the Internet. Take the famous Yahoo Terasort record: sorting 1 TB of data (actually 10 billion 100-byte records) on a Hadoop cluster of ~3400+ servers in 60 seconds. I will omit the calculation details, but the average CPU, disk I/O and network I/O utilization during the run were:

1%, 5-6% and 30%, respectively. These are not exact numbers, of course, but estimates based on the sorting algorithm used, the cluster&#039;s configuration, server CPU power, maximum NIC throughput (1Gb) and the I/O capability of a 4-disk SATA array.

So the bottleneck is definitely the network (I think this holds not only for sorting but for many other problems). But it seems that either the Yahoo cluster is suboptimal in terms of maximum sustained throughput, or Hadoop cannot saturate a 1Gb link. OK, let&#039;s imagine we do not use commodity hardware but more optimized servers and network configurations.

How about two 10Gb NIC ports per server and a 128-port 10Gb switch? Just one. By increasing per-server network throughput from 30MB/s to 2GB/s (two 10Gb NIC ports per server), we can reduce the number of servers in the cluster by a factor of 70 (to ~50 servers) and still keep the same 60-second run. Is it possible to sort 2GB per second (20 million 100-byte records) on one server? Sure it is.

The Yahoo cluster costs approximately 7 million. I can build my cluster for less than 1 million, and that is before we even talk about power consumption and other associated costs.

MapReduce and commodity hardware won&#039;t save you money. Do not buy cheap.</description>
		<content:encoded><![CDATA[<p>I have made some calculations based on data publicly available on the Internet. Take the famous Yahoo Terasort record: sorting 1 TB of data (actually 10 billion 100-byte records) on a Hadoop cluster of ~3400+ servers in 60 seconds. I will omit the calculation details, but the average CPU, disk I/O and network I/O utilization during the run were:</p>
<p>1%, 5-6% and 30%, respectively. These are not exact numbers, of course, but estimates based on the sorting algorithm used, the cluster&#8217;s configuration, server CPU power, maximum NIC throughput (1Gb) and the I/O capability of a 4-disk SATA array.</p>
<p>So the bottleneck is definitely the network (I think this holds not only for sorting but for many other problems). But it seems that either the Yahoo cluster is suboptimal in terms of maximum sustained throughput, or Hadoop cannot saturate a 1Gb link. OK, let&#8217;s imagine we do not use commodity hardware but more optimized servers and network configurations.</p>
<p>How about two 10Gb NIC ports per server and a 128-port 10Gb switch? Just one. By increasing per-server network throughput from 30MB/s to 2GB/s (two 10Gb NIC ports per server), we can reduce the number of servers in the cluster by a factor of 70 (to ~50 servers) and still keep the same 60-second run. Is it possible to sort 2GB per second (20 million 100-byte records) on one server? Sure it is.</p>
<p>The Yahoo cluster costs approximately 7 million. I can build my cluster for less than 1 million, and that is before we even talk about power consumption and other associated costs.</p>
<p>MapReduce and commodity hardware won&#8217;t save you money. Do not buy cheap.</p>
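<p>A rough sketch of the sizing arithmetic above, in Python. The inputs are only the estimates quoted in this comment (about 3,400 servers, roughly 30MB/s sustained per server, 2GB/s per server with faster NICs), not measurements, and the function names are hypothetical, chosen for illustration.</p>

```python
import math

# Estimates taken from the comment above; none of these are measured values.
YAHOO_SERVERS = 3400           # approximate size of the Terasort cluster
OLD_MB_PER_S = 30              # estimated sustained network throughput per server
NEW_MB_PER_S = 2000            # assumed throughput with two 10Gb ports (2GB/s)
RECORD_BYTES = 100             # Terasort record size


def speedup(old_mb_per_s, new_mb_per_s):
    """Factor by which per-server network throughput grows."""
    return new_mb_per_s / old_mb_per_s


def servers_needed(servers, old_mb_per_s, new_mb_per_s):
    """Servers required for the same aggregate throughput, rounded up."""
    return math.ceil(servers * old_mb_per_s / new_mb_per_s)


def records_per_second(bytes_per_s, record_bytes):
    """How many fixed-size records one server must sort each second."""
    return bytes_per_s // record_bytes


if __name__ == "__main__":
    print(round(speedup(OLD_MB_PER_S, NEW_MB_PER_S), 1))
    print(servers_needed(YAHOO_SERVERS, OLD_MB_PER_S, NEW_MB_PER_S))
    print(records_per_second(2_000_000_000, RECORD_BYTES))
```

<p>With these inputs the throughput ratio comes out a little under 70 and the server count lands close to 50, in line with the rounded figures above. The sketch only scales the network-bound estimate; it says nothing about whether a single server can actually sustain that sort rate.</p>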
]]></content:encoded>
	</item>
</channel>
</rss>

