<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; Yahoo</title>
	<atom:link href="http://www.dbms2.com/category/users/yahoo/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Tue, 07 Feb 2012 06:49:30 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>Big data terminology and positioning</title>
		<link>http://www.dbms2.com/2012/01/08/big-data-terminology-and-positioning/</link>
		<comments>http://www.dbms2.com/2012/01/08/big-data-terminology-and-positioning/#comments</comments>
		<pubDate>Mon, 09 Jan 2012 01:35:57 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5768</guid>
		<description><![CDATA[Recently, I observed that Big Data terminology is seriously broken. It is reasonable to reduce the subject to two quasi-dimensions: Bigness &#8212; Volume, Velocity, size Structure &#8212; Variety, Variability, Complexity given that High-velocity &#8220;big data&#8221; problems are usually high-volume as well.* Variety, variability, and complexity all relate to the simply-structured/poly-structured distinction. But the conflation should [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, I observed that <a href="../../../../../2011/09/11/big-data-has-jumped-the-shark/">Big Data terminology is seriously broken</a>. It is reasonable to reduce the subject to two quasi-dimensions:</p>
<ul>
<li><strong>Bigness</strong> &#8212; Volume, Velocity, Size</li>
<li><strong>Structure</strong> &#8212; Variety, Variability, Complexity</li>
</ul>
<p>given that</p>
<ul>
<li>High-velocity &#8220;big data&#8221; problems are usually high-volume as well.*</li>
<li>Variety, variability, and complexity all relate to the <a href="../../../../../2011/05/17/poly-structured-database/">simply-structured/poly-structured</a> distinction.</li>
</ul>
<p>But the conflation should stop there.</p>
<p><em>*Low-volume/high-velocity problems are commonly referred to as <a href="../2011/08/25/renaming-cep-or-not/">&#8220;event processing&#8221; and/or &#8220;streaming&#8221;</a>.</em></p>
<p>When people claim that bigness and structure are the same issue, they oversimplify into mush. So I think we need four pieces of terminology, reflective of a 2&#215;2 matrix of possibilities. For want of better alternatives, my suggestions are:</p>
<ul>
<li><strong>Relational big data</strong> is data of high volume that fits well into a relational DBMS.</li>
<li><strong>Multi-structured big data</strong> is data of high volume that doesn&#8217;t fit well into a relational DBMS. <em>Alternative: Poly-structured big data.</em></li>
<li><strong>Conventional relational data</strong> is data of not-so-high volume that fits well into a relational DBMS. <em>Alternatives: Ordinary/normal/smaller relational data.</em></li>
<li><strong>Smaller poly-structured data</strong> is data for which <a href="../../../../../2011/07/31/dynamic-fixed-schema-databases/">dynamic schema</a> capabilities are important, but which doesn&#8217;t rise to &#8220;big data&#8221; volume.</li>
</ul>
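<p>The 2&#215;2 matrix above can be sketched as a simple lookup. This is only an illustration: the two boolean flags and the function name are assumptions made for the example, not terminology from the list.</p>

```python
# Hypothetical lookup for the 2x2 terminology matrix; the flag names are
# illustrative assumptions, not terms from the post.
TERMS = {
    (True, True): "relational big data",
    (True, False): "multi-structured big data",
    (False, True): "conventional relational data",
    (False, False): "smaller poly-structured data",
}


def classify(high_volume: bool, fits_relational: bool) -> str:
    """Return the suggested term for one cell of the matrix."""
    return TERMS[(high_volume, fits_relational)]


# High volume, poor relational fit (log files are the usual example).
print(classify(True, False))
```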
<p><span id="more-5768"></span>Notes on all this include:</p>
<ul>
<li>&#8220;Relational big data&#8221; is commonly what you need a scalable analytic relational DBMS for. But there are non-analytic use cases as well.</li>
<li>The paradigmatic example of &#8220;multi-structured big data&#8221; is log files. Thus, multi-structured big data is commonly what you need a <a href="../../../../../2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a> for.</li>
<li>One might want to equate non-analytic relational big data technology to &#8220;NewSQL&#8221;. However, I&#8217;m struggling to think of a database size range in which the entire NewSQL industry can match Oracle&#8217;s market share alone.</li>
<li>One might want to equate non-analytic multi-structured big data technology to &#8220;NoSQL&#8221;. However:
<ul>
<li>&#8220;NoSQL&#8221; is also used to encompass not-so-big-data use cases, such as prototyping in MongoDB.</li>
<li><a href="../../../../../2011/10/02/defining-nosql/">&#8220;NoSQL&#8221; has non-ACID/low(er)-data-integrity connotations</a> that aren&#8217;t appropriate for all non-relational systems.</li>
</ul>
</li>
<li>Up to a point, you can analyze relational big data in a conventional relational DBMS, but an analytic RDBMS will usually win on TCO (Total Cost of Ownership). In particular, reasonable thresholds for moving an analytic database off Oracle might be:
<ul>
<li>1-2 terabytes if you&#8217;ve never bought anything past Oracle Standard Edition.</li>
<li>5-10 terabytes if you&#8217;re already paying for Oracle Enterprise Edition.</li>
<li>A lot higher than that if you actually find Oracle Exadata to be cost-effective.</li>
</ul>
</li>
<li>Depending on how big one acknowledges as &#8220;big&#8221;, the market share leader in &#8220;big bit bucket&#8221; use cases is either Splunk or Hadoop.</li>
<li>If we look at multi-structured big data management overall, MarkLogic joins the list of market share contenders, as do various NoSQL alternatives.</li>
<li>It is wrong to say that the large web companies invented &#8220;big data&#8221; technology. But it is more reasonable to say they invented much of &#8220;multi-structured big data&#8221; management. In particular (and this is just a partial list), Google, Amazon, Yahoo, Facebook, et al. can reasonably be credited with Hadoop, Cassandra, HBase and various predecessors to same.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/01/08/big-data-terminology-and-positioning/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Hadoop evolution</title>
		<link>http://www.dbms2.com/2011/08/21/hadoop-evolution/</link>
		<comments>http://www.dbms2.com/2011/08/21/hadoop-evolution/#comments</comments>
		<pubDate>Sun, 21 Aug 2011 11:54:43 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Workload management]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5109</guid>
		<description><![CDATA[I wanted to learn more about Hadoop and its futures, so I talked Friday with Arun Murthy of Hortonworks.* Most of what we talked about was: NameNode evolution, and the related issue of file-count limitations. JobTracker evolution. Arun previously addressed these issues and more in a June slide deck. *Arun has been working on Hadoop [...]]]></description>
			<content:encoded><![CDATA[<p>I wanted to learn more about Hadoop and its <a href="../../../../../2011/07/10/hadoop-futures-and-enhancements/">futures</a>, so I talked Friday with Arun Murthy of Hortonworks.* Most of what we talked about was:</p>
<ul>
<li>NameNode evolution, and the related issue of file-count limitations.</li>
<li>JobTracker evolution.</li>
</ul>
<p>Arun previously addressed these issues and more in a <a href="http://www.slideshare.net/hortonworks/nextgen-apache-hadoop-mapreduce">June slide deck</a>.<br />
<span id="more-5109"></span></p>
<p><em>*Arun has been working on Hadoop full time for over 6 years, and leading the Yahoo/Hortonworks MapReduce team for at least the last 2.</em></p>
<p>For both NameNode and JobTracker, Arun&#8217;s take was along the lines of:</p>
<ul>
<li>It&#8217;s worked well so far, but it had to be enhanced.</li>
<li>What we&#8217;re taking care of in the current development cycle will make things better.</li>
<li>Future enhancements after that are of course likely.</li>
<li>What&#8217;s more, we&#8217;re doing some refactoring* to improve pluggability, so that alternatives to the mainline Apache Hadoop version of any specific module can easily be swapped in.</li>
</ul>
<p><em>*My word, not Arun&#8217;s.</em></p>
<p>In the case of both NameNode and JobTracker, the problem starts with an essential component of the Hadoop system being on a single server. That single server can be mirrored for high availability, but it&#8217;s a bottleneck. The current fix for NameNode is to federate it across multiple servers, each owning a portion of the namespace. A variety of further possibilities come to mind:</p>
<ul>
<li>Allow persistent storage (I&#8217;d urge solid-state memory be assumed in the design), in some sort of a purpose-built system. (Right now NameNode keeps all the metadata it manages in RAM, which is why it has such capacity limitations.)</li>
<li>Use HBase, as <a href="../../../../../2011/07/27/introduction-to-zettaset/">Zettaset</a> does.</li>
<li>Build on some other open-source DBMS, especially one that was designed for hybrid memory-centric use cases.</li>
</ul>
<p>The Hadoop guys evidently don&#8217;t know which of those approaches will be tried next, in which combination. But if they choose wrong &#8212; well, as Arun points out, you&#8217;re welcome to implement your own alternative.</p>
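<p>To make &#8220;each owning a portion of the namespace&#8221; concrete, here is a minimal sketch of routing a path to its owning server by longest matching prefix. It is not the actual Hadoop implementation; the mount table, server names, and paths are all made up for illustration.</p>

```python
# Hypothetical mount table: path prefix to the server owning that subtree.
# Server names and paths are made up for illustration.
MOUNTS = {
    "/": "namenode-0",
    "/ads": "namenode-1",
    "/ads/aggregates": "namenode-2",
    "/logs": "namenode-3",
}


def owner(path: str) -> str:
    """Pick the server whose prefix is the longest match for the path."""
    matches = [
        prefix for prefix in MOUNTS
        if prefix == "/" or path == prefix or path.startswith(prefix + "/")
    ]
    return MOUNTS[max(matches, key=len)]


print(owner("/ads/aggregates/2011/08/19"))  # deepest matching prefix wins
print(owner("/adserver/raw"))               # "/ads" is not a path-component match
```

<p>Matching on whole path components, rather than raw string prefixes, is what keeps a directory such as /adserver from being routed to the owner of /ads.</p>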
<p>Meanwhile, Arun says the current improvement will take Hadoop&#8217;s capacity up to half a billion files or so,* or 60-100 petabytes of &#8220;storage&#8221; (presumably before <a href="../../../../../2011/07/06/hadoop-hardware-and-compression/">compression</a>, replication, and so on are taken into account). I didn&#8217;t ask Arun to walk me through the arithmetic of that, but I did ask why there were so many files in the first place. I&#8217;ve heard in the past of what amounted to &#8220;a new file for every update&#8221; kinds of scenarios, but that&#8217;s not the example he gave. Rather, he spoke of such slicing and dicing of Yahoo advertising data that there are huge numbers of calculated quasi-cubes. For example, advertiser-specific aggregates are run on 5-minute slices of web event data, and stored for 15-minute slices; the raw data is of course kept as well.</p>
<p><em>*Depending on who you ask, the current figure seems to be in the 70-100 million range, special cases perhaps aside.</em></p>
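<p>One way to sanity-check the half-billion-files and 60-100 petabyte figures is to compute the average file size they imply. The sketch below assumes decimal units and an evenly spread file count, neither of which comes from Arun.</p>

```python
FILES = 500_000_000   # "half a billion files or so"
PETABYTE = 10**15     # decimal petabyte, an assumption
MEGABYTE = 10**6      # decimal megabyte, an assumption


def avg_file_mb(storage_pb, files=FILES):
    """Average file size in MB implied by total storage and file count."""
    return storage_pb * PETABYTE / files / MEGABYTE


for storage_pb in (60, 100):
    print(storage_pb, "PB implies about", round(avg_file_mb(storage_pb)), "MB per file")
```

<p>Under those assumptions the implied average is in the low hundreds of megabytes per file.</p>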
<p>Arun also addressed the debate over the fact that MapReduce uses the node-specific file system to create temporary files, rather than HDFS. One reason for this is surely the file-count limitation, but another is performance. Yes, this choice means you lose intermediate results if there&#8217;s a node failure; but Hadoop is designed to deal with nodes that fail. And by the way &#8212; even if you did use HDFS for intermediate result sets, you might well be setting its replication factor to 1.</p>
<p>As for JobTracker &#8212; that&#8217;s getting split into at least two pieces, to separate its two main functions:</p>
<ul>
<li>Global Resource Manager, which will manage the allocation of resources to various applications, which is apparently a fairly lightweight task.</li>
<li>Application Master, which will manage the scheduling across nodes of individual applications, which is apparently the part that consumes lots of resources.</li>
</ul>
<p>Apparently JobTracker has been a key barrier to scalability, being limited to 40,000-50,000 tasks per cluster, which is a lot like 10-15 tasks/node multiplied by the limit-to-date of 4,000 or so nodes/cluster. The new architecture, in which multiple Application Masters compete for resources dispensed by a single Global Resource Manager, is supposed to allow at least 20-50 tasks/node, and a greater number of nodes as well. This capability was checked into the main Hadoop code line just this week.</p>
<p>Global Resource Manager also runs on a single node, high availability mirroring aside. Discussion of opportunities for further scaling was much like our discussion of scaling NameNode &#8212; there&#8217;s one option that could work, and another option that could also work, and in the meantime one can substitute one&#8217;s own version if one likes. Indeed, different Hadoop users already plug different schedulers into JobTracker today.</p>
<p>Another benefit of refactoring JobTracker is that non-MapReduce processing frameworks can be included in Hadoop. The one that gets mentioned repeatedly is MPI (Message Passing Interface). Like the <a href="../../../../../2011/04/21/sas-hpa-does-make-sense-after-all/">SAS</a> folks, the <a href="../../../../../2011/04/06/so-can-logistic-regression-be-parallelized-or-not/">Hadoop</a> folks seem to think that not all important algorithms can be parallelized via MapReduce, and that MPI is a great place to look for alternative strategies.</p>
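<p>As a rough check on the tasks-per-cluster arithmetic, here is a minimal sketch. The function name is an assumption, and so is holding the node count at 4,000 for the new architecture, which is supposed to support more nodes than that.</p>

```python
NODES = 4_000  # "limit-to-date of 4,000 or so nodes/cluster"


def cluster_tasks(tasks_per_node_low, tasks_per_node_high, nodes=NODES):
    """Return the (low, high) range of tasks per cluster."""
    return tasks_per_node_low * nodes, tasks_per_node_high * nodes


old = cluster_tasks(10, 15)  # JobTracker-era tasks per node
new = cluster_tasks(20, 50)  # new architecture, same node count assumed
print(old)
print(new)
```

<p>The upper product for the old design comes to 60,000, somewhat above the quoted ceiling, which fits the &#8220;a lot like&#8221; wording. The figures for the new design are a floor, since both tasks/node and node count are expected to rise.</p>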
<p>I&#8217;m still confused about issues such as:</p>
<ul>
<li>What resources the Global Resource Manager manages (CPU? RAM? I/O?).</li>
<li>Whether the Global Resource Manager prioritizes work in the flexible way that, say, a good analytic DBMS workload manager does (I&#8217;m guessing it doesn&#8217;t, based on the discussion of alternative scheduling modules).</li>
<li>What, if anything, still needs to be done to actually make MPI work within Hadoop.</li>
</ul>
<p>But in any case, this all seems to be an important step in Hadoop&#8217;s evolution toward being (more of) a flexible, industrial-strength application execution environment.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/08/21/hadoop-evolution/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Cloudera and Hortonworks</title>
		<link>http://www.dbms2.com/2011/07/10/cloudera-and-hortonworks/</link>
		<comments>http://www.dbms2.com/2011/07/10/cloudera-and-hortonworks/#comments</comments>
		<pubDate>Mon, 11 Jul 2011 03:13:36 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Hortonworks]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4939</guid>
		<description><![CDATA[My clients at Cloudera have been around for a while, in effect positioned as &#8220;the Hadoop company.&#8221; Their business, in a nutshell, consists of: Packaging up a Cloudera distribution of Apache Hadoop. This distribution doesn&#8217;t have proprietary code; it&#8217;s just packaged by Cloudera from Apache projects (with a decent minority of the code happening to [...]]]></description>
			<content:encoded><![CDATA[<p>My clients at Cloudera have been around for a while, in effect positioned as &#8220;the Hadoop company.&#8221; Their business, in a nutshell, consists of:</p>
<ul>
<li>Packaging up <strong>a Cloudera distribution of Apache Hadoop.</strong> This distribution doesn&#8217;t have proprietary code; it&#8217;s just packaged by Cloudera from Apache projects (with a decent minority of the code happening to have been contributed by Cloudera engineers).</li>
<li>Paid subscription <strong>support for Apache Hadoop</strong> and, in connection with that &#8230;</li>
<li>&#8230;  <strong>proprietary software</strong> that all support customers automatically get. There are two points to this proprietary software:
<ul>
<li>It adds value for the customer.</li>
<li>It makes Cloudera&#8217;s support job easier.</li>
</ul>
</li>
<li><strong>Professional services</strong> around Hadoop.</li>
<li><strong>Training and conferences</strong> around Hadoop, which probably don&#8217;t generate all that much money, but are great marketing in terms of visibility, thought leadership, and lead generation.</li>
</ul>
<p><strong>Hortonworks</strong> spun out of Yahoo last week, with parts of the Cloudera business model, namely Hadoop support, training, and I guess conferences. Hortonworks emphatically rules out professional services, and says that it will contribute all code back to Apache Hadoop. Hortonworks does grudgingly admit that it might get into the proprietary software business at some point &#8212; but evidently hopes that day will never actually come.</p>
<p><span id="more-4939"></span>Hortonworks&#8217; two main initial marketing messages &#8212; and there&#8217;s some synergy between these &#8212; boil down to:</p>
<ul>
<li>Open source purism</li>
<li>&#8220;We have most of the Hadoop developers, so we&#8217;re better&#8221;*</li>
</ul>
<p>Frankly, the open source purism part sounds like doubletalk to me, in that Hortonworks has trouble articulating what supposedly-less-pure Cloudera does wrong that Hortonworks will do better. However, I&#8217;ve been hearing for a long time that Yahoo&#8217;s MapReduce developers feel very strongly about open source, so perhaps this is in part an emotional issue for them. More substantively, it fits well with the pro-Hortonworks story I&#8217;ve outlined below.</p>
<p><em>*&#8221;We have most of the Hadoop developers&#8221; seems fairly defensible, give or take dueling definitions of &#8220;committer,&#8221; &#8220;core developer,&#8221; &#8220;patch&#8221; or for that matter &#8220;Hadoop.&#8221;</em></p>
<p>The other branch of the Hortonworks marketing message can be lampooned as &#8220;We&#8217;re the right folks to identify your bugs, since we&#8217;re probably the ones who put them there in the first place.&#8221; More darkly, that pitch could be &#8220;If you want the bugs fixed that bother you, we&#8217;re the ones who have control over whether or not that happens.&#8221; Well, maybe. But I also see Cloudera having a couple of years of experience supporting Hadoop, as well as shipping some code that perhaps makes Hadoop more supportable.</p>
<p>That&#8217;s the skeptical view. <strong>A more favorable view of Hortonworks&#8217; prospects </strong>would go something like this:</p>
<ul>
<li>One version of Apache Hadoop is plenty.</li>
<li>Cloudera (and arguably other Hadoop platform software vendors) sell capabilities that will soon be eclipsed by core Apache Hadoop. Folks should just please wait.</li>
<li>Now that Hortonworks is an independent company focused on the task, it will speedily solve the packaging problems that have made Cloudera&#8217;s Hadoop distribution (perceived to be) necessary.</li>
<li>Yahoo and IBM both back Hortonworks&#8217; approach. That&#8217;s got to count for something.</li>
<li>Apache Hadoop will be quickly enhanced, and Hortonworks will be driving the enhancements. Hortonworks simply is the top Hadoop authority.</li>
</ul>
<p>We&#8217;ll see. Cloudera&#8217;s been around for a couple years, has smart people, and by definition has no technical inferiority to Hortonworks (since it has access to all Hortonworks&#8217; code). What&#8217;s more, it will be a long time before Hadoop technology is so mature that there&#8217;s nothing left to do; add-on software should long prove to be useful. As for &#8220;We&#8217;re purer about open source than the other guys&#8221; &#8212; well, I&#8217;m dubious that that will turn out to be a great marketing message.</p>
<p>And so I think Cloudera is the early favorite in the competition. But perhaps Hadoop users will be able to play Cloudera and Hortonworks off against each other in price negotiations. Perhaps, notwithstanding <a href="../../../../../2011/06/02/why-you-would-want-an-appliance-and-when-you-wouldnt/">my skepticism about Hadoop appliances</a>, some hardware vendors will play them against each other for appliance partnerships.</p>
<p>Meanwhile, whatever else happens, I&#8217;m pretty psyched about <a href="http://www.dbms2.com/2011/07/10/hadoop-futures-and-enhancements/">some enhancements the Hortonworks folks plan to lead for Hadoop</a>.</p>
<p><strong><em>Related links</em></strong></p>
<ul>
<li>A <a href="http://www.monash.com/uploads/Hortonworks-Apache-Hadoop-July-2011.pptx">Hortonworks/Apache Hadoop slide deck</a> Hortonworks graciously allowed me to post</li>
<li>Cloudera&#8217;s post about its recent <a href="http://www.cloudera.com/blog/2011/07/the-only-full-lifecycle-management-for-apache-hadoop-introducing-cloudera-enterprise-3-5-and-scm-express/">3.5 release of Cloudera Enterprise</a></li>
<li>Pros and cons of <a href="http://www.softwarememories.com/2011/07/10/when-professional-services-and-software-mix/">professional services efforts at young software companies</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/10/cloudera-and-hortonworks/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Petabyte-scale Hadoop clusters (dozens of them)</title>
		<link>http://www.dbms2.com/2011/07/06/petabyte-hadoop-clusters/</link>
		<comments>http://www.dbms2.com/2011/07/06/petabyte-hadoop-clusters/#comments</comments>
		<pubDate>Wed, 06 Jul 2011 05:15:21 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4886</guid>
		<description><![CDATA[I recently learned that there are 7 Vertica clusters with a petabyte (or more) each of user data. So I asked around about other petabyte-scale clusters. It turns out that there are several dozen such clusters (at least) running Hadoop. Cloudera can identify 22 CDH (Cloudera Distribution [of] Hadoop) clusters holding one petabyte or more [...]]]></description>
			<content:encoded><![CDATA[<p>I recently learned that there are <a href="../../../../../2011/06/20/columnar-dbms-vendor-customer-metrics/">7 Vertica clusters with a petabyte</a> (or more) each of user data. So I asked around about other petabyte-scale clusters. It turns out that there are several dozen such clusters (at least) running Hadoop.</p>
<p>Cloudera can identify 22 CDH (Cloudera Distribution [of] Hadoop) clusters holding one petabyte or more of user data each, at 16 different organizations. This does not count Facebook or Yahoo, who are huge Hadoop users but not, I gather, running CDH. Meanwhile, Eric Baldeschwieler of Hortonworks tells me that Yahoo&#8217;s latest stated figures are:</p>
<ul>
<li>42,000 Hadoop nodes &#8230;</li>
<li>&#8230; holding 180-200 petabytes of data.</li>
</ul>
<p><span id="more-4886"></span>That works out near the low end of the range I came up with for Yahoo&#8217;s newest gear, namely <a href="http://www.dbms2.com/2011/07/06/hadoop-hardware-and-compression/">36-90 TB/node</a>. Yahoo&#8217;s biggest clusters are a little over 4,000 nodes (a limitation that&#8217;s getting worked on), and Yahoo has over 20 clusters in total.</p>
<p>Based on those numbers, it would seem that 10 or more of Yahoo&#8217;s Hadoop clusters are probably in the petabyte range. Facebook no doubt has a few petabyte-scale Hadoop clusters as well. So we&#8217;re probably over 3 dozen petabyte+ Hadoop clusters, just counting Yahoo, Facebook, and CDH users. There surely are others too, running Apache Hadoop without Cloudera&#8217;s help.</p>
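<p>The count above can be tallied in a few lines. This is only a sketch: the Facebook figure is an assumed stand-in for &#8220;a few,&#8221; and the per-cluster average treats Yahoo as having exactly 20 clusters, which overstates the average since the stated figure is &#8220;over 20.&#8221;</p>

```python
CDH_CLUSTERS = 22       # Cloudera's count of petabyte+ CDH clusters
YAHOO_CLUSTERS = 10     # lower bound, from "10 or more"
FACEBOOK_CLUSTERS = 4   # assumption standing in for "a few"


def petabyte_cluster_floor():
    """Lower bound on petabyte+ Hadoop clusters under the assumptions above."""
    return CDH_CLUSTERS + YAHOO_CLUSTERS + FACEBOOK_CLUSTERS


def avg_pb_per_cluster(total_pb, clusters=20):
    """Average petabytes per Yahoo cluster if there were exactly 20 clusters."""
    return total_pb / clusters


print(petabyte_cluster_floor())
print(avg_pb_per_cluster(180), avg_pb_per_cluster(200))
```

<p>An average of several petabytes per cluster is consistent with the guess that 10 or more of Yahoo&#8217;s clusters are individually in the petabyte range.</p>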
<p>We also have some more information about the scale of Hadoop usage, and the markets it is being used in, because Omer Trajman of Cloudera kindly wrote the following &#8212; lightly edited as usual &#8212; for quotation:</p>
<blockquote><p>The number of Petabyte+ Hadoop clusters expanded dramatically over the past year, with our recent count reaching 22 in production (in addition to the well-known clusters at Yahoo! and Facebook). Just as our poll back at Hadoop World 2010 showed the average cluster size at just over 60 nodes, today it tops 200. While mean is not the same as median (most clusters are under 30 nodes), there are some beefy ones pulling up that average. Outside of the well-known large clusters at Yahoo and Facebook, we count today 16 organizations running PB+ clusters running CDH across a diverse number of industries including online advertising, retail, government, financial services, online publishing, web analytics and academic research. We expect to see many more in the coming years, as Hadoop gets easier to use and more accessible to a wide variety of enterprise organizations.</p></blockquote>
<p>Omer went on to add:</p>
<blockquote><p>The biggest number of PB clusters are in the advertising space. I often tell people that every ad you see on the internet touched at least one Hadoop cluster (or the Google equivalent).</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/06/petabyte-hadoop-clusters/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Hardware for Hadoop</title>
		<link>http://www.dbms2.com/2011/06/04/hardware-for-hadoop/</link>
		<comments>http://www.dbms2.com/2011/06/04/hardware-for-hadoop/#comments</comments>
		<pubDate>Sat, 04 Jun 2011 22:47:12 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Pricing]]></category>
		<category><![CDATA[Storage]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4610</guid>
		<description><![CDATA[After suggesting that there&#8217;s little point to Hadoop appliances, it occurred to me to look into what kinds of hardware actually are used with Hadoop. So far as I can tell: Hadoop nodes today tend to run on fairly standard boxes. Hadoop nodes in the past have tended to run on boxes that were light [...]]]></description>
			<content:encoded><![CDATA[<p>After suggesting that <a href="http://www.dbms2.com/2011/06/02/why-you-would-want-an-appliance-and-when-you-wouldnt/">there&#8217;s little point to Hadoop appliances</a>, it occurred to me to look into what kinds of hardware actually are used with Hadoop. So far as I can tell:</p>
<ul>
<li>Hadoop nodes today tend to run on fairly standard boxes.</li>
<li>Hadoop nodes in the past have tended to run on boxes that were light with respect to RAM.</li>
<li>The number of spindles per core on Hadoop node boxes is going up even as disks get bigger.</li>
</ul>
<p><span id="more-4610"></span>A key input comes from Cloudera, who to my joy delegated the questions to Omer Trajman, who wrote:</p>
<blockquote><p>Most Hadoop deployments today use systems with dual socket and quad or  hex cores (8 or 12 cores total, 16 or 24 hyper-threaded). Storage has  increased as well with 6-8 spindles being common and some deployments  going to 12 spindles. These are SATA disks with between 1TB and 2TB  capacity. The amount of RAM varies depending on the application. 24GB is  common as is 36GB – all ECC RAM. HBase clusters may have more RAM so  they can cache more data. Some customers put Hadoop on their “standard  box” which may not be perfectly balanced (e.g. more RAM, less disk) and  needs to be altered slightly to meet the above specs. The new Dell C2100  series and the HP SL170 series are both popular server lines for  Hadoop.</p>
<p>For a year ago perspective, see this post: <a href="http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/" target="_blank">http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/</a></p></blockquote>
<p>Bullet points from that year-ago link include:</p>
<blockquote>
<ul>
<li>4 1TB hard disks in a JBOD (Just a Bunch Of Disks) configuration</li>
<li>2 quad core CPUs, running at least 2-2.5GHz</li>
<li>16-24GBs of RAM (24-32GBs if you’re considering HBase)</li>
<li>Gigabit Ethernet</li>
</ul>
</blockquote>
<p>So basically we&#8217;re talking in the range of 2-3 GB of RAM per core &#8212; and 1 spindle per core, up from perhaps half a spindle per core a year ago.</p>
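<p>Those per-core ratios can be reproduced with a short sketch. The pairing of 24 GB with 8-core boxes and 36 GB with 12-core boxes is an assumption; the quoted text does not say which RAM sizes go with which core counts.</p>

```python
def per_core(ram_gb, spindles, cores):
    """Return (GB of RAM per core, spindles per core)."""
    return ram_gb / cores, spindles / cores


# Year-ago recommendation: 2 quad-core CPUs, 4 disks, 16-24 GB of RAM.
print(per_core(16, 4, 8), per_core(24, 4, 8))

# Current boxes; which RAM size goes with which core count is an assumption.
print(per_core(24, 8, 8), per_core(36, 12, 12))
```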
<p>Meanwhile, a 2009 <a href="https://opencirrus.org/system/files/OpenCirrusHadoop2009.ppt">Yahoo  slide deck</a> refers to &#8220;500 nodes, 4000 cores, 3TB RAM, 1.5PB disk&#8221;;  that divides out to 8 cores, 6 GB of RAM, and 3 TB of disk per node, all  on &#8220;commodity hardware.&#8221; By 2010 Yahoo was evidently up to <a href="http://twitter.com/#!/marin_dimitrov/status/12900368052">2 GB of RAM per core</a>.</p>
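<p>The division for the 2009 Yahoo figures can be sketched as follows, assuming decimal units (1 TB = 1,000 GB, 1 PB = 1,000 TB). The RAM-per-core line is an extra derived figure, not one stated in the slide deck.</p>

```python
NODES, CORES = 500, 4000
RAM_GB = 3 * 1000      # 3 TB of RAM, decimal units assumed
DISK_TB = 1.5 * 1000   # 1.5 PB of disk, decimal units assumed


def per_node():
    """Return (cores, GB of RAM, TB of disk) per node."""
    return CORES / NODES, RAM_GB / NODES, DISK_TB / NODES


print(per_node())
print(RAM_GB / CORES)  # GB of RAM per core in the 2009 figures
```

<p>Under those assumptions the 2009 cluster works out to well under 1 GB of RAM per core, which is consistent with older Hadoop boxes being light on RAM.</p>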
<p>There are lots of data points on the <a href="http://wiki.apache.org/hadoop/PoweredBy">Apache Hadoop wiki</a>, but many seem a few years old, and I don&#8217;t immediately see how to time-stamp them. Overall, they seem consistent with the trends I noted at the top of the post.</p>
<p>One thing I haven&#8217;t done is attempt to price any of these systems.</p>
<p>Contributions in the comment thread would be warmly appreciated.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/06/04/hardware-for-hadoop/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Yahoo wants to do decapetabyte-scale data warehousing in Hadoop</title>
		<link>http://www.dbms2.com/2009/10/01/yahoos-decapetabyte-data-warehousinghadoop/</link>
		<comments>http://www.dbms2.com/2009/10/01/yahoos-decapetabyte-data-warehousinghadoop/#comments</comments>
		<pubDate>Thu, 01 Oct 2009 07:05:06 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=974</guid>
		<description><![CDATA[My old client Mark Tsimelzon moved over to Yahoo after Coral8 was acquired, and I caught up with him last month. He turns out to be running development for a significant portion of Yahoo&#8217;s Hadoop effort &#8212; everything other than HDFS (Hadoop Distributed File System). Yahoo evidently plans to, within a year or so, get [...]]]></description>
			<content:encoded><![CDATA[<p>My old client <a href="http://www.dbms2.com/2007/08/10/the-essence-of-cep-according-to-coral8">Mark Tsimelzon</a> moved over to Yahoo after Coral8 was acquired, and I caught up with him last month. He turns out to be running development for a significant portion of Yahoo&#8217;s Hadoop effort &#8212; everything other than HDFS (Hadoop Distributed File System). Yahoo evidently plans to, within a year or so, get Hadoop to the point that it is managing 10s of petabytes of data for Yahoo, with reasonable data warehousing functionality.</p>
<p style="margin-bottom: 0in;">Highlights of our visit included:</p>
<ul>
<li>There are dozens of people at 	Yahoo doing Hadoop development that will wind up getting open 	sourced. (Full-time or close to it.) In particular, everything 	Mark&#8217;s team does goes to open source.</li>
<li>Yahoo is moving as much of its 	analytics to Hadoop as possible. Much of this is being moved away 	from <a href="http://www.dbms2.com/2009/09/19/oracle-database-siz/">Oracle</a> and from Yahoo&#8217;s own <a href="http://www.dbms2.com/2009/07/06/yahoo-is-up-to-10-petabytes-now/">Everest</a>.</li>
<li>A column store 	is being put on top of HDFS, based on Yahoo technology. Columns will 	be striped across nodes. Perhaps that&#8217;s why the effort is called 	Project Zebra.</li>
<li>Mark believes 	that in a year Hadoop will be much further along in meeting 	traditional data warehousing requirements, in areas such as:
<ul>
<li>Metadata</li>
<li>SLAs/high 	availability/other workload management</li>
<li>Data retention 	policies</li>
<li>Security/privacy*</li>
</ul>
</li>
<li>Yahoo views 	the time-to-market benefits of Hadoop as being more important than 	TCO.</li>
</ul>
<p style="margin-bottom: 0in; font-style: normal;"><em><span id="more-974"></span>*I also spoke with a couple of Mark&#8217;s Yahoo colleagues, on his introduction, who are being less helpful than he is about clarifying what I am or am not allowed to say for publication. But I will say that I was heartened by the degree of concern they showed for doing the right thing with regard to privacy. I was not as heartened by the concrete ideas &#8212; or lack thereof &#8212; for making it happen. But frankly, I don&#8217;t think it&#8217;s a solvable technical problem. Rather, it should be <a href="http://www.monashreport.com/2006/06/06/freedom-even-without-data-privacy/">a huge priority on the legal/political front</a>.</em></p>
<p style="margin-bottom: 0in;">We also talked some about Pig, Yahoo&#8217;s non-SQL DML (Data Manipulation Language) for Hadoop, which is nonetheless getting a SQL interface. And we talked about Pig vs. <a href="http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/">Hive</a>. But I recently heard a rumor that all of that is in flux, so I won&#8217;t write it up now.</p>
<p style="margin-bottom: 0in;">Mark sent along a couple of interesting slide presentations by a colleague. After some back and forth as to whether I could post them, he suggested I post <a href="http://developer.yahoo.net/blogs/theater/archives/2009/09/welcome_hadoop_summit.html">these</a> <a href="http://developer.yahoo.net/blogs/theater/archives/2009/06/hadoopsummit_shugar.html">links</a> to similar material instead.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/01/yahoos-decapetabyte-data-warehousinghadoop/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Oracle gives a few customer database size examples</title>
		<link>http://www.dbms2.com/2009/09/19/oracle-database-siz/</link>
		<comments>http://www.dbms2.com/2009/09/19/oracle-database-siz/#comments</comments>
		<pubDate>Sun, 20 Sep 2009 00:40:52 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=905</guid>
		<description><![CDATA[In its recent quarterly conference call, Oracle said (as per the Seeking Alpha transcript): AC Neilsen, for instance, we deployed a 45-terabyte data [mart], they called it; Adidas, 13 terabytes; Australian Bureau of Statistics, 250 terabytes; and of course, some of our high-end ones that you have probably heard of in the past, AT&#38;T, 250 [...]]]></description>
			<content:encoded><![CDATA[<p>In its recent quarterly conference call, Oracle said (as per <a href="http://seekingalpha.com/article/161887-oracle-f1q10-qtr-end-8-31-09-earnings-call-transcript?page=5">the Seeking Alpha transcript</a>):</p>
<blockquote><p>AC Neilsen, for instance, we deployed a 45-terabyte data [mart], they called it; Adidas, 13 terabytes; Australian Bureau of Statistics, 250 terabytes; and of course, some of our high-end ones that you have probably heard of in the past, AT&amp;T, 250 terabytes; Yahoo!, 700 terabytes &#8212; just gives you an idea of the size of the databases that are out there and how they are growing, and that’s driving the need for greater throughput.</p></blockquote>
<p>I don&#8217;t know what&#8217;s being counted there, but I wouldn&#8217;t be surprised if those were legit user-data figures.</p>
<p>Some other notes:</p>
<ul>
<li><span style="text-decoration: line-through;">The Yahoo database is of course Yahoo&#8217;s first-generation data warehouse, which has been largely superseded by <a href="http://www.dbms2.com/2009/07/06/yahoo-is-up-to-10-petabytes-now/">an internal system more than 10X that size</a>.</span> <em>(Edit: Actually, Greg Rahn of Oracle says below that it&#8217;s a different database.)</em></li>
<li>I&#8217;m keynoting the Netezza road show this month, and Nielsen is up there on stage touting Netezza. <em>(Edit: <a href="http://www.dbms2.com/2009/09/29/a-c-nielsen-data-warehousing-dbms/">Nielsen indeed does the overwhelming majority of its data warehousing on Netezza</a>.)</em></li>
<li>I&#8217;d be surprised if AT&amp;T&#8217;s largest data warehouse were &#8220;only&#8221; 250 terabytes in size. <em>(Edit: Actually, I am told the database in question is 310 TB of user data and growing. More later, hopefully.)</em></li>
<li>Oracle didn&#8217;t exactly say that those were Exadata installations.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/09/19/oracle-database-siz/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Yahoo is up to 10 petabytes now?</title>
		<link>http://www.dbms2.com/2009/07/06/yahoo-is-up-to-10-petabytes-now/</link>
		<comments>http://www.dbms2.com/2009/07/06/yahoo-is-up-to-10-petabytes-now/#comments</comments>
		<pubDate>Mon, 06 Jul 2009 06:03:54 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=832</guid>
		<description><![CDATA[According to somebody (I forget who) who attended Yahoo&#8217;s SIGMOD presentation last week, the big Yahoo database is now up to 10 petabytes in size, in line with Yahoo&#8217;s predictions last year.  Apparently, Yahoo also gave more details of how the technology works.]]></description>
			<content:encoded><![CDATA[<p>According to somebody (I forget who) who attended Yahoo&#8217;s SIGMOD presentation last week, <a href="http://www.dbms2.com/2008/05/29/yahoo-scales-web-analytics-database-petabyte/">the big Yahoo database</a> is now up to 10 petabytes in size, in line with Yahoo&#8217;s predictions last year.  Apparently, Yahoo also gave more details of how the technology works.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/07/06/yahoo-is-up-to-10-petabytes-now/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Facebook, Hadoop, and Hive</title>
		<link>http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/</link>
		<comments>http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/#comments</comments>
		<pubDate>Mon, 11 May 2009 08:29:08 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=775</guid>
		<description><![CDATA[A few weeks ago, I posted about a conversation I had with Jeff Hammerbacher of Cloudera, in which he discussed a Hadoop-based effort at Facebook he previously directed. Subsequently, Ashish Thusoo and Joydeep Sarma of Facebook contacted me to expand upon and in a couple of instances correct what Jeff had said. They also filled [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">A few weeks ago, I posted about a conversation I had with Jeff Hammerbacher of Cloudera, in which he discussed a Hadoop-based effort at Facebook he previously directed. Subsequently, Ashish Thusoo and Joydeep Sarma of Facebook contacted me to expand upon and in a couple of instances correct what Jeff had said.  They also filled me in on Hive, a data-manipulation add-on to Hadoop that they developed and subsequently open-sourced.</p>
<p style="margin-bottom: 0in;">Updating the metrics in <a href="http://www.dbms2.com/2009/04/15/cloudera-presents-the-mapreduce-bull-case/">my Cloudera post</a>,</p>
<ul>
<li>Facebook has 400 terabytes of disk 	managed by Hadoop/Hive, with a slightly better than 6:1 overall 	compression ratio. So the <strong>2 1/2 petabytes</strong> figure for user 	data is reasonable.</li>
<li>Facebook&#8217;s Hadoop/Hive system 	ingests <strong>15 terabytes of new data per day</strong> now, not 10.</li>
<li>Hadoop/Hive cycle times aren&#8217;t as 	fast as I thought I heard from Jeff.  Ad targeting queries are the 	most frequent, and they&#8217;re run <strong>hourly.</strong> Dashboards are 	repopulated <strong>daily.</strong></li>
</ul>
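<p style="margin-bottom: 0in;">The compression arithmetic in the first bullet can be checked with a rough sketch. It assumes the 400 terabytes is compressed data on disk, ignores any replication overhead, and uses 6.25:1 as a hypothetical stand-in for &#8220;slightly better than 6:1&#8221;; none of those assumptions was confirmed by Facebook.</p>

```python
def logical_size_tb(disk_tb: float, compression_ratio: float) -> float:
    """Uncompressed (user-data) size implied by compressed disk usage."""
    return disk_tb * compression_ratio


DISK_TB = 400.0  # figure quoted above

# 6.0 is the stated floor; 6.25 is an assumed value for "slightly better".
for ratio in (6.0, 6.25):
    size_pb = logical_size_tb(DISK_TB, ratio) / 1000.0  # decimal units: 1 PB = 1000 TB
    print(f"{ratio}:1 -> {size_pb:.2f} PB")
```

<p style="margin-bottom: 0in;">At exactly 6:1 the result is 2.40 PB, and at the assumed 6.25:1 it is 2.50 PB, which is why the 2 1/2 petabyte figure looks reasonable.</p>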
<p style="margin-bottom: 0in;">Nothing else in my Cloudera post was called out as being wrong.</p>
<p style="margin-bottom: 0in;">In a new-to-me metric, Facebook has <strong>610 Hadoop nodes, running in a single cluster,</strong> due to be increased to 1000 soon.  Facebook thinks this is the second-largest* Hadoop installation, or else close to it.  What&#8217;s more, Facebook believes it is unusual in spreading all its apps across a single huge cluster, rather than doing different kinds of work on different, smaller sub-clusters.<span id="more-775"></span></p>
<p style="margin-bottom: 0in;"><em>*Apparently, Yahoo is at 2000 nodes (and headed for 4000), 1000 or so of which are operated as a single cluster for a single app.</em></p>
<p style="margin-bottom: 0in;">Facebook decided in 2007 to move what was then a 15 terabyte big-DBMS-vendor data warehouse to Hadoop &#8212; augmented by Hive &#8212; rather than to an MPP data warehouse DBMS. Major drivers of the choice included:</p>
<ul>
<li><strong>License/maintenance costs.</strong> Free is a good price.</li>
<li><strong>Open source flexibility.</strong> Facebook is one of the few users I&#8217;ve ever spoken with that actually 	cares about modifying open source code.</li>
<li><strong>Ability to run on cheap 	hardware.</strong> Facebook runs real-time MySQL instances on boxes that 	cost $10K or so, and would expect to pay at least as much for an MPP 	DBMS node. But Hadoop nodes run on boxes that cost no more than $4K, 	and sometimes (depending e.g. on whether they have any disk at all) 	as little as $2K. These are &#8220;true&#8221; commodity boxes; they 	don&#8217;t even use RAID.</li>
<li><strong>Ability to scale out to lots of 	nodes.</strong> Few of the new low-cost MPP DBMS vendors have production 	systems even today of &gt;100 nodes.  (Actually, I&#8217;m not certain 	that any except Netezza do, although Kognitio in a prior release of 	its technology once built a 900ish node production system.)</li>
<li><strong>Inherently better performance.</strong> Correctly or otherwise, the Facebook guys thought that Hadoop had 	performance advantages over DBMS, due to the lack of overhead 	associated with transactions and so on.</li>
</ul>
<p style="margin-bottom: 0in;">One option Facebook didn&#8217;t seriously consider was sticking with the incumbent, which Facebook folks regarded as &#8220;horrible&#8221; and a &#8220;lost cause.&#8221; The daily pipeline took more than 24 hours to process. Although aware that its big-DBMS-vendor warehouse could probably be tuned much better, Facebook didn&#8217;t see that as a path to growing its warehouse more than 100-fold.  (But based on my discussion with Cloudera, I gather that vendor&#8217;s DBMS is indeed used to run some reporting today.)</p>
<p style="margin-bottom: 0in;"><strong>Reliability of Facebook&#8217;s Hadoop/Hive system seems to be so-so.</strong> It&#8217;s designed for a few nodes at a time to fail; that&#8217;s no biggie. There&#8217;s a head node that&#8217;s a single point of failure; while there&#8217;s a backup node, I gather failover takes 15 minutes or so, a figure the Facebook guys think they could reduce substantially if they put their minds to it.  But users submitting long-running queries don&#8217;t seem to mind delays of up to an hour, as long as they don&#8217;t have to resubmit their queries. Keeping ETL up is a higher priority than keeping query execution going. Data loss would indeed be intolerable, but at that level Hadoop/Hive seems to be quite trustworthy.</p>
<p style="margin-bottom: 0in;">There also are occasional longer partial(?) outages, when an upgrade introduces a bug or something, but those don&#8217;t seem to be a major concern.</p>
<p style="margin-bottom: 0in;">Facebook&#8217;s variability in node hardware raises an obvious question &#8212; <strong>how does Hadoop deal with heterogeneous hardware among its nodes?</strong> Apparently a <em>fair scheduling</em> capability has been built for Hadoop, with Facebook as the first major user and Yahoo apparently moving in that direction as well.  As for inputs to the scheduler (or any more primitive workload allocator) &#8212; well, that depends on the kind of heterogeneity.</p>
<ul>
<li>Disk heterogeneity &#8212; a 	distributed file system reports back about disk.</li>
<li>CPU heterogeneity &#8212; different 	nodes can be configured to run different numbers of concurrent tasks 	each.</li>
<li>RAM heterogeneity &#8212; Hadoop does 	not understand the memory requirements of each task, and does not do 	a good job of matching tasks to boxes accordingly. But the Hadoop 	community is working to fix this.</li>
</ul>
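<p style="margin-bottom: 0in;">The CPU point above (different nodes configured for different numbers of concurrent tasks) can be illustrated with a toy allocator. This is only a sketch under assumed names; it is not Hadoop&#8217;s fair scheduler or any real Hadoop API, and the node names and slot counts are hypothetical.</p>

```python
def assign_tasks(task_ids, slots_per_node):
    """Greedily place each task on the node with the most free task slots.

    Returns (placement, pending), where pending holds tasks that found
    no free slot anywhere.
    """
    free = dict(slots_per_node)  # remaining slots per node
    placement = {node: [] for node in slots_per_node}
    pending = []
    for task in task_ids:
        node = max(free, key=free.get)  # ties go to the first node listed
        if free[node] == 0:
            pending.append(task)  # every node is full
            continue
        placement[node].append(task)
        free[node] -= 1
    return placement, pending


# Hypothetical cluster: a bigger box configured for 4 tasks, a smaller one for 2.
placement, pending = assign_tasks(range(7), {"big": 4, "small": 2})
print(placement, pending)
```

<p style="margin-bottom: 0in;">With seven tasks and six slots, the bigger node ends up with four tasks, the smaller with two, and one task waits. Note that nothing here accounts for memory, which matches the RAM limitation described above.</p>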
<p style="margin-bottom: 0in; font-style: normal;"><strong>Further notes on Hive</strong></p>
<p style="margin-bottom: 0in; font-style: normal;">Without Hive, some basic Hadoop data manipulations can be a pain in the butt.  A GROUP BY or the equivalent could take &gt;100 lines of Java or Python code, and unless the person writing it knew something about database technology, it could use some pretty sub-optimal algorithms even then.  Enter Hive.</p>
<p style="margin-bottom: 0in; font-style: normal;">Hive sets out to fix this problem. Originally developed at Facebook (in Java, like Hadoop is), Hive was open-sourced last summer, by which time its SQL interface was in place, and now has 6 main developers. The essence of Hive seems to be:</p>
<ul>
<li>An interface 	that implements a subset of SQL</li>
<li>Compilation of 	that SQL into a MapReduce configuration file.</li>
<li>An execution 	engine to run same.</li>
</ul>
<p style="margin-bottom: 0in; font-style: normal;">The SQL implemented so far seems, unsurprisingly, to be what is most needed to analyze Facebook&#8217;s log files.  I.e., it&#8217;s some basic stuff, plus some timestamp functionality.  There also is an extensibility framework, and some ELT functionality.</p>
<p style="margin-bottom: 0in; font-style: normal;">Known users of Hive include Facebook (definitely in production) and hi5 (apparently in production as well). Also, there&#8217;s a Hive code committer from Last.fm.</p>
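<p style="margin-bottom: 0in; font-style: normal;">To make the GROUP BY point concrete, here is a minimal in-process sketch of the map, shuffle, and reduce steps that a hand-written job has to spell out. It is not Hive&#8217;s generated plan and does not use the Hadoop API; the record layout, table name, and function names are all assumptions made for illustration.</p>

```python
from collections import defaultdict

# Hypothetical log records: (user_id, page).
LOG = [("alice", "/home"), ("bob", "/home"), ("alice", "/photos")]


def count_mapper(record):
    """Map step: emit (grouping key, 1) for each log record."""
    user_id, _page = record
    yield user_id, 1


def count_reducer(_key, values):
    """Reduce step: collapse all values for one key into a count."""
    return sum(values)


def map_phase(records, mapper):
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group mapped values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    return {key: reducer(key, values) for key, values in groups.items()}


def group_by_count(records):
    # Roughly: SELECT user_id, COUNT(*) FROM page_views GROUP BY user_id
    # (page_views is a made-up table name).
    return reduce_phase(shuffle(map_phase(records, count_mapper)), count_reducer)


print(group_by_count(LOG))
```

<p style="margin-bottom: 0in; font-style: normal;">Even this toy version needs several functions to express what is one line of SQL, and a real job would add input parsing, serialization, and job configuration on top.</p>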
<p style="margin-bottom: 0in;"><em><strong>Other links about huge data warehouses:</strong></em></p>
<ul>
<li><a href="http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/">eBay</a> has a 6 1/2 petabyte database running on Greenplum and a 2 1/2 petabyte enterprise data warehouse running on Teradata.</li>
<li>Wal-Mart, Bank of America, another financial services company, and Dell also have <a href="../2008/10/15/teradatas-petabyte-power-players/">very large Teradata databases</a>.</li>
<li>Yahoo’s web/network events database, running on proprietary software, sounded about <a href="../2008/05/29/yahoo-scales-web-analytics-database-petabyte/">1/6th the size of eBay’s Greenplum system</a> when it was described about a year ago.</li>
<li>Fox Interactive Media/MySpace has multi-hundred terabyte databases running on each of <a href="../2009/03/05/fox-interactive-medias-multi-hundred-terabyte-database-running-on-greenplum/">Greenplum</a> and Aster Data <a href="../2009/03/05/myspaces-multi-hundred-terabyte-database-running-on-aster-data/">nCluster</a>.</li>
<li><a href="../2008/05/23/data-warehouse-appliance-power-user-teoco/">TEOCO has 100s of terabytes running on DATAllegro</a>.</li>
<li>To a probably lesser extent, the same is now also true of <a href="../2009/03/02/closing-the-book-on-the-datallegro-customer-base/">Dell</a>.</li>
<li><a href="../2009/04/25/vertica-pricing-and-customer-metrics/">Vertica has a couple of unnamed customers with databases in the 200 terabyte range</a>.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/feed/</wfw:commentRss>
		<slash:comments>46</slash:comments>
		</item>
		<item>
		<title>Some of Oracle&#8217;s largest data warehouses</title>
		<link>http://www.dbms2.com/2008/09/24/some-of-oracles-largest-data-warehouses/</link>
		<comments>http://www.dbms2.com/2008/09/24/some-of-oracles-largest-data-warehouses/#comments</comments>
		<pubDate>Thu, 25 Sep 2008 00:21:38 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=570</guid>
		<description><![CDATA[Googling around, I came across an Oracle presentation – given some time this year – that lists some of Oracle&#8217;s largest data warehouses. 10 databases total are listed with &#62;16 TB, which is fairly consistent with Larry Ellison&#8217;s confession during the Exadata announcement that Oracle has trouble over 10 TB (which is something I&#8217;ve gotten [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Googling around, I came across <a href="http://www.oracle.com/global/kr/download/seminar/2008/ilm/session1.pdf">an Oracle presentation</a> – given some time this year – that lists some of Oracle&#8217;s largest data warehouses. 10 databases total are listed with &gt;16 TB, which is fairly consistent with Larry Ellison&#8217;s confession during the <a href="http://www.dbms2.com/2008/09/24/oracle-exadata/">Exadata</a> announcement that Oracle has trouble over 10 TB (which is something I&#8217;ve gotten a lot of flak from a few Oracle partisans for pointing out &#8230; <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_biggrin.gif' alt=':D' class='wp-smiley' />  ).</p>
<p style="margin-bottom: 0in;">However, what&#8217;s being measured is probably not the same in all cases.  For example,  I think the Amazon 70 TB figure is obviously for spinning disk (elsewhere in the presentation it&#8217;s stated that Amazon has 71 TB of disk). But the 16 TB British Telecom figure probably is user data &#8212; indeed, it&#8217;s the same figure <a href="http://www.odshp.com/iqug/papers/SunSybasePRAR11-30-01.doc">Computergram</a> cited for BT user data way back in 2001.</p>
<p style="margin-bottom: 0in;">The list is:<span id="more-570"></span></p>
<ul>
<li>Acxiom 16 TB HP</li>
<li>Allstate 20 TB Sun (RAC)</li>
<li>Amazon 70 TB HP (RAC)</li>
<li>AT&amp;T 60 TB HP</li>
<li>British Telecom 16 TB HP</li>
<li>Cellcom 12 TB HP</li>
<li>Choicepoint 14 TB Sun</li>
<li>Cingular/AT&amp;T 25 TB HP</li>
<li>Claria 38 TB Sun</li>
<li>Colgate-Palm 10 TB IBM</li>
<li>Experian 14 TB Sun</li>
<li>France Telecom 36 TB HP</li>
<li>JPMC 40 TB IBM (RAC)</li>
<li>KTF 14 TB HP</li>
<li>Mastercard 40 TB IBM (RAC)</li>
<li>NASDAQ 35 TB Sun</li>
<li>NexTel 28 TB HP</li>
<li>NYSE Euronext 93 TB HP (RAC)</li>
<li>Reliance Ltd 13 TB Sun</li>
<li>Starwood 12 TB HP</li>
<li>Sprint/Nextel 110 TB HP</li>
<li>TIM (Italy) 12 TB HP (RAC)</li>
<li>Turkcell 14 TB Sun (RAC)</li>
<li>UBS AG 15 TB Sun</li>
<li>UPS 10 TB HP</li>
<li>Yahoo! 250 TB Fujitsu</li>
</ul>
<p style="margin-bottom: 0in;">I happen to have been on the phone with Phil Francisco of Netezza tonight, and he confirmed that Netezza has larger installations (user data) than the figures cited above at several of those customers, including Acxiom and NYSE Euronext.  However, Phil confesses that he might have trouble getting up to 10 users at &gt; 15 TB of data if &#8212; which I think would be the fairest comparison &#8212; he had to restrict himself to only those who have given Netezza permission to publicize their names.*</p>
<p style="margin-bottom: 0in;"><em>*Phil emphatically says Netezza has more than that overall.  But the customers one is allowed to name, let alone disclose database sizes for, are only a fraction of the overall total.</em></p>
<p style="margin-bottom: 0in;">Meanwhile, I suspect that Reliance might be what turned into one of Greenplum&#8217;s flagship accounts.  And despite its ongoing Oracle relationship Yahoo has <a href="http://www.dbms2.com/2008/05/29/yahoo-scales-web-analytics-database-petabyte/">a much bigger data warehouse</a> based on Postgres technology.</p>
<p style="margin-bottom: 0in;">As usual, the preponderance of telecom customers is striking.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2008/09/24/some-of-oracles-largest-data-warehouses/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

