<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; Yahoo</title>
	<atom:link href="http://www.dbms2.com/category/users/yahoo/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Thu, 02 Sep 2010 09:06:44 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Yahoo wants to do decapetabyte-scale data warehousing in Hadoop</title>
		<link>http://www.dbms2.com/2009/10/01/yahoos-decapetabyte-data-warehousinghadoop/</link>
		<comments>http://www.dbms2.com/2009/10/01/yahoos-decapetabyte-data-warehousinghadoop/#comments</comments>
		<pubDate>Thu, 01 Oct 2009 07:05:06 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=974</guid>
		<description><![CDATA[My old client Mark Tsimelzon moved over to Yahoo after Coral8 was acquired, and I caught up with him last month. He turns out to be running development for a significant portion of Yahoo&#8217;s Hadoop effort &#8212; everything other than HDFS (Hadoop Distributed File System). Yahoo evidently plans to, within a year or so, get [...]]]></description>
			<content:encoded><![CDATA[<p>My old client <a href="http://www.dbms2.com/2007/08/10/the-essence-of-cep-according-to-coral8" >Mark Tsimelzon</a> moved over to Yahoo after Coral8 was acquired, and I caught up with him last month. He turns out to be running development for a significant portion of Yahoo&#8217;s Hadoop effort &#8212; everything other than HDFS (Hadoop Distributed File System). Yahoo evidently plans to, within a year or so, get Hadoop to the point that it is managing 10s of petabytes of data for Yahoo, with reasonable data warehousing functionality.</p>
<p style="margin-bottom: 0in;">Highlights of our visit included:</p>
<ul>
<li>There are dozens of people at Yahoo doing Hadoop development that will wind up getting open sourced. (Full-time or close to it.) In particular, everything Mark&#8217;s team does goes to open source.</li>
<li>Yahoo is moving as much of its analytics to Hadoop as possible. Much of this is being moved away from <a href="http://www.dbms2.com/2009/09/19/oracle-database-siz/" >Oracle</a> and from Yahoo&#8217;s own <a href="http://www.dbms2.com/2009/07/06/yahoo-is-up-to-10-petabytes-now/" >Everest</a>.</li>
<li>A column store is being put on top of HDFS, based on Yahoo technology. Columns will be striped across nodes. Perhaps that&#8217;s why the effort is called Project Zebra.</li>
<li>Mark believes that in a year Hadoop will be much further along in meeting traditional data warehousing requirements, in areas such as:
<ul>
<li>Metadata</li>
<li>SLAs/high availability/other workload management</li>
<li>Data retention policies</li>
<li>Security/privacy*</li>
</ul>
</li>
<li>Yahoo views the time-to-market benefits of Hadoop as being more important than TCO.</li>
</ul>
<p style="margin-bottom: 0in; font-style: normal;"><em><span id="more-974"></span>*I also spoke with a couple of Mark&#8217;s Yahoo colleagues, on his introduction, who are being less helpful than he is about clarifying what I am or am not allowed to say for publication. But I will say that I was heartened by the degree of concern they showed for doing the right thing with regard to privacy. I was not as heartened by the concrete ideas &#8212; or lack thereof &#8212; for making it happen. But frankly, I don&#8217;t think it&#8217;s a solvable technical problem. Rather, it should be <a href="http://www.monashreport.com/2006/06/06/freedom-even-without-data-privacy/" onclick="javascript:pageTracker._trackPageview('/www.monashreport.com');">a huge priority on the legal/political front</a>.</em></p>
<p style="margin-bottom: 0in;">We also talked some about Pig, Yahoo&#8217;s non-SQL DML (Data Manipulation Language) for Hadoop, which is, however, getting a SQL interface. And we talked about Pig vs. <a href="http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/" >Hive</a>. But I recently heard a rumor that all of this is in flux, so I won&#8217;t write it up now.</p>
<p style="margin-bottom: 0in;">Mark sent along a couple of interesting slide presentations by a colleague. After some back and forth as to whether I could post them, he suggested I post <a href="http://developer.yahoo.net/blogs/theater/archives/2009/09/welcome_hadoop_summit.html" onclick="javascript:pageTracker._trackPageview('/developer.yahoo.net');">these</a> <a href="http://developer.yahoo.net/blogs/theater/archives/2009/06/hadoopsummit_shugar.html" onclick="javascript:pageTracker._trackPageview('/developer.yahoo.net');">links</a> to similar material instead.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/01/yahoos-decapetabyte-data-warehousinghadoop/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Oracle gives a few customer database size examples</title>
		<link>http://www.dbms2.com/2009/09/19/oracle-database-siz/</link>
		<comments>http://www.dbms2.com/2009/09/19/oracle-database-siz/#comments</comments>
		<pubDate>Sun, 20 Sep 2009 00:40:52 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=905</guid>
		<description><![CDATA[In its recent quarterly conference call, Oracle said (as per the Seeking Alpha transcript):
AC Neilsen, for instance, we deployed a 45-terabyte data [mart], they called it; Adidas, 13 terabytes; Australian Bureau of Statistics, 250 terabytes; and of course, some of our high-end ones that you have probably heard of in the past, AT&#38;T, 250 terabytes; [...]]]></description>
			<content:encoded><![CDATA[<p>In its recent quarterly conference call, Oracle said (as per <a href="http://seekingalpha.com/article/161887-oracle-f1q10-qtr-end-8-31-09-earnings-call-transcript?page=5" onclick="javascript:pageTracker._trackPageview('/seekingalpha.com');">the Seeking Alpha transcript</a>):</p>
<blockquote><p>AC Neilsen, for instance, we deployed a 45-terabyte data [mart], they called it; Adidas, 13 terabytes; Australian Bureau of Statistics, 250 terabytes; and of course, some of our high-end ones that you have probably heard of in the past, AT&amp;T, 250 terabytes; Yahoo!, 700 terabytes &#8212; just gives you an idea of the size of the databases that are out there and how they are growing, and that’s driving the need for greater throughput.</p></blockquote>
<p>I don&#8217;t know what&#8217;s being counted there, but I wouldn&#8217;t be surprised if those were legit user-data figures.</p>
<p>Some other notes:</p>
<ul>
<li><span style="text-decoration: line-through;">The Yahoo database is of course Yahoo&#8217;s first-generation data warehouse, which has been largely superseded by <a href="http://www.dbms2.com/2009/07/06/yahoo-is-up-to-10-petabytes-now/" >an internal system more than 10X that size</a>.</span> <em>(Edit: Actually, Greg Rahn of Oracle says below that it&#8217;s a different database.)</em></li>
<li>I&#8217;m keynoting the Netezza road show this month, and Nielsen is up there on stage touting Netezza. <em>(Edit: <a href="http://www.dbms2.com/2009/09/29/a-c-nielsen-data-warehousing-dbms/" >Nielsen indeed does the overwhelming majority of its data warehousing on Netezza</a>.)</em></li>
<li>I&#8217;d be surprised if AT&amp;T&#8217;s largest data warehouse were &#8220;only&#8221; 250 terabytes in size. <em>(Edit: Actually, I am told the database in question is 310 TB of user data and growing. More later, hopefully.)</em></li>
<li>Oracle didn&#8217;t exactly say that those were Exadata installations.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/09/19/oracle-database-siz/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Yahoo is up to 10 petabytes now?</title>
		<link>http://www.dbms2.com/2009/07/06/yahoo-is-up-to-10-petabytes-now/</link>
		<comments>http://www.dbms2.com/2009/07/06/yahoo-is-up-to-10-petabytes-now/#comments</comments>
		<pubDate>Mon, 06 Jul 2009 06:03:54 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=832</guid>
		<description><![CDATA[According to somebody (I forget who) who attended Yahoo&#8217;s SIGMOD presentation last week, the big Yahoo database is now up to 10 petabytes in size, in line with Yahoo&#8217;s predictions last year.  Apparently, Yahoo also gave more details of how the technology works.
]]></description>
			<content:encoded><![CDATA[<p>According to somebody (I forget who) who attended Yahoo&#8217;s SIGMOD presentation last week, <a href="http://www.dbms2.com/2008/05/29/yahoo-scales-web-analytics-database-petabyte/" >the big Yahoo database</a> is now up to 10 petabytes in size, in line with Yahoo&#8217;s predictions last year.  Apparently, Yahoo also gave more details of how the technology works.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/07/06/yahoo-is-up-to-10-petabytes-now/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Facebook, Hadoop, and Hive</title>
		<link>http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/</link>
		<comments>http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/#comments</comments>
		<pubDate>Mon, 11 May 2009 08:29:08 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=775</guid>
		<description><![CDATA[A few weeks ago, I posted about a conversation I had with Jeff Hammerbacher of Cloudera, in which he discussed a Hadoop-based effort at Facebook he previously directed. Subsequently, Ashish Thusoo and Joydeep Sarma of Facebook contacted me to expand upon and, in a couple of instances, correct what Jeff had said.  They also [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">A few weeks ago, I posted about a conversation I had with Jeff Hammerbacher of Cloudera, in which he discussed a Hadoop-based effort at Facebook he previously directed. Subsequently, Ashish Thusoo and Joydeep Sarma of Facebook contacted me to expand upon and, in a couple of instances, correct what Jeff had said.  They also filled me in on Hive, a data-manipulation add-on to Hadoop that they developed and subsequently open-sourced.</p>
<p style="margin-bottom: 0in;">Updating the metrics in <a href="http://www.dbms2.com/2009/04/15/cloudera-presents-the-mapreduce-bull-case/" >my Cloudera post</a>,</p>
<ul>
<li>Facebook has 400 terabytes of disk managed by Hadoop/Hive, with a slightly better than 6:1 overall compression ratio. So the <strong>2 1/2 petabytes</strong> figure for user data is reasonable.</li>
<li>Facebook&#8217;s Hadoop/Hive system ingests <strong>15 terabytes of new data per day</strong> now, not 10.</li>
<li>Hadoop/Hive cycle times aren&#8217;t as fast as I thought I heard from Jeff.  Ad targeting queries are the most frequent, and they&#8217;re run <strong>hourly.</strong> Dashboards are repopulated <strong>daily.</strong></li>
</ul>
<p style="margin-bottom: 0in;">Nothing else in my Cloudera post was called out as being wrong.</p>
<p style="margin-bottom: 0in;">In a new-to-me metric, Facebook has <strong>610 Hadoop nodes, running in a single cluster,</strong> due to be increased to 1000 soon.  Facebook thinks this is the second-largest* Hadoop installation, or else close to it.  What&#8217;s more, Facebook believes it is unusual in spreading all its apps across a single huge cluster, rather than doing different kinds of work on different, smaller sub-clusters.<span id="more-775"></span></p>
<p style="margin-bottom: 0in;"><em>*Apparently, Yahoo is at 2000 nodes (and headed for 4000), 1000 or so of which are operated as a single cluster for a single app.</em></p>
<p style="margin-bottom: 0in;">Facebook decided in 2007 to move what was then a 15 terabyte big-DBMS-vendor data warehouse to Hadoop &#8212; augmented by Hive &#8212; rather than to an MPP data warehouse DBMS. Major drivers of the choice included:</p>
<ul>
<li><strong>License/maintenance costs.</strong> Free is a good price.</li>
<li><strong>Open source flexibility.</strong> Facebook is one of the few users I&#8217;ve ever spoken with that actually cares about modifying open source code.</li>
<li><strong>Ability to run on cheap hardware.</strong> Facebook runs real-time MySQL instances on boxes that cost $10K or so, and would expect to pay at least as much for an MPP DBMS node. But Hadoop nodes run on boxes that cost no more than $4K, and sometimes (depending e.g. on whether they have any disk at all) as little as $2K. These are &#8220;true&#8221; commodity boxes; they don&#8217;t even use RAID.</li>
<li><strong>Ability to scale out to lots of nodes.</strong> Few of the new low-cost MPP DBMS vendors have production systems even today of &gt;100 nodes.  (Actually, I&#8217;m not certain that any except Netezza do, although Kognitio in a prior release of its technology once built a 900ish node production system.)</li>
<li><strong>Inherently better performance.</strong> Correctly or otherwise, the Facebook guys thought that Hadoop had performance advantages over DBMS, due to the lack of overhead associated with transactions and so on.</li>
</ul>
<p style="margin-bottom: 0in;">One option Facebook didn&#8217;t seriously consider was sticking with the incumbent, which Facebook folks regarded as &#8220;horrible&#8221; and a &#8220;lost cause.&#8221; The daily pipeline took more than 24 hours to process. Although aware that its big-DBMS-vendor warehouse could probably be tuned much better, Facebook didn&#8217;t see that as a path to growing its warehouse more than 100-fold.  (But based on my discussion with Cloudera, I gather that vendor&#8217;s DBMS is indeed used to run some reporting today.)</p>
<p style="margin-bottom: 0in;"><strong>Reliability of Facebook&#8217;s Hadoop/Hive system seems to be so-so.</strong> It&#8217;s designed for a few nodes at a time to fail; that&#8217;s no biggie. There&#8217;s a head node that&#8217;s a single point of failure; while there&#8217;s a backup node, I gather failover takes 15 minutes or so, a figure the Facebook guys think they could reduce substantially if they put their minds to it.  But users submitting long-running queries don&#8217;t seem to mind delays of up to an hour, as long as they don&#8217;t have to resubmit their queries. Keeping ETL up is a higher priority than keeping query execution going. Data loss would indeed be intolerable, but at that level Hadoop/Hive seems to be quite trustworthy.</p>
<p style="margin-bottom: 0in;">There also are occasional longer partial(?) outages, when an upgrade introduces a bug or something, but those don&#8217;t seem to be a major concern.</p>
<p style="margin-bottom: 0in;">Facebook&#8217;s variability in node hardware raises an obvious question &#8212; <strong>how does Hadoop deal with heterogeneous hardware among its nodes?</strong> Apparently a <em>fair scheduling</em> capability has been built for Hadoop, with Facebook as the first major user and Yahoo apparently moving in that direction as well.  As for inputs to the scheduler (or any more primitive workload allocator) &#8212; well, that depends on the kind of heterogeneity.</p>
<ul>
<li>Disk heterogeneity &#8212; a distributed file system reports back about disk.</li>
<li>CPU heterogeneity &#8212; different nodes can be configured to run different numbers of concurrent tasks each.</li>
<li>RAM heterogeneity &#8212; Hadoop does not understand the memory requirements of each task, and does not do a good job of matching tasks to boxes accordingly. But the Hadoop community is working to fix this.</li>
</ul>
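<p style="margin-bottom: 0in;">The CPU case above can be illustrated with a minimal Python sketch. It is not Hadoop&#8217;s scheduler or any real Hadoop API; the function name <code>assign_tasks</code>, the node names, and the slot counts are all hypothetical. It only shows the idea that each node advertises a configured number of concurrent task slots, so a bigger box simply takes more tasks at once.</p>

```python
def assign_tasks(slots_per_node, tasks):
    """Assign each task to the node with the most free slots.

    slots_per_node maps a (hypothetical) node name to its configured
    number of concurrent task slots. Tasks that find no free slot
    are returned as pending.
    """
    free = dict(slots_per_node)
    placement = {node: [] for node in slots_per_node}
    pending = []
    for task in tasks:
        # Pick the node with the most free slots; ties go to the
        # node whose name sorts first.
        node = max(sorted(free), key=lambda name: free[name])
        if free[node] == 0:
            pending.append(task)
            continue
        placement[node].append(task)
        free[node] -= 1
    return placement, pending


if __name__ == "__main__":
    # Hypothetical cluster: one larger box with 4 slots, one smaller box with 2.
    slots = {"big": 4, "small": 2}
    tasks = ["t1", "t2", "t3", "t4", "t5", "t6", "t7"]
    placement, pending = assign_tasks(slots, tasks)
    print(placement)
    print(pending)
```

<p style="margin-bottom: 0in;">Note that nothing in this sketch looks at memory, which mirrors the RAM limitation described above: a slot count says how many tasks a node may run, not how much RAM each task needs.</p>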
<p style="margin-bottom: 0in; font-style: normal;"><strong>Further notes on Hive</strong></p>
<p style="margin-bottom: 0in; font-style: normal;">Without Hive, some basic Hadoop data manipulations can be a pain in the butt.  A GROUP BY or the equivalent could take &gt;100 lines of Java or Python code, and unless the person writing it knew something about database technology, it could use some pretty sub-optimal algorithms even then.  Enter Hive.</p>
<p style="margin-bottom: 0in; font-style: normal;">Hive sets out to fix this problem. Originally developed at Facebook (in Java, like Hadoop is), Hive was open-sourced last summer, by which time its SQL interface was in place, and now has 6 main developers. The essence of Hive seems to be:</p>
<ul>
<li>An interface that implements a subset of SQL.</li>
<li>Compilation of that SQL into a MapReduce configuration file.</li>
<li>An execution engine to run same.</li>
</ul>
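<p style="margin-bottom: 0in; font-style: normal;">To make the GROUP BY point concrete, here is a minimal single-process Python sketch of the map, shuffle, and reduce steps behind a simple grouped count. It is not Hive&#8217;s generated plan and it uses no Hadoop API; the function names and the sample log rows are hypothetical, chosen only to show the shape of the work that a one-line SQL statement hides.</p>

```python
from collections import defaultdict

# Hypothetical log rows: (user_id, page). Not real data.
LOG_ROWS = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/photos"),
    ("carol", "/home"),
    ("alice", "/home"),
]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed on the GROUP BY column."""
    for user_id, _page in rows:
        yield user_id, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's values into a single aggregate (here, a count)."""
    return {key: sum(values) for key, values in grouped.items()}


def group_by_count(rows):
    """Rough equivalent of:
    SELECT user_id, COUNT(*) FROM log GROUP BY user_id
    """
    return reduce_phase(shuffle(map_phase(rows)))


if __name__ == "__main__":
    print(group_by_count(LOG_ROWS))
```

<p style="margin-bottom: 0in; font-style: normal;">Even this toy version needs three functions for one aggregate. A real job would add input parsing, serialization, and job configuration on top.</p>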
<p style="margin-bottom: 0in; font-style: normal;">The SQL implemented so far seems, unsurprisingly, to be what is most needed to analyze Facebook&#8217;s log files.  I.e., it&#8217;s some basic stuff, plus some timestamp functionality.  There also is an extensibility framework, and some ELT functionality.</p>
<p style="margin-bottom: 0in; font-style: normal;">Known users of Hive include Facebook (definitely in production) and hi5 (apparently in production as well). Also, there&#8217;s a Hive code committer from Last.fm.</p>
<p style="margin-bottom: 0in;"><em><strong>Other links about huge data warehouses:</strong></em></p>
<ul>
<li><a href="http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/" >eBay</a> has a 6 1/2 petabyte database running on Greenplum and a 2 1/2 petabyte enterprise data warehouse running on Teradata.</li>
<li>Wal-Mart, Bank of America, another financial services company, and Dell also have <a href="../2008/10/15/teradatas-petabyte-power-players/">very large Teradata databases</a>.</li>
<li>Yahoo’s web/network events database, running on proprietary software, sounded about <a href="../2008/05/29/yahoo-scales-web-analytics-database-petabyte/">1/6th the size of eBay’s Greenplum system</a> when it was described about a year ago.</li>
<li>Fox Interactive Media/MySpace has multi-hundred terabyte databases running on each of <a href="../2009/03/05/fox-interactive-medias-multi-hundred-terabyte-database-running-on-greenplum/">Greenplum</a> and Aster Data <a href="../2009/03/05/myspaces-multi-hundred-terabyte-database-running-on-aster-data/">nCluster</a>.</li>
<li><a href="../2008/05/23/data-warehouse-appliance-power-user-teoco/">TEOCO has 100s of terabytes running on DATAllegro</a>.</li>
<li>To a probably lesser extent, the same is now also true of <a href="../2009/03/02/closing-the-book-on-the-datallegro-customer-base/">Dell</a>.</li>
<li><a href="../2009/04/25/vertica-pricing-and-customer-metrics/">Vertica has a couple of unnamed customers with databases in the 200 terabyte range</a>.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/feed/</wfw:commentRss>
		<slash:comments>43</slash:comments>
		</item>
		<item>
		<title>Some of Oracle&#8217;s largest data warehouses</title>
		<link>http://www.dbms2.com/2008/09/24/some-of-oracles-largest-data-warehouses/</link>
		<comments>http://www.dbms2.com/2008/09/24/some-of-oracles-largest-data-warehouses/#comments</comments>
		<pubDate>Thu, 25 Sep 2008 00:21:38 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=570</guid>
		<description><![CDATA[Googling around, I came across an Oracle presentation – given some time this year – that lists some of Oracle&#8217;s largest data warehouses. 10 databases total are listed with &#62;16 TB, which is fairly consistent with Larry Ellison&#8217;s confession during the Exadata announcement that Oracle has trouble over 10 TB (which is something I&#8217;ve gotten [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Googling around, I came across <a href="http://www.oracle.com/global/kr/download/seminar/2008/ilm/session1.pdf" onclick="javascript:pageTracker._trackPageview('/www.oracle.com');">an Oracle presentation</a> – given some time this year – that lists some of Oracle&#8217;s largest data warehouses. 10 databases total are listed with &gt;16 TB, which is fairly consistent with Larry Ellison&#8217;s confession during the <a href="http://www.dbms2.com/2008/09/24/oracle-exadata/" >Exadata</a> announcement that Oracle has trouble over 10 TB (which is something I&#8217;ve gotten a lot of flak from a few Oracle partisans for pointing out &#8230; <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_biggrin.gif' alt=':D' class='wp-smiley' />  ).</p>
<p style="margin-bottom: 0in;">However, what&#8217;s being measured is probably not the same in all cases.  For example,  I think the Amazon 70 TB figure is obviously for spinning disk (elsewhere in the presentation it&#8217;s stated that Amazon has 71 TB of disk). But the 16 TB British Telecom figure probably is user data &#8212; indeed, it&#8217;s the same figure <a href="http://www.odshp.com/iqug/papers/SunSybasePRAR11-30-01.doc" onclick="javascript:pageTracker._trackPageview('/www.odshp.com');">Computergram</a> cited for BT user data way back in 2001.</p>
<p style="margin-bottom: 0in;">The list is:<span id="more-570"></span></p>
<ul>
<li>Acxiom 16 TB HP</li>
<li>Allstate 20 TB Sun (RAC)</li>
<li>Amazon 70 TB HP (RAC)</li>
<li>AT&amp;T 60 TB HP</li>
<li>British Telecom 16 TB HP</li>
<li>Cellcom 12 TB HP</li>
<li>Choicepoint 14 TB Sun</li>
<li>Cingular/AT&amp;T 25 TB HP</li>
<li>Claria 38 TB Sun</li>
<li>Colgate-Palm 10 TB IBM</li>
<li>Experian 14 TB Sun</li>
<li>France Telecom 36 TB HP</li>
<li>JPMC 40 TB IBM (RAC)</li>
<li>KTF 14 TB HP</li>
<li>Mastercard 40 TB IBM (RAC)</li>
<li>NASDAQ 35 TB Sun</li>
<li>NexTel 28 TB HP</li>
<li>NYSE Euronext 93 TB HP (RAC)</li>
<li>Reliance Ltd 13 TB Sun</li>
<li>Starwood 12 TB HP</li>
<li>Sprint/Nextel 110 TB HP</li>
<li>TIM (Italy) 12 TB HP (RAC)</li>
<li>Turkcell 14 TB Sun (RAC)</li>
<li>UBS AG 15 TB Sun</li>
<li>UPS 10 TB HP</li>
<li>Yahoo! 250 TB Fujitsu</li>
</ul>
<p style="margin-bottom: 0in;">I happen to have been on the phone with Phil Francisco of Netezza tonight, and he confirmed that Netezza has larger installations (user data) than the figures cited above at several of those customers, including Acxiom and NYSE Euronext.  However, Phil confesses that he might have trouble getting up to 10 users at &gt; 15 TB of data if &#8212; which I think would be the fairest comparison &#8212; he had to restrict himself to only those who have given Netezza permission to publicize their names.*</p>
<p style="margin-bottom: 0in;"><em>*Phil emphatically says Netezza has more than that overall.  But the customers one is allowed to name, let alone disclose database sizes for, are only a fraction of the overall total.</em></p>
<p style="margin-bottom: 0in;">Meanwhile, I suspect that Reliance might be what turned into one of Greenplum&#8217;s flagship accounts.  And despite its ongoing Oracle relationship, Yahoo has <a href="http://www.dbms2.com/2008/05/29/yahoo-scales-web-analytics-database-petabyte/" >a much bigger data warehouse</a> based on Postgres technology.</p>
<p style="margin-bottom: 0in;">As usual, the preponderance of telecom customers is striking.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2008/09/24/some-of-oracles-largest-data-warehouses/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Yahoo scales its web analytics database to petabyte range</title>
		<link>http://www.dbms2.com/2008/05/29/yahoo-scales-web-analytics-database-petabyte/</link>
		<comments>http://www.dbms2.com/2008/05/29/yahoo-scales-web-analytics-database-petabyte/#comments</comments>
		<pubDate>Thu, 29 May 2008 21:46:04 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Theory and architecture]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=433</guid>
		<description><![CDATA[Information Week has an article with details on what sounds like Yahoo&#8217;s core web analytics database.  Highlights include:

The Yahoo web analytics database is over 1 petabyte.  They claim it will be in the 10s of petabytes by 2009.
The Yahoo web analytics database is based on PostgreSQL. So much for MySQL fanboys&#8217; claims of [...]]]></description>
			<content:encoded><![CDATA[<p><em><a href="http://www.informationweek.com/news/software/database/showArticle.jhtml?articleID=207801436" onclick="javascript:pageTracker._trackPageview('/www.informationweek.com');">Information Week</a></em> has an article with details on what sounds like Yahoo&#8217;s core web analytics database.  Highlights include:</p>
<ul>
<li><strong>The Yahoo web analytics database is over 1 petabyte. </strong> They claim it will be in the 10s of petabytes by 2009.</li>
<li><strong>The Yahoo web analytics database is based on PostgreSQL.</strong> So much for MySQL fanboys&#8217; claims of Yahoo validation for their beloved toy &#8230; uh, let me rephrase that.  The highly-regarded MySQL, although doing a great job for some demanding and impressive applications at Yahoo, evidently wasn&#8217;t selected for this one in particular.  OK.  That&#8217;s much better now.</li>
<li><strong>But the Yahoo web analytics database doesn&#8217;t actually use PostgreSQL&#8217;s storage engine.</strong> Rather, Yahoo wrote something custom and columnar.</li>
<li><strong>Yahoo is processing 24 billion &#8220;events&#8221; per day. </strong> The article doesn&#8217;t clarify whether these are sent straight to the analytics store, or whether there&#8217;s an intermediate storage engine.  Most likely the system fills blocks in RAM and then just appends them to the single persistent store.   If commodity boxes occasionally crash and lose a few megs of data &#8212; well, in this application, that&#8217;s not a big deal at all.</li>
<li><strong>Yahoo thinks commercial column stores aren&#8217;t ready yet for more than 100 terabytes of data.</strong></li>
<li><strong>Yahoo says it got great performance advantages from a custom system by optimizing for its specific application.</strong> I don&#8217;t know exactly what that would be, but I do know that database architectures for high-volume web analytics are still in pretty bad shape. In particular, there&#8217;s no good way yet to analyze the specific, variable-length paths users take through websites.</li>
</ul>
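<p>As a rough illustration of the two numeric and architectural points above, here is a minimal Python sketch. The first function just turns the quoted 24 billion events per day into an average per-second rate. The <code>BlockAppender</code> class illustrates the guess about filling blocks in RAM and appending them to a single store; it is purely hypothetical, says nothing about how Yahoo&#8217;s system is actually built, and its names and block size are made up.</p>

```python
import io

EVENTS_PER_DAY = 24_000_000_000  # figure quoted in the article
SECONDS_PER_DAY = 24 * 60 * 60


def average_events_per_second(events_per_day=EVENTS_PER_DAY):
    """Average ingest rate implied by a daily event count."""
    return events_per_day / SECONDS_PER_DAY


class BlockAppender:
    """Buffer events in RAM and append them to a store one block at a time."""

    def __init__(self, store, block_size):
        self.store = store            # any binary file-like object opened for writing
        self.block_size = block_size  # hypothetical tuning knob, in bytes
        self.buffer = bytearray()

    def add(self, event):
        """Buffer one event; flush once the buffer reaches the block size."""
        self.buffer += event
        if len(self.buffer) >= self.block_size:
            self.flush()

    def flush(self):
        """Append the buffered block to the store and clear the buffer."""
        if self.buffer:
            self.store.write(bytes(self.buffer))
            self.buffer.clear()


if __name__ == "__main__":
    print(round(average_events_per_second()))
    store = io.BytesIO()  # stands in for the persistent store
    appender = BlockAppender(store, block_size=8)
    for event in (b"abcd", b"efgh", b"ij"):
        appender.add(event)
    # b"ij" is still only in RAM here; a crash now would lose it.
    print(store.getvalue())
```

<p>The trade-off is the one described above: whatever sits in the buffer when a box crashes is gone, which is acceptable only if losing a few megabytes of events does not matter for the application.</p>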
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2008/05/29/yahoo-scales-web-analytics-database-petabyte/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
	</channel>
</rss>
