<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS2 -- DataBase Management System Services &#187; eBay</title>
	<atom:link href="http://www.dbms2.com/category/users/ebay/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Fri, 30 Jul 2010 15:51:32 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Cloudera Enterprise and Hadoop evolution</title>
		<link>http://www.dbms2.com/2010/06/30/cloudera-enterprise-hadoop-evolution/</link>
		<comments>http://www.dbms2.com/2010/06/30/cloudera-enterprise-hadoop-evolution/#comments</comments>
		<pubDate>Wed, 30 Jun 2010 17:22:27 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Data integration and middleware]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Market share]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Pricing]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2440</guid>
		<description><![CDATA[I talked with Cloudera a couple of weeks ago in connection with the impending release of Cloudera Enterprise. I&#8217;d say:  

If you are or want to be a serious 	MapReduce user – and you&#8217;re past the “play around over the 	weekend” stage &#8212; you probably should have either:

A serious non-DBMS MapReduce 	distribution.
MapReduce integrated into your [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">I talked with Cloudera a couple of weeks ago in connection with the impending release of Cloudera Enterprise. I&#8217;d say:  <span id="more-2440"></span></p>
<ul>
<li>If you are or want to be a serious 	MapReduce user – and you&#8217;re past the “play around over the 	weekend” stage &#8212; you probably should have either:
<ul>
<li>A serious non-DBMS MapReduce 	distribution.</li>
<li>MapReduce integrated into your 	analytic DBMS.</li>
<li>Both.</li>
</ul>
</li>
<li>The obvious choice for non-DBMS 	MapReduce is Hadoop.</li>
<li>The obvious choice for a Hadoop 	distribution is <strong>Cloudera Enterprise.</strong></li>
<li>Cloudera Enterprise has three main 	aspects, in an inseparable bundle:
<ul>
<li>Distributions for a double-digit 	number of open source projects. It&#8217;s nice having all that in one 	package – unless, of course, you like playing with Tinkertoys.</li>
<li>Proprietary Cloudera code.</li>
<li>Cloudera support.</li>
</ul>
</li>
<li>Cloudera says its proprietary code is, and is planned to remain, concentrated – at least in large part &#8212; on integrating open source technology with closed source products. This has the virtue of being targeted directly at that segment of the market which has proven it&#8217;s actually willing to pay money for software.</li>
<li>Cloudera Enterprise areas of 	focus, now and in the presumed future, include:
<ul>
<li><strong>Core Hadoop engine,</strong> which 	Cloudera says is quite predictably and appropriately evolving more 	slowly than the tools around it.</li>
</ul>
<ul>
<li><strong>Development, management and 	administrative tools,</strong> including:
<ul>
<li><strong>Pig</strong> and <strong>Hive</strong>. Cloudera says &gt;70% 	of Facebook Hadoop jobs are initiated through Hive, and the same is 	true of Yahoo and Pig.</li>
<li>Connectivity to commercial tools.</li>
<li>The product formerly known as 	“Cloudera Desktop.”</li>
</ul>
</li>
<li><strong>Workflow</strong>, which in this context 	refers to letting you create a Hadoop application as a sequence of 	small steps, rather than forcing you to kluge it into being one 	unwieldy thing. At the moment, this is much less widely adopted than 	Pig and Hive, but Cloudera has high hopes for it, because of its 	obvious benefits in modularity and manageability.</li>
<li><strong>Quasi-DBMS technology.</strong> Besides Hive and Pig, this includes <strong>HBase.</strong> Cloudera says there has 	been considerable demand for HBase, and it is pleased that project 	is now mature enough to ship. Cloudera stresses that it intends 	HBase not for OLTP, but as an adjunct to analytic processing. E.g., 	Cloudera suggests HBase would be a fine vehicle for replicating 	dimension tables across each node of a cluster.</li>
<li><strong>Data connectivity, </strong><span style="font-weight: normal;">e.g. 	to MySQL or to sensor log files.</span></li>
</ul>
</li>
<li>Cloudera Enterprise pricing is 	well below DBMS prices – not by a full order of magnitude, if I&#8217;m 	right about everybody&#8217;s quantity discount policies, but even so by a 	lot. Details are NDA.</li>
</ul>
<p style="margin-bottom: 0in;">Cloudera sometimes sends confusing signals about its beliefs and strategies. For example, one can get different stories depending on whether one talks to:</p>
<ul>
<li>Somebody at Cloudera who comes 	primarily from the user and open source communities.</li>
<li>Somebody at Cloudera who has 	actually worked at a software company before.</li>
</ul>
<p style="margin-bottom: 0in;">But I predict that Cloudera will now stick for a while with more or less the strategy outlined above.</p>
<p style="margin-bottom: 0in;">Naturally, we also talked about Hadoop adoption. Highlights of that part – no doubt somewhat biased towards Cloudera&#8217;s own customer base &#8212; included:</p>
<ul>
<li>Notwithstanding <a href="http://www.dbms2.com/2009/04/14/ebay-thinks-mpp-dbms-clobber-mapreduce/" >eBay&#8217;s prior 	skepticism about MapReduce</a>, it is quoted saying nice things in a Cloudera press release, 	and has apparently become quite a large Hadoop user, starting out 	with a search-quality use case.</li>
<li>Typical Hadoop deployment sizes 	are 10 nodes or so when experimenting, 80-500+ in production.</li>
<li>10 terabytes/node – I&#8217;m pretty 	sure Cloudera meant of user data &#8212; is not inconceivable, so a 	cost-conscious 500-node user could have 5 petabytes of data managed 	by Hadoop.</li>
<li>Cloudera has half a dozen 	customers at the 75+ node production level.</li>
<li>Web and financial services are the 	two vertical markets moving most aggressively into Hadoop 	production. The government is also in significant Hadoop production, 	but the details of that are classified.</li>
<li>Web uses for Hadoop include:
<ul>
<li>Clickstream – sessionization, 	etc. – that&#8217;s a super-mainstream use.</li>
<li>Search – analyzing search 	attempts in conjunction with structured data.</li>
<li>Machine learning (for ad serving, 	etc.).</li>
</ul>
</li>
<li>Financial services uses for Hadoop 	include:
<ul>
<li>Internal trading rule 	enforcement/fraud detection.</li>
<li>Complex ETL.</li>
<li>Portfolio risk assessment 	(typically overnight).</li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;">None of this is inconsistent with previous surveys of <a href="http://www.dbms2.com/2009/10/10/enterprises-using-hadoo/" >Hadoop use cases</a>.</p>
<p style="margin-bottom: 0in; font-style: normal;">Various users talked at the Hadoop Summit this week. I wasn&#8217;t there, and won&#8217;t write about their stories for now. That said, <a href="http://www.slideshare.net/kevinweil/hadoop-at-twitter-hadoop-summit-2010" onclick="javascript:pageTracker._trackPageview('/www.slideshare.net');">Twitter&#8217;s slide deck</a> from same has some interesting stuff, including:</p>
<ul>
<li><span style="font-style: normal;">7 	TB/day ETLed from MySQL.</span></li>
<li><span style="font-style: normal;">Accordingly, petabytes of stored data are coming soon.</span></li>
<li><span style="font-style: normal;">Open 	sourcing their ETL tool Crane.</span></li>
<li><span style="font-style: normal;">3-4X 	LZO compression at little CPU cost.</span></li>
<li><span style="font-style: normal;">HBase is more usable for them than HDFS, which isn&#8217;t mutable enough.</span></li>
<li><span style="font-style: normal;">Pig takes 5% of the code and coding effort of vanilla Hadoop, at a performance hit of 30% or less.</span></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/06/30/cloudera-enterprise-hadoop-evolution/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Notes on SciDB and scientific data management</title>
		<link>http://www.dbms2.com/2010/05/22/scidb-and-scientific-database-management/</link>
		<comments>http://www.dbms2.com/2010/05/22/scidb-and-scientific-database-management/#comments</comments>
		<pubDate>Sat, 22 May 2010 08:04:24 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[GIS and geospatial]]></category>
		<category><![CDATA[Microsoft and SQL*Server]]></category>
		<category><![CDATA[SciDB]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2178</guid>
		<description><![CDATA[I firmly believe that, as a community, we should look for ways to support scientific data management and related analytics. That&#8217;s why, for example, I went to XLDB3 in Lyon, France at my own expense. Eight months ago, I wrote about issues in scientific data management. Here&#8217;s some of what has transpired since then.
The main [...]]]></description>
			<content:encoded><![CDATA[<p>I firmly believe that, as a community, we should look for ways to support scientific data management and related analytics. That&#8217;s why, for example, I went to XLDB3 in Lyon, France at my own expense. Eight months ago, I wrote about <a href="http://www.dbms2.com/2009/10/03/issues-in-scientific-data-management/" >issues in scientific data management</a>. Here&#8217;s some of what has transpired since then.</p>
<p>The main new activity I know of has been in the open source <a href="http://www.scidb.org/" onclick="javascript:pageTracker._trackPageview('/www.scidb.org');">SciDB</a> project.   <span id="more-2178"></span></p>
<ul>
<li>A company called Zetics has been started to commercialize SciDB. As of now, the entire staff seems to be CEO Marilyn Matz, techie Paul Brown, and part of Mike Stonebraker. Marilyn says Zetics has some venture capital, but even under NDA didn&#8217;t tell me who it was from. Zetics does not have its own web site.</li>
<li>Marilyn tells me there are 20-25 contributors to SciDB, led by Paul Brown and Mike Stonebraker. Brown is full-time. Persistent Systems has been donating the efforts of a few of its employees. Some <a href="http://www.lsst.org/lsst" onclick="javascript:pageTracker._trackPageview('/www.lsst.org');">LSST</a> folks have been doing SciDB work backed by grant money. Most or all of the rest seem to be purer volunteers. Some Russians have been particularly active.</li>
<li>Release 0.5 of SciDB is expected in June. Release 1.0 is expected in September. This is a rewrite; prior demo code has been scrapped. Perhaps not coincidentally, it&#8217;s also a small slip from prior project plans.</li>
<li>The array data model is an example of what&#8217;s being implemented first. (Duh &#8212; you can&#8217;t have a DBMS without a data model.) Support for uncertainty is an example of what&#8217;s been deferred until later.</li>
<li>As has been clear since XLDB3 last August, one major target market for SciDB is genomic research.</li>
<li>It&#8217;s obvious that the oil and gas industry, with all its geospatial data, should be interested in SciDB. But there&#8217;s not much activity in that regard; outreach is evidently needed. If you can think of somebody in that sector (or anywhere else) who should be alerted to SciDB, please ping them.</li>
<li>Interest from web analytics users in SciDB seems to have receded a bit from the days when eBay almost funded the project.</li>
</ul>
<p>In other scientific data management news,</p>
<ul>
<li>Microsoft put out a book called <a href="http://research.microsoft.com/en-us/collaboration/fourthparadigm/" onclick="javascript:pageTracker._trackPageview('/research.microsoft.com');">The Fourth Paradigm</a> on scientific database management. The whole thing can be downloaded, very officially, as a giant PDF. I think it&#8217;s worth skimming. I don&#8217;t think it&#8217;s worth actually reading. (I did read it.)</li>
<li><a href="http://www-conf.slac.stanford.edu/xldb/" onclick="javascript:pageTracker._trackPageview('/www-conf.slac.stanford.edu');">XLDB4</a> will be at Stanford October 5-7. Unlike prior XLDBs, it will have an open (i.e., no invitation required) part.</li>
</ul>
<p>Finally, you surely are aware of the whole &#8220;Climategate&#8221; mess, in which major climate researchers&#8217; email was hacked and many unkind conclusions were drawn. Well, one of the most technical parts of the disclosure was in a long series of Read Me files, in which an unfortunate programmer lamented <a href="http://di2.nu/foia/HARRY_READ_ME-20.html" onclick="javascript:pageTracker._trackPageview('/di2.nu');">the difficulty of reconstructing published results from files at hand</a>. These turned out to illustrate a classic problem that SciDB or alternatives are meant to solve:</p>
<ul>
<li>Raw data was impossible to use without various adjustments to regularize it (the word &#8220;regridding&#8221; comes up a lot, for example). Massaging was needed before analytics could be done on it.</li>
<li>The raw data was thrown out or lost, and could not be reconstructed (why they couldn&#8217;t have asked the suppliers of the data to give it to them again was unclear in this case, since it wasn&#8217;t original experimental data).</li>
<li>It was thus impossible to massage the data in any new or improved way.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/05/22/scidb-and-scientific-database-management/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Teradata&#8217;s nebulous cloud strategy</title>
		<link>http://www.dbms2.com/2009/10/27/teradatas-nebulous-cloud-strategy/</link>
		<comments>http://www.dbms2.com/2009/10/27/teradatas-nebulous-cloud-strategy/#comments</comments>
		<pubDate>Tue, 27 Oct 2009 19:41:47 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Data integration and middleware]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[Theory and architecture]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1180</guid>
		<description><![CDATA[As the pun goes, Teradata&#8217;s cloud strategy is – well, it&#8217;s somewhat nebulous. More precisely, for the foreseeable future, Teradata&#8217;s cloud strategy is a collection of rather disjointed parts, including:

What Teradata calls the Teradata 	 Agile Analytics Cloud, which is a combination of previously 	existing technology plus one new portlet called the Teradata 	Elastic Mart(s) [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">As the pun goes, Teradata&#8217;s cloud strategy is – well, it&#8217;s somewhat nebulous. More precisely, for the foreseeable future, Teradata&#8217;s cloud strategy is a collection of rather disjointed parts, including:</p>
<ul>
<li>What Teradata calls the <em>Teradata Agile Analytics Cloud, </em>which is a combination of previously existing technology plus one new portlet called the <em>Teradata Elastic Mart(s) Builder.</em> (Teradata&#8217;s <em>Elastic Mart(s) Builder Viewpoint</em> portlet is available for download from <a href="../2009/05/26/teradata-developer-exchange-devx-begins-to-emerge/">Teradata&#8217;s Developer Exchange</a>.)</li>
<li><em>Teradata Data Mover 2.0,</em> coming “Soon”, which will ease copying (ETL without any 	significant “T”) from one Teradata system to another.</li>
<li><em>Teradata Express</em> DBMS 	crippleware (1 terabyte only, no production use), now available on 	Amazon EC2 and VMware. (I don&#8217;t see where this has much connection to the rest of Teradata&#8217;s cloud strategy, except insofar as it serves to fill out a slide.)</li>
<li>Unannounced (and so far as I can 	tell largely undesigned) future products.</li>
</ul>
<p style="margin-bottom: 0in;">Teradata openly admits that its direction is heavily influenced by Oliver Ratzesberger at <a href="../2009/04/30/ebays-two-enormous-data-warehouses/">eBay</a>. Like Teradata, Oliver and eBay favor virtual data marts over physical ones. That is, Oliver and eBay believe that the ideal scenario is that every piece of data is only stored once, in an integrated Teradata warehouse. But eBay believes and Teradata increasingly agrees that users need a great deal of control over their use of this data, including the ability to import additional data into private sandboxes, and join it to the warehouse data already there.<span id="more-1180"></span></p>
<p style="margin-bottom: 0in;">The <em>Teradata Elastic Mart(s) Builder Viewpoint</em> portlet automates the inclusion of outside data. If you&#8217;re already an authorized Teradata data warehouse user, you can fill in a very short form (three or so fields) and add authorization to import outside data, e.g. from a .CSV file. No fuss, little bother. Trivial as that sounds, when you combine it with Teradata&#8217;s pre-existing robust workload management tools, it creates a pretty good <em>virtual data mart</em> story.</p>
<p style="margin-bottom: 0in;">Spinning out and maintaining consistency with physical data marts is a different matter. Teradata doesn&#8217;t seem too sure it believes in those. And while Teradata is obviously planning to increase its capability in that regard anyway, I didn&#8217;t get a lot of detail beyond the reference to Data Mover 2.0.</p>
<p style="margin-bottom: 0in;"><em><strong>Related links</strong></em></p>
<ul>
<li>My Greenplum-inspired post on <a href="../2009/06/08/the-future-of-data-marts/">the 	future of data marts</a>, outlining issues in “private cloud” 	data warehousing.</li>
<li>eBay&#8217;s “<a href="http://www.xlmpp.com/articles/16-articles/39-analytics-as-a-service" onclick="javascript:pageTracker._trackPageview('/www.xlmpp.com');">Analytics 	as a Service</a>” pitch (about 1 ½ years old)</li>
<li><a href="http://developer.teradata.com/database/articles/what-is-the-teradata-agile-analytics-cloud" onclick="javascript:pageTracker._trackPageview('/developer.teradata.com');">A 	post by Teradata&#8217;s Dan Graham</a> explaining the <em>Teradata Agile 	Analytics Cloud</em><span style="font-style: normal;"> and </span><em>Elastic 	Mart(s) Builder Viewpoint</em> portlet</li>
<li>Home page and complete screen shot 	for the <a href="http://developer.teradata.com/download/viewpoint/elastic-marts-builder" onclick="javascript:pageTracker._trackPageview('/developer.teradata.com');"><em>Teradata 	Elastic Mart(s) Builder Viewpoint</em> portlet</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/27/teradatas-nebulous-cloud-strategy/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Introduction to the XLDB and SciDB projects</title>
		<link>http://www.dbms2.com/2009/09/12/xldb-scid/</link>
		<comments>http://www.dbms2.com/2009/09/12/xldb-scid/#comments</comments>
		<pubDate>Sat, 12 Sep 2009 19:54:51 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[Michael Stonebraker]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Theory and architecture]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=883</guid>
		<description><![CDATA[Before I write anything else about the overlapping efforts known as XLDB and SciDB, I probably should explain and disambiguate what they are as best I can.  XLDB was organized and still is run by guys who want to solve a scientific problem in eXtremely Large DataBase Management, most especially Jacek Becla of SLAC [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Before I write anything else about the overlapping efforts known as <em>XLDB</em> and <em>SciDB</em>, I probably should explain and disambiguate what they are as best I can.  XLDB was organized and still is run by guys who want to solve a scientific problem in eXtremely Large DataBase Management, most especially Jacek Becla of SLAC (the organization previously known as Stanford Linear Accelerator Center). Becla&#8217;s original motivation was that he needs a DBMS to manage what will be 55 petabytes of raw image data and 100 petabytes of astronomical data total for <a href="http://www.lsst.org/lsst" onclick="javascript:pageTracker._trackPageview('/www.lsst.org');">LSST</a> (Large Synoptic Survey Telescope).<span id="more-883"></span></p>
<p style="margin-bottom: 0in;">XLDB more or less comprises:</p>
<ul>
<li>A series of what have now been three workshops: <span style="font-style: normal;"><a href="http://www-conf.slac.stanford.edu/xldb07/" onclick="javascript:pageTracker._trackPageview('/www-conf.slac.stanford.edu');">XLDB1 in 2007</a>, <a href="http://www-conf.slac.stanford.edu/xldb08/" onclick="javascript:pageTracker._trackPageview('/www-conf.slac.stanford.edu');">XLDB2 in 2008</a>, and <a href="http://www-conf.slac.stanford.edu/xldb09/default.htm" onclick="javascript:pageTracker._trackPageview('/www-conf.slac.stanford.edu');">XLDB3 in 2009</a></span> (the closest thing to a master link is probably the <a href="http://www-conf.slac.stanford.edu/xldb09/links.htm" onclick="javascript:pageTracker._trackPageview('/www-conf.slac.stanford.edu');">XLDB3 site&#8217;s related link page</a>). Participants have included, among others:
<ul>
<li>A lot of big-name 	database-oriented computer science researchers &#8212; Mike Stonebraker, 	Dave DeWitt, Martin Kersten, and numerous others</li>
<li>Academics responsible for 	scientific database management, especially but not only in the 	astronomy area</li>
<li>Some vendors (although vendor 	participation was cut back after XLDB1) &#8212; at XLDB3, which is the 	one I went to, the three vendor folks who actually talked were 	Stephen Brobst of Teradata, Luke Lonergan of Greenplum (who worked 	in scientific high performance computing earlier in his career), and 	Jeff Hammerbacher of Cloudera.</li>
<li>eBay and to some extent other 	large web companies</li>
<li>A European Union funding 	bureaucrat</li>
<li>Me</li>
</ul>
</li>
<li>An attempt to kick start a broader 	movement, perhaps comprising (it&#8217;s not totally clear yet):
<ul>
<li>Computer science researchers 	interested in database issues</li>
<li>Database technology vendors</li>
<li>Scientific researchers (academic) 	who have very large or otherwise difficult database management 	problems</li>
<li>Scientific researchers 	(commercial) who have very large or otherwise difficult database 	management problems</li>
<li>Other commercial users who have 	very large database management problems</li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;">The first result or spin-out from the XLDB effort seems to have been the <a href="http://www.scidb.org/" onclick="javascript:pageTracker._trackPageview('/www.scidb.org');">SciDB</a> project. This is an effort to build an open source DBMS called SciDB that will address <strong>some</strong> of the needs the XLDB effort is uncovering. (More on that in other posts.) Somewhat confusingly, <strong>all</strong> the use cases the XLDB group is collecting are currently being posted on SciDB&#8217;s website, apparently because it&#8217;s glitzier and healthier than, say, the excessively sparse XLDB wiki. Some SciDB development has happened, but no large sugar daddy has yet been found. (It&#8217;s a fairly open secret that eBay looked seriously and favorably at funding SciDB before the economic downturn hit.)</p>
<p style="margin-bottom: 0in;">Numerous big-name computer scientists are associated with SciDB, indeed more closely (it would seem) than with XLDB. That said, I&#8217;m guessing Dave DeWitt&#8217;s involvement in the open-source SciDB isn&#8217;t what it would be if he hadn&#8217;t gone to Microsoft. DeWitt actually skipped XLDB3, although he was in town for VLDB. (XLDB3 was back-to-back with VLDB 2009 in Lyon, France in late August.) Stonebraker just didn&#8217;t make the flight for either conference, due to the double-knee &#8220;upgrade&#8221; he had back in March.</p>
<p style="margin-bottom: 0in;">There&#8217;s a lot more to be said about the cross-discipline or science-specific requirements that researchers place on data management, but I&#8217;ll leave that for later and just get this posted as a start &#8212; assuming, of course, that <a href="http://www.dbms2.com/2009/09/12/availability-nightmares-continue/" >blog outages</a> permit. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_sad.gif' alt=':(' class='wp-smiley' /> </p>
<p style="margin-bottom: 0in;"><em><strong>Related links</strong></em></p>
<ul>
<li><a href="http://www-db.cs.wisc.edu/cidr/cidr2009/Paper_26.pdf" onclick="javascript:pageTracker._trackPageview('/www-db.cs.wisc.edu');">Paper 	laying out the SciDB project</a></li>
<li><a href="http://database.cs.brown.edu/projects/scidb/" onclick="javascript:pageTracker._trackPageview('/database.cs.brown.edu');">One 	version of a SciDB overview page</a>, with links to academic papers</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/09/12/xldb-scid/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>The future of data marts</title>
		<link>http://www.dbms2.com/2009/06/08/the-future-of-data-marts/</link>
		<comments>http://www.dbms2.com/2009/06/08/the-future-of-data-marts/#comments</comments>
		<pubDate>Mon, 08 Jun 2009 08:25:07 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[DATAllegro]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Microsoft and SQL*Server]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=805</guid>
		<description><![CDATA[Greenplum is announcing today a long-term vision, under the name Enterprise Data Cloud (EDC). Key observations around the concept &#8212; mixing mine and Greenplum&#8217;s together &#8212; include:

Data marts aren&#8217;t just for 	performance (or price/performance). They also exist to give 	individual analysts or small teams control of their analytic 	destiny.
Thus, it would be really cool if [...]]]></description>
			<content:encoded><![CDATA[<p>Greenplum is announcing today a long-term vision, under the name <em>Enterprise Data Cloud (EDC). </em><span style="font-style: normal;">Key observations around the concept &#8212; mixing mine and Greenplum&#8217;s together &#8212; include:</span></p>
<ul>
<li><strong>Data marts aren&#8217;t just for 	performance</strong> (or price/performance). They also exist to give 	individual analysts or small teams control of their analytic 	destiny.</li>
<li>Thus, it would be really cool if 	business users could have their own <strong>analytic &#8220;sandboxes&#8221;</strong> &#8212; virtual or physical analytic databases that they can manipulate 	without breaking anything else.</li>
<li>In any case, business users want 	to analyze data when they want to analyze it. <strong>It is often unwise 	to ask business users to postpone analysis</strong> until after an 	enterprise data model can be extended to fully incorporate the new 	data they want to look at.</li>
<li>Whether or not you agree with 	that, it&#8217;s an empirical fact that enterprises have many <strong>legacy 	data marts</strong> (or even, especially due to M&amp;A, multiple legacy 	data warehouses).  Similarly, it&#8217;s an empirical fact that many 	business users have the clout to order up <strong>new data marts</strong> as 	well.</li>
<li><strong>Consolidating</strong> data marts 	onto one common technological platform has important benefits.</li>
</ul>
<p style="margin-bottom: 0in;">In essence, Greenplum is pitching the story:</p>
<ul>
<li>Thesis: Enterprise Data Warehouses 	(EDWs)</li>
<li>Antithesis: Data Warehouse 	Appliances</li>
<li>Synthesis: Greenplum&#8217;s Enterprise 	Data Cloud vision</li>
</ul>
<p style="margin-bottom: 0in;">When put that starkly, it&#8217;s overstated, not least because</p>
<p style="margin-left: 0.49in; margin-bottom: 0in;">Specialized Analytic DBMS != Data Warehouse Appliance</p>
<p style="margin-bottom: 0in;">But basically it makes sense, for two main reasons:</p>
<ul>
<li>Analysis is performed on all sorts 	of novel data, from sources far beyond an enterprise&#8217;s core 	transactions.  This data neither has to fit nor particularly 	benefits from being tightly fitted into the core enterprise data 	model.  Requiring it to do so is just an unnecessary and painful 	bureaucratic delay.</li>
<li>On the other hand, consolidation 	can be a good idea even when systems don&#8217;t particularly 	interoperate. Data marts, which commonly do in part interoperate 	with central data stores, have all the more reason to be 	consolidated onto a central technology platform/stack.</li>
</ul>
<p style="margin-bottom: 0in;"><span id="more-805"></span><span style="font-style: normal;">Of course, the EDC vision isn&#8217;t quite as new or differentiated as Greenplum ideally would wish one to believe.</span></p>
<ul>
<li><span style="font-style: normal;">To 	a first approximation, EDC sounds a lot like <a href="../2009/04/30/ebays-two-enormous-data-warehouses/">what 	eBay has already built on Teradata equipment.</a> </span></li>
<li><span style="font-style: normal;">Greenplum&#8217;s EDC vision also sounds a lot like what Stuart Frost was talking about at DATAllegro</span>, <a href="../2009/03/02/closing-the-book-on-the-datallegro-customer-base/">what Dell was planning to build on DATAllegro equipment</a>, and what Stuart continues to talk about now that he&#8217;s been acquired into Microsoft.</li>
<li>Something like EDC can also be 	presumed to be implicit in the strategies of the other 	one-size-fits-all vendors &#8212; i.e., Oracle and IBM.</li>
<li>Greenplum has only implemented a 	little more of the EDC vision so far than have other firms, unless 	you give it credit for being cheap/fast/MPP/running on commodity 	hardware, but deny that credit to Teradata (specialized hardware, 	and not cheap in its most popular configurations), Oracle (ditto for 	Exadata), IBM (also not cheap), or Microsoft/DATAllegro (not 	released yet).</li>
<li>Specifically: In <a href="../2009/06/05/greenplum-update-release-3-3/">Greenplum 	Release 3.3</a>, which is being announced today, Greenplum is 	introducing the (enhanced?) ability for data marts to be spun out as 	a background operation, while the database otherwise remains 	functional.  As of 3.3, spinning out a data mart is a command-line 	operation. But in Release 3.4, Greenplum plans to offer a web-based 	interface for same, at which point the &#8220;self-service data mart 	creation&#8221; discussion will become operative.  Otherwise, EDC is 	a roadmap/vision/statement-of-direction much more than it is a 	fully-baked technical project.</li>
</ul>
<p style="margin-bottom: 0in;">One particular source of potential confusion is Greenplum&#8217;s emphasis on the buzzphrase <em>self-service (data mart).</em> This seems to be a conflation of two related concepts:</p>
<ul>
<li><strong>End users should be able to 	create new data marts themselves.</strong> Strictly speaking, I view this 	ability as useless at most enterprises, and important at very few, 	because of logistical issues.  (Who gives the permissions? Who 	decides which hardware is used?)  That said, useless &#8220;end user&#8221; 	tools often wind up being important productivity aids for IT 	professionals, and this kind of &#8220;self-service&#8221; would 	surely be another example. <em> Edit: Hmm. Doug Henschen inspired me to think that over again, and I&#8217;m beginning to soften. Suppose users could order up the data mart they want, perhaps test it at a very low processing priority (if they choose), and then send the completed request to IT for approval and provisioning. That would have some value.</em></li>
<li><strong>End users should be able to 	manage data marts themselves, once created.</strong> That&#8217;s a great 	idea, full of agility and don&#8217;t-make-IT-a-roadblock goodness. Data 	miners and similar analytic professionals commonly have the 	technical ability to manage a simple database, and should be allowed 	to do so if it&#8217;s ensured that they don&#8217;t break anything for anybody 	else.</li>
</ul>
<p style="margin-bottom: 0in;">One thing that&#8217;s needed for this technology to come to full fruition is sophisticated data movement and synchronization.  Ideally, some tables in a data mart could be virtual &#8212; views against a central database. But others would be physically recopied from the center, with all the ETL/ELT/ETLT/replication issues that entails. Meanwhile, it&#8217;s not obvious that the ideal architecture is a simpleminded hub-spoke &#8212; perhaps one should be able to spin data marts out of other marts, perhaps at least somewhat reducing the proliferation of tables and the recopying of data. And it should be easy for administrators to change deployment strategies, e.g. by starting a table out as a view and changing over to making it a physical copy as usage profiles change.</p>
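<p style="margin-bottom: 0in;">The view-versus-physical-copy tradeoff described above can be sketched in a few lines. This is only an illustration, using Python&#8217;s built-in <code>sqlite3</code> module as a stand-in for a real analytic DBMS; the table and mart names (<code>central_sales</code>, <code>mart_east</code>) are hypothetical, and nothing here reflects how Greenplum or Teradata actually implement data mart spin-out.</p>

```python
import sqlite3

# Hypothetical mart definition: one slice of a hypothetical central table.
MART_QUERY = "SELECT region, amount FROM central_sales WHERE region = 'east'"

def build_demo():
    """Create a toy central table plus a virtual (view-based) mart table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE central_sales (region TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO central_sales VALUES (?, ?)",
        [("east", 10), ("west", 20), ("east", 5)],
    )
    conn.execute("CREATE VIEW mart_east AS " + MART_QUERY)
    return conn

def materialize(conn):
    """Switch mart_east from a view to a physical copy of the same rows."""
    conn.execute("DROP VIEW mart_east")
    conn.execute("CREATE TABLE mart_east AS " + MART_QUERY)

def mart_total(conn):
    """Sum the amounts visible through the mart, whatever its deployment."""
    rows = conn.execute("SELECT SUM(amount) FROM mart_east").fetchall()
    return rows[0][0]

conn = build_demo()
view_total = mart_total(conn)        # the view reads central data live
conn.execute("INSERT INTO central_sales VALUES ('east', 7)")
live_total = mart_total(conn)        # the view sees the new row immediately
materialize(conn)
conn.execute("INSERT INTO central_sales VALUES ('east', 3)")
stale_total = mart_total(conn)       # the physical copy does not see this row
print(view_total, live_total, stale_total)
```

<p style="margin-bottom: 0in;">The stale total at the end is the point: once a mart table becomes a physical copy, something has to refresh it, which is where the ETL/ELT/replication issues come in.</p>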
<p style="margin-bottom: 0in;">Oliver Ratzesberger of eBay also argues that workload management &#8212; <a href="../2009/06/08/more-on-fox-interactive-medias-use-of-greenplum/">not a current Greenplum strength</a> &#8212; can be crucial. For example, if the CEO wants the CFO to get her an answer TODAY, the fastest approach may be to create an entirely virtual data mart, with very favorable SLAs (Service Level Agreements).  More generally, if you&#8217;re setting up dozens of marts that contain views of the central database, sophisticated SLA management can be essential. There&#8217;s a big virtualization opportunity here &#8212; but virtualization requires a lot of system management infrastructure.</p>
<p style="margin-bottom: 0in;"><em><strong>Related links</strong></em></p>
<ul>
<li>My recent post on <a href="http://www.dbms2.com/2009/05/30/reinventing-business-intelligence/" >reinventing 	business intelligence</a></li>
<li>Greenplum adviser Joe 	Hellerstein&#8217;s pitch for <a href="http://databeta.wordpress.com/2009/03/20/mad-skills/" onclick="javascript:pageTracker._trackPageview('/databeta.wordpress.com');">agile data warehousing</a></li>
<li>Charlie 	Bachman&#8217;s &#8220;<a href="http://www.oberon2005.ru/paper/cb2004-01e.pdf" onclick="javascript:pageTracker._trackPageview('/www.oberon2005.ru');">private database</a>&#8221; idea, which never went 	anywhere (pp. 138-139)</li>
<li>Greenplum&#8217;s <a href="http://www.prweb.com/releases/2009/06/prweb2505854.htm" onclick="javascript:pageTracker._trackPageview('/www.prweb.com');">EDC</a> and <a href="http://www.prweb.com/releases/2009/06/prweb2505844.htm" onclick="javascript:pageTracker._trackPageview('/www.prweb.com');">Release 3.3</a> press releases</li>
<li>An interview featuring Greenplum co-founder Scott Yara&#8217;s own words</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/06/08/the-future-of-data-marts/feed/</wfw:commentRss>
		<slash:comments>22</slash:comments>
		</item>
		<item>
		<title>eBay&#8217;s two enormous data warehouses</title>
		<link>http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/</link>
		<comments>http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/#comments</comments>
		<pubDate>Thu, 30 Apr 2009 10:13:47 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=770</guid>
		<description><![CDATA[A few weeks ago, I had the chance to visit eBay, meet briefly with Oliver Ratzesberger and his team, and then catch up later with Oliver for dinner. I&#8217;ve already alluded to those discussions in a couple of posts, specifically on MapReduce (which eBay doesn&#8217;t like) and the astonishingly great difference between high- and low-end [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">A few weeks ago, I had the chance to visit eBay, meet briefly with Oliver Ratzesberger and his team, and then catch up later with Oliver for dinner. I&#8217;ve already alluded to those discussions in a couple of posts, specifically on <a href="../2009/04/14/ebay-thinks-mpp-dbms-clobber-mapreduce/">MapReduce</a> (which eBay doesn&#8217;t like) and the astonishingly <a href="../2009/04/28/data-warehouse-storage-options-cheap-expensive-or-solid-state-disk-drives/">great difference between high- and low-end disk drives </a>(to which eBay clued me in).  Now I&#8217;m finally getting around to writing about the core of what we discussed, which is two of the very largest data warehouses in the world.</p>
<p style="margin-bottom: 0in;">Metrics on eBay&#8217;s main Teradata data warehouse include:</p>
<ul>
<li><strong>&gt;2 petabytes of user data</strong></li>
<li><strong>10s of 1000s of users</strong></li>
<li><strong>Millions of queries per day</strong></li>
<li>72 nodes</li>
<li>&gt;140 GB/sec of I/O, or <strong>2 	GB/node/sec,</strong> or maybe that&#8217;s a peak when the workload is 	scan-heavy</li>
<li>100s of production databases being 	fed in</li>
</ul>
<p style="margin-bottom: 0in;">Metrics on eBay&#8217;s Greenplum data warehouse (or, if you like, data mart) include:</p>
<ul>
<li><strong>6  1/2 petabytes of user data</strong></li>
<li><strong>17 trillion records</strong></li>
<li><strong>150 billion new records/day, </strong><span>which seems to suggest an 	ingest rate well over </span><strong>50 terabytes/day</strong></li>
<li>96 nodes</li>
<li><strong>200 MB/node/sec</strong> of I/O 	(that&#8217;s the order of magnitude difference that triggered my post on 	disk drives)</li>
<li>4.5 petabytes of storage</li>
<li>70% compression</li>
<li>A small number of concurrent users</li>
</ul>
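<p style="margin-bottom: 0in;">As a rough sanity check on the two sets of metrics, the following Python sketch redoes the arithmetic. It assumes decimal units (1 GB = 10^9 bytes) and treats the quoted lower bounds (&gt;140 GB/sec, &#8220;well over&#8221; 50 terabytes/day) as if they were exact, so the results are approximations only.</p>

```python
# Back-of-envelope check of the quoted metrics (decimal units assumed).
MB = 10**6
GB = 10**9
TB = 10**12

def per_node_rate(total_bytes_per_sec, nodes):
    """Aggregate I/O divided evenly across nodes."""
    return total_bytes_per_sec / nodes

def bytes_per_record(bytes_per_day, records_per_day):
    """Average record size implied by a daily volume and a daily record count."""
    return bytes_per_day / records_per_day

teradata_per_node = per_node_rate(140 * GB, 72)   # about 1.94 GB/node/sec
greenplum_per_node = 200 * MB                     # 200 MB/node/sec, as quoted
io_ratio = teradata_per_node / greenplum_per_node # roughly an order of magnitude
implied_record_size = bytes_per_record(50 * TB, 150 * 10**9)

print(round(teradata_per_node / GB, 2), round(io_ratio, 1), round(implied_record_size))
# prints: 1.94 9.7 333
```

<p style="margin-bottom: 0in;">So 50 terabytes/day spread over 150 billion records implies an average of a few hundred bytes per record, and the per-node I/O gap between the two systems is close to 10X.</p>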
<p style="margin-bottom: 0in;"><a href="../2008/10/15/teradatas-petabyte-power-players/"><span id="more-770"></span>eBay&#8217;s Teradata installation</a> is a full enterprise data warehouse.   Besides size and scope, it is most notable for its implementation of Oliver&#8217;s misleadingly named <a href="http://www.xlmpp.com/articles/16-articles/39-analytics-as-a-service" onclick="javascript:pageTracker._trackPageview('/www.xlmpp.com');">analytics-as-a-service</a> vision. In essence, eBay spins out dozens of <strong><em>virtual data marts,</em></strong><span style="font-style: normal;"> which:</span></p>
<ul>
<li><span style="font-style: normal;">Combine 	views and aggregations on the central data warehouse with 	(optionally) additional &#8220;private&#8221; data the data mart user 	loads in. </span></li>
<li><span style="font-style: normal;">Are 	usually &lt;5 terabytes in size, and indeed often &lt;500 gigabytes. </span></li>
<li><span style="font-style: normal;">Can 	be created &#8220;instantaneously&#8221; by setting permissions, 	resource quotas, and the like.</span></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">The whole scheme relies heavily on Teradata&#8217;s workload management software to deliver with assurance on many SLAs (Service-Level Agreements) at once. </span><em>Resource partitions</em><span style="font-style: normal;"> are a key concept in all this.</span></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">So far as I can tell, eBay uses Greenplum to manage one kind of data &#8212; web and network event logs.  These seem to be managed primarily at two levels of detail &#8212; Oliver said that the 17 trillion event detail records reduce to 1 trillion real event records. When I asked where the 17:1 ratio comes from, Oliver explained that a single web page click &#8212; which is what is memorialized in an event record &#8212; resulted in 50-150 details. That leaves a missing factor of 3-8X, but perhaps other less complex kinds of events are also mixed in.</span></p>
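<p style="margin-bottom: 0in;">The &#8220;missing factor&#8221; arithmetic can be spelled out as follows. This is a minimal sketch that simply divides the quoted 50-150 details per click by the observed 17:1 ratio; the variable names are illustrative.</p>

```python
# Observed reduction from detail records to real event records.
detail_records = 17 * 10**12
event_records = 1 * 10**12
observed_ratio = detail_records / event_records   # 17.0

# Quoted range of detail records generated by a single page click.
details_low, details_high = 50, 150

# How far the observed ratio falls short of the per-click figures.
missing_low = details_low / observed_ratio
missing_high = details_high / observed_ratio

print(round(missing_low, 1), round(missing_high, 1))
# prints: 2.9 8.8
```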
<p style="margin-bottom: 0in;"><span style="font-style: normal;">The Greenplum metrics I quoted above represent over 100 days of data. Ultimately, eBay expects to keep 90-180 days of ultimate detail, and &gt;1 year of event data. The 6 1/2 petabyte figure comes from dividing 2 petabytes of compressed data by (100%-70%). Since that all fits on a 4 1/2 petabyte system, I presume there&#8217;s only one level of mirroring (duh), not much temp space, and even less in the way of indexes.</span></p>
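<p style="margin-bottom: 0in;">The storage arithmetic works out as in the sketch below, taking roughly 2 petabytes of compressed data, 70% compression, and a single mirror copy as inputs. The mirroring factor of 2 is the presumption stated above, not a confirmed configuration detail.</p>

```python
def uncompressed_size(compressed, compression_ratio):
    """Size before compression, given the fraction of space that compression saves."""
    return compressed / (1.0 - compression_ratio)

compressed_pb = 2.0        # compressed user data, in petabytes
compression_ratio = 0.70   # 70% of the original size is saved
system_pb = 4.5            # raw storage on the system

user_data_pb = uncompressed_size(compressed_pb, compression_ratio)
mirrored_pb = compressed_pb * 2            # assumption: one level of mirroring
headroom_pb = system_pb - mirrored_pb      # left for temp space and indexes

print(round(user_data_pb, 2), mirrored_pb, headroom_pb)
# prints: 6.67 4.0 0.5
```

<p style="margin-bottom: 0in;">Under those assumptions only about half a petabyte remains once both copies are stored, which is consistent with there being little temp space and few indexes.</p>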
<p style="margin-bottom: 0in;">Two uses of eBay&#8217;s Greenplum database are disclosed &#8212; whittling down from detailed to click-level event data, and sessionization. The latter seems to be done in batch runs and take 30 minutes per day.  A couple of other uses are undisclosed.  I assume eBay is doing something that requires UDFs (User-Defined Functions), because Oliver remarked that he likes the language choices offered by Greenplum&#8217;s Postgres-based UDF capability. But basically eBay&#8217;s Greenplum database is used for and evidently does very nicely at:</p>
<ul>
<li>Data ingest &#8212; it&#8217;s the first 	place log data goes</li>
<li>Feeding the Teradata database</li>
<li>A small number of big queries</li>
</ul>
<p style="margin-bottom: 0in;">eBay&#8217;s Teradata database handles the rest.</p>
<p style="margin-bottom: 0in;"><em><strong>Related links:</strong></em></p>
<ul>
<li>Wal-Mart, Bank of America, another financial services company, and Dell also have <a href="http://www.dbms2.com/2008/10/15/teradatas-petabyte-power-players/" >very large Teradata databases</a>.</li>
<li>Yahoo&#8217;s web/network events database, running on proprietary software, sounded about <a href="http://www.dbms2.com/2008/05/29/yahoo-scales-web-analytics-database-petabyte/" >1/6th the size of eBay&#8217;s Greenplum system</a> when it was described about a year ago.</li>
<li><a href="http://www.dbms2.com/2009/04/15/cloudera-presents-the-mapreduce-bull-case/" >Facebook has 2 1/2 petabytes managed by Hadoop</a> &#8212; without a DBMS!</li>
<li>Fox Interactive Media/MySpace has multi-hundred terabyte databases running on each of <a href="http://www.dbms2.com/2009/03/05/fox-interactive-medias-multi-hundred-terabyte-database-running-on-greenplum/" >Greenplum</a> and Aster Data <a href="http://www.dbms2.com/2009/03/05/myspaces-multi-hundred-terabyte-database-running-on-aster-data/" >nCluster</a>.</li>
<li><a href="http://www.dbms2.com/2008/05/23/data-warehouse-appliance-power-user-teoco/" >TEOCO has 100s of terabytes running on DATAllegro</a>.</li>
<li>To a probably lesser extent, the same is now also true of <a href="http://www.dbms2.com/2009/03/02/closing-the-book-on-the-datallegro-customer-base/" >Dell</a>.</li>
<li><a href="http://www.dbms2.com/2009/04/25/vertica-pricing-and-customer-metrics/" >Vertica has a couple of unnamed customers with databases in the 200 terabyte range</a>.</li>
<li>In response to this post, Greenplum CTO Luke Lonergan quickly <a href="http://www.greenplum.com/news/197/231/CTO-View-Singularity-or-astounding-scale-at-eBay/d,blog/" onclick="javascript:pageTracker._trackPageview('/www.greenplum.com');">blogged about the eBay project</a>. Other related posts on the same blog may follow.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/feed/</wfw:commentRss>
		<slash:comments>28</slash:comments>
		</item>
		<item>
		<title>Data warehouse storage options &#8212; cheap, expensive, or solid-state disk drives</title>
		<link>http://www.dbms2.com/2009/04/28/data-warehouse-storage-options-cheap-expensive-or-solid-state-disk-drives/</link>
		<comments>http://www.dbms2.com/2009/04/28/data-warehouse-storage-options-cheap-expensive-or-solid-state-disk-drives/#comments</comments>
		<pubDate>Tue, 28 Apr 2009 09:47:24 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Storage]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=768</guid>
		<description><![CDATA[This is a long post, so I&#8217;m going to recap the highlights up front.  In the opinion of somebody I have high regard for, namely Carson Schmidt of Teradata:

There&#8217;s currently a huge &#8212; one 	order of magnitude &#8212; performance difference between cheap and 	expensive disks for data warehousing workloads.
New disk generations coming soon 	will [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">This is a long post, so I&#8217;m going to recap the highlights up front.  In the opinion of somebody I have high regard for, namely Carson Schmidt of Teradata:</p>
<ul>
<li>There&#8217;s currently a huge &#8212; one 	order of magnitude &#8212; performance difference between cheap and 	expensive disks for data warehousing workloads.</li>
<li>New disk generations coming soon 	will have best-of-both-worlds aspects, combining high-end 	performance with lower-end cost and power consumption.</li>
<li><a href="../2008/10/23/teradata-solid-state-drives-ssd/">Solid-state 	drives</a> will likely add one or two orders of magnitude to 	performance a few years down the road.  Echoing the most famous 	logjam in VC history &#8212; namely the 60+ hard disk companies that got 	venture funding in the 1980s &#8212; 20+ companies are vying to cash in.</li>
</ul>
<p style="margin-bottom: 0in;">In other news, Carson likes 10 Gigabit Ethernet, dislikes Infiniband, and is &#8220;ecstatic&#8221; about Intel&#8217;s Nehalem, which will be the basis for Teradata&#8217;s next generation of servers.</p>
<p style="margin-bottom: 0in;"><span id="more-768"></span>Here&#8217;s the longer version.</p>
<p style="margin-bottom: 0in;">Oliver Ratzesberger of eBay made the interesting comment to me that 15K RPM disk drives could have 10X or more the performance of 7200 RPM ones, a difference that clearly is not explained just by rotational speed.  He said this was due to the large number of retries required by the cheaper drives, which eBay had tested as being in the 5-8X range on its particular equipment, for an overall 10X+ difference in effective scan rates.  When I continued to probe, Oliver suggested that the guy I really should talk with is Carson Schmidt of Teradata, advice I took eagerly based on <a href="../2008/10/23/teradata-solid-state-drives-ssd/">past experience</a>.</p>
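<p style="margin-bottom: 0in;">One way to read those figures is that the rotational-speed advantage and the retry penalty multiply. That multiplicative model is an assumption made here for illustration, not something Oliver stated; under it, the numbers come out as follows.</p>

```python
# Rotational speed alone explains only about a 2X gap.
rotational_ratio = 15000 / 7200

# Quoted range for the retry penalty on the cheaper drives.
retry_low, retry_high = 5, 8

# Assumption: the two effects combine multiplicatively.
combined_low = rotational_ratio * retry_low
combined_high = rotational_ratio * retry_high

print(round(rotational_ratio, 2), round(combined_low, 1), round(combined_high, 1))
# prints: 2.08 10.4 16.7
```

<p style="margin-bottom: 0in;">If the two effects do combine this way, a 5-8X retry penalty on top of a roughly 2X rotational gap lands in the 10X+ range quoted for effective scan rates.</p>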
<p style="margin-bottom: 0in;">Yesterday, Carson &#8212; who was unsurprised at Oliver&#8217;s figures* &#8212; patiently explained to me his views of the current differences between <em>cheap</em> and <em>expensive</em> disk drives. (Carson uses the terms &#8220;near-line&#8221; and &#8220;enterprise-class&#8221;.)  Besides price, cheap drives optimize for power consumption, while expensive drives optimize for performance and reliability.  Currently, for Teradata, &#8220;cheap&#8221; equates to SATA, &#8220;expensive&#8221; equates to Fibre Channel, and SAS 1.0 isn&#8217;t used.  But SAS 2.0, coming soon, will supersede both of those interfaces, as discussed below.</p>
<p style="margin-bottom: 0in;"><em>*Carson did note that the performance differential varied significantly by the kind of workload. The more mixed and oriented to random reads the workload is, the bigger the difference. If you&#8217;re just doing sequential scans, it&#8217;s smaller.  Oliver&#8217;s order-of-magnitude figure seemed to be based on scan-heavy tasks.</em></p>
<p style="margin-bottom: 0in;">As I understand Carson&#8217;s view, <strong>mechanical features</strong> sported only by expensive drives include:</p>
<ul>
<li>Smaller media, more platters, and 	more disk heads</li>
<li>Faster rotational speeds</li>
<li>Enclosures that do a better job of 	damping vibration from disk rotation or fans.</li>
</ul>
<p style="margin-bottom: 0in;"><strong>Electronic features</strong> of expensive storage include:</p>
<ul>
<li>More CPU (at least 2X)</li>
<li>More RAM (also at least 2X), which 	is useful for caching.</li>
<li>Dual ports for networking.  	Teradata doesn&#8217;t just use dual storage ports for reliability; it 	load balances across them and sometimes gets significantly enhanced 	performance.</li>
</ul>
<p style="margin-bottom: 0in;">Finally, there is <strong>firmware,</strong> in which expensive disk drives seem to have two major kinds of advantages:</p>
<ul>
<li>Command scheduling/queuing, which 	Carson believes provides a benefit at least comparable to the 2X 	derived from different rotational speeds.</li>
<li>Better data integrity checking, in 	line with the T10 DIF standard. Not only does this seem to give much 	higher reliability, but it can be done closer to the platter, 	yielding a performance advantage.</li>
</ul>
<p style="margin-bottom: 0in;">Apparently, this isn&#8217;t even possible for <strong>SATA</strong> and <strong>SAS 1.0</strong> disk drives, but is common for drives that use the <strong>Fibre Channel</strong> interface, and will also be possible in the forthcoming <strong>SAS 2.0</strong> standard. (As you may have guessed, I&#8217;m a little fuzzy on the details of this firmware stuff.)</p>
<p style="margin-bottom: 0in;">In Carson&#8217;s view, the disk drive industry has consolidated to the point that there are two credible vendors of expensive/enterprise-class disk drives:  Seagate and Hitachi. What Teradata actually uses in <a href="../2008/10/23/teradata-appliance-product-lines/">its own systems</a> right now is:</p>
<ul>
<li>In Teradata&#8217;s high-end 5550 line &#8211; Seagate Fibre Channel 3-1/2&#8243; drives</li>
<li>In Teradata&#8217;s mid-range 2550 line 	&#8211; SAS drives from Seagate and perhaps also Hitachi. I get the 	impression these have some of the electromechanical features of 	expensive drives, but not the firmware.</li>
<li>In Teradata&#8217;s low-end 1550 line &#8212; 	Hitachi 1-TB cheap drives.</li>
</ul>
<p style="margin-bottom: 0in;">All this is of course subject to change. In the short term that mainly means the possible use of alternate suppliers. As the Teradata product line is repeatedly refreshed, however, greater changes will occur. Some of the biggest are:</p>
<ul>
<li><strong>A new SAS 2.0 standard</strong> will 	allow enterprise-class firmware for cheaper disks.</li>
<li><strong>The form factor for high-end 	disk drives will shrink</strong> from 3 1/2&#8243; drives to 2 1/2&#8243; drives 	of 1/2 the volume.</li>
<li><strong>The rotation speed sweet spot 	may actually decrease,</strong> to 10K RPM, with offsetting improvements 	to seek and latency so as not to cut performance. Power consumption 	benefits will ensue.</li>
<li>There probably will be <strong>multi-TB 	SAS drives</strong> &#8212; &#8220;fat SAS.&#8221;  SATA may be enhanced to 	compete  with those. And by the way, SAS and SATA are electrically 	compatible, and hence could be combined in the same system.</li>
</ul>
<p style="margin-bottom: 0in;">I got the impression that at least the first three of these developments are expected soon, perhaps within a year.</p>
<p style="margin-bottom: 0in;">And in a few years all of this will be pretty moot, because <strong>solid-state drives (SSDs)</strong> will be taking over.  Carson thinks SSDs will have a 100X performance benefit versus disk drives, a figure that took me aback. However, he&#8217;s not yet sure about how fast SSDs will mature. Also complicating things is a possible transition some years down the road from SLC (Single-Level Cell) to MLC (Multi-Level Cell) SSDs. MLC SSDs, which store multiple bits of information per cell, are surely denser than SLC SSDs.  I don&#8217;t know whether they&#8217;re more power efficient as well.</p>
<p style="margin-bottom: 0in;">The main weirdnesses Carson sees in SSDs are those I&#8217;ve highlighted in the following quote from <a href="http://en.wikipedia.org/wiki/Flash_memory" onclick="javascript:pageTracker._trackPageview('/en.wikipedia.org');">Wikipedia</a>:</p>
<blockquote><p>One limitation of flash memory is that although it can be read or programmed a byte or a word at a time in a random access fashion, it must be erased a &#8220;block&#8221; at a time. &#8230;</p>
<p>Another limitation is that flash memory has a finite number of erase-write cycles. &#8230; This effect is partially offset in some chip firmware or file system drivers by counting the writes and dynamically remapping blocks in order to spread write operations between sectors; this technique is called <a href="http://en.wikipedia.org/wiki/Wear_levelling" onclick="javascript:pageTracker._trackPageview('/en.wikipedia.org');">wear leveling.</a> Another approach is to perform write verification and remapping to spare sectors in case of write failure, a technique called bad block management (BBM).</p></blockquote>
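<p style="margin-bottom: 0in;">To make the wear-leveling idea concrete, here is a toy Python sketch. It is a deliberately simplified model (one logical address maps to one physical block, and every write goes to the least-erased available block); the class and method names are hypothetical, and real SSD firmware is far more involved.</p>

```python
class ToyWearLeveler:
    """Toy model: each write is steered to the least-worn available block."""

    def __init__(self, physical_blocks):
        self.erase_counts = [0] * physical_blocks   # erase-write cycles per block
        self.data = [None] * physical_blocks        # payload held by each block
        self.mapping = {}                           # logical address to physical block

    def write(self, logical, payload):
        in_use = set(self.mapping.values())
        old = self.mapping.get(logical)
        # Candidates: free blocks, plus the block this logical address already holds.
        candidates = [
            block for block in range(len(self.erase_counts))
            if block not in in_use or block == old
        ]
        # Pick the block with the fewest erase-write cycles (ties go to the lowest index).
        target = min(candidates, key=lambda block: (self.erase_counts[block], block))
        if old is not None and old != target:
            self.data[old] = None                   # release the previous block
        self.erase_counts[target] += 1
        self.mapping[logical] = target
        self.data[target] = payload

    def read(self, logical):
        return self.data[self.mapping[logical]]

leveler = ToyWearLeveler(4)
for version in range(8):
    leveler.write(0, "v" + str(version))   # hammer a single logical address

print(leveler.erase_counts, leveler.read(0))
# prints: [2, 2, 2, 2] v7
```

<p style="margin-bottom: 0in;">Eight writes to the same logical address end up spread evenly over four physical blocks, rather than wearing out one block eight times.</p>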
<p style="margin-bottom: 0in;">And finally, I unearthed a couple of non-storage tidbits, since I was talking with Carson anyway:</p>
<ul>
<li>Carson has become a 10 GigE &#8220;bigot&#8221;, and Teradata will soon certify 10 Gigabit Ethernet cards for connectivity to external systems. Carson&#8217;s interest in Infiniband, never high, went entirely away after Cisco withdrew its commitment to it.  Obviously, this stands in contrast to the endorsements of Infiniband for data warehousing by <a href="http://www.dbms2.com/2008/10/17/oracle-notes/" >Oracle</a> and <a href="http://www.dbms2.com/2009/02/23/microsoft-sql-server-fast-track/" >Microsoft</a>.</li>
<li><span style="font-style: normal;">Intel&#8217;s 	Neh</span>alem will be the basis for Teradata&#8217;s next server product. 	 Carson is &#8220;ecstatic&#8221; with Intel at the moment, which is 	different from his stance at other times.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/04/28/data-warehouse-storage-options-cheap-expensive-or-solid-state-disk-drives/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>eBay thinks MPP DBMS clobber MapReduce</title>
		<link>http://www.dbms2.com/2009/04/14/ebay-thinks-mpp-dbms-clobber-mapreduce/</link>
		<comments>http://www.dbms2.com/2009/04/14/ebay-thinks-mpp-dbms-clobber-mapreduce/#comments</comments>
		<pubDate>Tue, 14 Apr 2009 05:53:41 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=747</guid>
		<description><![CDATA[I talked with Oliver Ratzesberger and his team at eBay last week, who I already knew to be MapReduce non-fans.  This time I added more detail.
Oliver believes that, on the whole, MapReduce is 6-8X slower than native functionality in an MPP DBMS, and hence should only be used sporadically. This view is based on [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">I talked with Oliver Ratzesberger and his team at eBay last week, who I already knew to be <a href="../2008/10/15/ebay-doesnt-love-mapreduce/">MapReduce non-fans</a>.  This time I added more detail.</p>
<p style="margin-bottom: 0in;">Oliver believes that, on the whole, MapReduce is 6-8X slower than native functionality in an MPP DBMS, and hence should only be used sporadically. This view is based in part on simulations eBay ran of the <a href="http://perspectives.mvdirona.com/2008/11/22/GoogleMapReduceWinsTeraSort.aspx" onclick="javascript:pageTracker._trackPageview('/perspectives.mvdirona.com');">Terasort</a> benchmark.  On 72 Teradata nodes or 96 lower-powered nodes running another (currently unnamed, as per yet another of my PR fire drills) MPP DBMS, a simulation of Terasort executed in 78 and 120 secs respectively, which is very comparable to the times Google and Yahoo got on 1000 nodes or more.</p>
<p style="margin-bottom: 0in;">And by the way, if you use many fewer nodes, you also consume much less floor space and electric power.</p>
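<p style="margin-bottom: 0in;">One crude way to compare such runs is in node-seconds (nodes times elapsed time). The sketch below uses the two eBay figures quoted above; the 1000-node run finishing in the same 78 seconds is purely a hypothetical stand-in for &#8220;very comparable&#8221; times, not a reported Google or Yahoo result, and node counts say nothing about how powerful each node is.</p>

```python
def node_seconds(nodes, seconds):
    """Crude resource measure: how many node-seconds a run consumed."""
    return nodes * seconds

teradata_run = node_seconds(72, 78)      # eBay simulation on Teradata
other_mpp_run = node_seconds(96, 120)    # eBay simulation on the other MPP DBMS

# Assumption for illustration only: a 1000-node cluster with the same elapsed time.
hypothetical_cluster_run = node_seconds(1000, 78)

ratio = hypothetical_cluster_run / teradata_run

print(teradata_run, other_mpp_run, round(ratio, 1))
# prints: 5616 11520 13.9
```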
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/04/14/ebay-thinks-mpp-dbms-clobber-mapreduce/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Named customer silliness</title>
		<link>http://www.dbms2.com/2009/03/02/named-customer-silliness/</link>
		<comments>http://www.dbms2.com/2009/03/02/named-customer-silliness/#comments</comments>
		<pubDate>Mon, 02 Mar 2009 12:14:27 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=708</guid>
		<description><![CDATA[Neither Greenplum nor eBay will say for the record that eBay is a Greenplum customer. Indeed, saying that is quite verboten. On the other hand, Greenplum&#8217;s press release boilerplate says that Skype is a Greenplum customer, and Skype is of course a subsidiary of eBay.  (Edit: Speaking of silliness, fixed a typo there.)
The point of [...]]]></description>
			<content:encoded><![CDATA[<p>Neither Greenplum nor eBay will say for the record that eBay is a Greenplum customer. Indeed, saying that is quite <em>verboten. </em>On the other hand, Greenplum&#8217;s <a href="http://www.greenplum.com/news/174/388/Greenplum-Strengthens-EMEA-Commitment-by-Opening-UK-Office/d,press-releases/" onclick="javascript:pageTracker._trackPageview('/www.greenplum.com');">press release boilerplate</a> says that Skype is a Greenplum customer, and Skype is of course a subsidiary of eBay.  <em>(Edit: Speaking of silliness, fixed a typo there.)</em></p>
<p>The point of such distinctions is sometimes lost on me.</p>
<p>In related news, of <a href="http://www.dbms2.com/2008/08/25/greenplums-single-biggest-customer/" >Greenplum&#8217;s two customers who back in August were supposedly heading into production soon with petabyte-plus databases</a>, one hasn&#8217;t yet made it to that size. (&#8220;As we speak&#8221; turned out to be a longer conversation than I might have anticipated &#8230;.) The other (of course unnamed) customer has, Greenplum assures me, made it that high.  But upon checking with that (unnamed, in case I forgot to mention the point) customer, I don&#8217;t detect a whole lot of enthusiasm about Greenplum.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/03/02/named-customer-silliness/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Data warehousing business trends</title>
		<link>http://www.dbms2.com/2009/02/26/data-warehousing-business-trends/</link>
		<comments>http://www.dbms2.com/2009/02/26/data-warehousing-business-trends/#comments</comments>
		<pubDate>Thu, 26 Feb 2009 19:06:27 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Application areas]]></category>
		<category><![CDATA[Data mart outsourcing]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Microsoft and SQL*Server]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=707</guid>
		<description><![CDATA[I&#8217;ve talked with a whole lot of vendors recently, some here at TDWI, as well as users, fellow analysts, and so on. Repeated themes include:

Large enterprise data warehouse 	projects are often being deferred or cut back.  (My sense is that a 	little of this happened in 2008, but more is happening with new 	budget [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">I&#8217;ve talked with a whole lot of vendors recently, some here at TDWI, as well as users, fellow analysts, and so on. Repeated themes include:<span id="more-707"></span></p>
<ul>
<li>Large enterprise data warehouse 	projects are often being deferred or cut back.  (My sense is that a 	little of this happened in 2008, but more is happening with new 	budget cycles in 2009.)</li>
<li>Smaller projects with credible, 	quick ROIs are doing fine. In some cases, these are the scaled-back 	or nose-under-the-tent parts of bigger enterprise-wide efforts.</li>
<li>Perhaps not coincidentally, the 	technical trend of having a variety of data marts inside an EDW – 	sometimes called “sandboxes” &#8212; is going strong.  I&#8217;ve commented 	on that before in connection with <a href="http://www.dbms2.com/2009/02/23/microsoft-sql-server-fast-track/" ><em>Microsoft</em></a>.  It&#8217;s also at the heart of eBay&#8217;s Teradata-based &#8220;<a href="http://www.intelligententerprise.com/showArticle.jhtml?articleID=211200065" onclick="javascript:pageTracker._trackPageview('/www.intelligententerprise.com');">analytics-as-a-service</a>&#8221; strategy.<em></em></li>
<li>Uses of data 	warehousing for security or compliance still seem strong.  I guess 	“Keep us safe (and out of jail)” is still a strong motivator for 	buying.</li>
<li>Oracle&#8217;s is 	the #1 installed base in which smaller vendors go fishing. 	Teradata&#8217;s is probably #2. Microsoft SQL Server and MySQL are also 	suffering data warehouse competitive replacements.</li>
<li>I haven&#8217;t 	heard many more examples of enterprises <a href="http://www.dbms2.com/2009/02/07/analytics-role-in-a-frightening-economy/" >doing more analysis</a> because the bad economy has invalidated their prior models and 	assumptions.  More&#8217;s the pity.</li>
<li>Projects to 	provide data to one&#8217;s customers are going gangbusters.  There are 	many flavors of this, from pure third-party analytics vendors to 	(for example) credit card companies who sell transactional data back 	to their merchants to governments that become more “transparent” 	by exposing information to their citizens.</li>
</ul>
<p style="margin-bottom: 0in; font-style: normal;">Obviously, if the economy is bad enough, everything will be hurt. E.g., a lot of those data sellers have something to do with advertising, and the underlying business sector is in the tank. Ditto consumer or mass B-to-B marketing as well.  But for now, the worst hits are being suffered by projects with large price tags, long lead times, and unclear benefits.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/02/26/data-warehousing-business-trends/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
