<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; eBay</title>
	<atom:link href="http://www.dbms2.com/category/users/ebay/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Tue, 07 Feb 2012 06:49:30 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>QlikView 11 and the rise of collaborative BI</title>
		<link>http://www.dbms2.com/2011/11/16/qlikview-collaborative-business-intelligence/</link>
		<comments>http://www.dbms2.com/2011/11/16/qlikview-collaborative-business-intelligence/#comments</comments>
		<pubDate>Wed, 16 Nov 2011 13:19:52 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[QlikTech and QlikView]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5681</guid>
		<description><![CDATA[QlikView 11 came out last month. Let me start by pointing out: As one might expect, QlikView 11 contains fairly leading-edge stuff, but also some &#8220;better late than never&#8221; features. The leading-edge stuff is concentrated in the general area of &#8220;collaboration&#8221;. Additionally, QlikTech is always pushing the QlikView user interface ahead in various ways. The [...]]]></description>
			<content:encoded><![CDATA[<p>QlikView 11 came out last month. Let me start by pointing out:</p>
<ul>
<li>As one might expect, QlikView 11 contains fairly leading-edge stuff, but also some &#8220;better late than never&#8221; features.</li>
<li>The leading-edge stuff is concentrated in the general area of &#8220;collaboration&#8221;.</li>
<li>Additionally, QlikTech is always pushing the QlikView user interface ahead in various ways.</li>
<li>The &#8220;Well, it&#8217;s about time!&#8221; feature list starts with the ability to load QlikView via third-party ETL tools (Informatica now, others coming).</li>
<li>QlikTech is generally good at putting up pretty pictures of its product. You can find some in the &#8220;What&#8217;s New in QlikView 11&#8221; document via a general <a href="http://www.qlikview.com/us/explore/resources/brochures-datasheets?language=english&amp;page=1">QlikView resource page</a>.*</li>
<li>Stephen Swoyer wrote <a href="http://tdwi.org/articles/2011/11/01/QlikView-Update-Enterprise-Makeover.aspx">a good article summarizing QlikView 11</a>.</li>
</ul>
<p><em>*One confusing aspect of that paper: non-standard uses of the terms &#8220;analytic app&#8221; and &#8220;document&#8221;.</em></p>
<p>As QlikTech tells it, QlikView 11 adds two kinds of collaboration features:</p>
<ul>
<li>Integration with social media, which QlikTech calls &#8220;asynchronous integration.&#8221;</li>
<li>Direct sharing of the QlikView UI, which QlikTech calls &#8220;synchronous integration.&#8221;</li>
</ul>
<p>I&#8217;d add a third kind, because QlikView 11 also takes some baby steps toward what I regard as a key aspect of BI collaboration &#8212; the ability to define and track your own metrics. It&#8217;s way, way short of what <a href="../../../../../2010/07/25/alerts-metrics-dashboards/">I called for in metric flexibility in a post last year</a>, but at least it&#8217;s a small start.</p>
<p><span id="more-5681"></span>That <strong>direct sharing of user interfaces is a cool feature, which every business intelligence vendor should offer. </strong>In an era of distributed workforces, when people can&#8217;t be assumed able to huddle around the same desk, it has value even for use among close coworkers. But it also should prove useful in a variety of more naturally remote use cases, multiple examples of which can be found in each of the areas of:</p>
<ul>
<li>Support (internal or external).</li>
<li>Faceoffs &#8212; I mean collaborations <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  &#8212; between two or more enterprise departments. Examples might include: manufacturing and purchasing, manufacturing and sales, or accounting and anybody else.</li>
</ul>
<p>As for <strong>social media being used for BI collaboration</strong> &#8212; that&#8217;s generally in the air. For example:</p>
<ul>
<li><a href="http://www.texttechnologies.com/2011/09/14/social-technology-in-the-enterprise/">salesforce.com is pushing enterprise social media use broadly</a>, and will surely increase its emphasis on the social media/BI intersection now that Dave Kellogg is there.</li>
<li>Spotfire has announced similar features in its latest release.</li>
<li>The more cumbersome side of the feature set (portal-based collaboration, emailing of individual reports) has been available from multiple vendors for years.</li>
<li>eBay open-sourced a more dataset-centric version of the idea, just as Oliver Ratzesberger left the firm.*</li>
</ul>
<p><em>*Umm &#8212; does anybody have a link to the project, or at least a name for it? <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </em></p>
<p>BI has been a communication tool since the first green paper report was dumped on the first desk. And there&#8217;s been collaboration in doing analysis at least since it&#8217;s been possible to email .XLS file attachments. Still<strong>, BI is too often used as bludgeon rather than binocular. Hopefully, the current generation of technology will finally serve to change that.</strong></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/16/qlikview-collaborative-business-intelligence/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>What those nested data structures are about</title>
		<link>http://www.dbms2.com/2011/10/19/nested-data-structures/</link>
		<comments>http://www.dbms2.com/2011/10/19/nested-data-structures/#comments</comments>
		<pubDate>Wed, 19 Oct 2011 17:29:59 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5506</guid>
		<description><![CDATA[As I&#8217;ve noted before, the very big web companies have an issue with nested data structures. The subject came up in XLDB talks yesterday too, so my big goal for lunch was to finally understand what was being talked about. Sitting at a table full of eBay and LinkedIn folks turned out to be a [...]]]></description>
			<content:encoded><![CDATA[<p>As I&#8217;ve noted before, <a href="http://www.dbms2.com/2010/07/31/nested-data-structures-keep-coming-up-especially-for-log-files/">the very big web companies have an issue with nested data structures</a>. The subject came up in XLDB talks yesterday too, so my big goal for lunch was to finally understand what was being talked about. Sitting at a table full of eBay and LinkedIn folks turned out to be a good tactic.</p>
<p>The explanation was led by Oliver Ratzesberger, late of eBay* and progenitor of <a href="http://www.dbms2.com/2010/10/06/ebay-followup-greenplum-out-teradata-10-petabytes-hadoop-has-some-value-and-more/">eBay&#8217;s Singularity project</a>. In simplest terms, <strong>one event can spawn a lot of event attribute information, perhaps in the form of name-value pairs,</strong> which it then makes sense to store together in some way. The example Oliver dwelled on was that, on any given web page, there can be 100+ pieces of information to record, including:</p>
<ul>
<li>All 50 search results you were shown, and their positions in the search rankings.</li>
<li>Every ad, image, or graphical element.</li>
<li>An ID as to which test you were participating in (every page you see on eBay has some element being tested).</li>
</ul>
<p><em>*Oliver is leaving eBay for a still-secret large company. I would conjecture that Michael McIntire is on the move too, either to replace Oliver or to go with him, but Oliver did a very good job of not commenting on the matter.</em></p>
<p>There are several reasons why one might wish to store this information in ways that grieve relational purists. First, reconstructing all this information via joins would be brutally expensive. What&#8217;s more, reconstructing all this information via joins could be impractical. Some of it comes from third-party ad servers, which might not reproduce the same ads upon demand. Other information is in the form of rankings, which can&#8217;t always be reliably reproduced from one query to the next. (That&#8217;s just one of several reasons <a href="http://www.dbms2.com/2005/12/09/relational-dbms-versus-text-data/">text search and relational DBMS are an awkward fit</a>.)</p>
<p>Also, there&#8217;s a strong <a href="http://www.dbms2.com/2011/07/31/dynamic-fixed-schema-databases/">dynamic schema</a> flavor to these databases. The list of attributes for one web click might be very different in kind from the list for the next page. Forcing that kind of variability into a fixed relational schema, while theoretically possible, doesn&#8217;t necessarily make a lot of sense.</p>
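<p>To make the join-cost point concrete, here is a minimal Python sketch of one page view held as a nested record, and of what happens when it is split into parent and child rows. Every field name and value below is my own illustrative assumption, not anything eBay or LinkedIn described.</p>

```python
# Hypothetical page-view event; all field names and values are illustrative assumptions.
event = {
    "event_id": 1,
    "test_id": "T-42",
    # All 50 search results shown, with their positions in the ranking.
    "search_results": [{"position": i, "item_id": 1000 + i} for i in range(1, 51)],
    # Ads, images, and other graphical elements on the page.
    "page_elements": [{"kind": "ad", "ref": "ad-7"}, {"kind": "image", "ref": "img-3"}],
}


def flatten(event):
    """Split one nested event into a parent row plus child rows keyed by event_id."""
    parent = {k: v for k, v in event.items() if not isinstance(v, list)}
    children = {
        name: [dict(row, event_id=event["event_id"]) for row in rows]
        for name, rows in event.items()
        if isinstance(rows, list)
    }
    return parent, children


parent, children = flatten(event)
# One parent row plus one child row per nested element.
row_count = 1 + sum(len(rows) for rows in children.values())
print(row_count)  # 53
```

<p>In this toy case a single event turns into 53 rows across three tables, all of which would have to be joined back together to see the page as the user saw it. Stored as one nested record, the event is read back in one piece.</p>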
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/10/19/nested-data-structures/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Some notes on Hadoop (mainly) and appliances</title>
		<link>http://www.dbms2.com/2011/09/23/hadoop-appliances/</link>
		<comments>http://www.dbms2.com/2011/09/23/hadoop-appliances/#comments</comments>
		<pubDate>Fri, 23 Sep 2011 19:59:42 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[EMC]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapR]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5341</guid>
		<description><![CDATA[1. EMC Greenplum has evolved its appliance product line. As I read that, the latest announcement boils down to saying that you can neatly network together various Greenplum appliances in quarter-rack increments. If you take a quarter rack each of four different things, then Greenplum says &#8220;Hooray! Our appliance is all-in-one!&#8221; Big whoop. 2. That [...]]]></description>
			<content:encoded><![CDATA[<p>1. <a href="http://www.greenplum.com/products/greenplum-dca">EMC Greenplum has evolved its appliance product line</a>. As I read that, the latest announcement boils down to saying that you can neatly network together various Greenplum appliances in quarter-rack increments. If you take a quarter rack each of four different things, then Greenplum says &#8220;Hooray! Our appliance is all-in-one!&#8221; Big whoop.</p>
<p>2. That said, the Hadoop part of EMC&#8217;s story is based on MapR, which so far as I can tell is actually a pretty good Hadoop implementation. More precisely, MapR makes strong claims about performance and so on, and Apache Hadoop folks don&#8217;t reply &#8220;MapR is full of &amp;#$!&#8221; Rather, they say &#8220;We&#8217;re going to close the gap with MapR a lot faster than the MapR folks like to think &#8212; and by the way, guys, thanks for the butt-kick.&#8221; A lot more precision about MapR may be found in this <a href="http://www.slideshare.net/mcsrivas/design-scale-and-performance-of-maprs-distribution-for-hadoop">M. C. Srivas SlideShare</a>.</p>
<p>3. On its latest earnings call, Oracle clearly <a href="http://seekingalpha.com/article/294885-oracle-s-ceo-discusses-q1-2012-results-earnings-call-transcript?part=qanda">said it would introduce a Hadoop appliance</a>, versus just <a href="../../../../../2011/06/24/forthcoming-oracle-appliances/">hinting at a Hadoop appliance</a> the prior quarter. The money quote was:  <span id="more-5341"></span></p>
<blockquote><p>Finally, big data or the searching of large amounts of data using Hadoop. After Hadoop finishes filtering the data, the place you want to put that data is an Oracle Database, and that&#8217;s what a lot of our customers are doing. And we are exploiting the trend, the big data technology and the big data trend, if you prefer, by building a Hadoop appliance that attaches to the Oracle Exadata database or any Oracle Database for that matter. But you don&#8217;t have to buy our Hadoop appliance if you can use whatever servers you want running Hadoop, and we provide the interface between Hadoop and the Oracle Database.</p></blockquote>
<p>In other words, Oracle is saying &#8220;We&#8217;d like to sell you a Hadoop appliance, but you can run Hadoop in some other way and we&#8217;ll coexist with it just fine.&#8221; That makes sense; refusing to coexist with Hadoop is not exactly a realistic option.</p>
<p>4. Back in June, I expressed <a href="../../../../../2011/06/02/why-you-would-want-an-appliance-and-when-you-wouldnt/">great skepticism about the idea of a Hadoop appliance</a>. There was at least partial pushback in the comment thread from both Amr Awadallah and Eric Baldeschwieler. Oops.</p>
<p>Their reasoning seems to be centered around matters of installation, administration, and general packaging.</p>
<p>5. A month ago I noted aggressive near-term plans for <a href="../../../../../2011/08/21/hadoop-evolution/">Apache Hadoop evolution</a>. As noted above, one reason this is needed is competition from folks like MapR. Also, I note that:</p>
<ul>
<li>Three years ago, Oliver Ratzesberger&#8217;s group at eBay complained that <a href="../../../../../2008/10/15/ebay-doesnt-love-mapreduce/">CPU utilization running Hadoop was at 18%</a>.</li>
<li><a href="../../../../../2011/08/21/hadoop-evolution/#comment-241679">Now Oliver uses a figure of 10-15%</a>, and attributes an even lower figure to &#8212; I&#8217;m guessing here &#8212; Yahoo. (Another possibility might be Facebook.)</li>
<li>In between, eBay became one of the biggest and most prominent users of Hadoop.</li>
</ul>
<p>The moral of eBay&#8217;s Hadoop adventures, as I see it, is neither &#8220;Hadoop sucks!&#8221; nor &#8220;Hadoop doesn&#8217;t suck!&#8221;; rather, it&#8217;s that there&#8217;s a lot of scope for Hadoop to operate differently in the future than it does today.</p>
<p><em>Similarly, whatever throughput Yahoo does or doesn&#8217;t get, it clearly has adopted Hadoop at the expense of the <a href="../../../../../2008/05/29/yahoo-scales-web-analytics-database-petabyte/">columnar-in-Postgres</a> system it previously was so proud of.</em></p>
<p>Also, there has been a claim going around that &#8212; notwithstanding NameNode&#8217;s status as a single point of Hadoop failure &#8212;  no Hadoop installation has ever lost data due to a NameNode failure. The folks at MapR beg to differ, and sent over <a href="https://issues.apache.org/jira/browse/HDFS-1539">some</a> <a href="http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201107.mbox/%3CCAFUA3X2R_wH9GGGseUVSXVNVZQ+dBjZKDn0_pmDO8U31C05tMw@mail.gmail.com%3E">links</a> that sure seem to say the opposite.</p>
<p>6. Since we&#8217;ve just established that Hadoop will change, rapidly and pretty fundamentally, what exactly is the benefit of an appliance that is &#8220;balanced&#8221; for Hadoop usage today?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/09/23/hadoop-appliances/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Notes and links October 22, 2010</title>
		<link>http://www.dbms2.com/2010/10/22/notes-and-links-october-22-2010/</link>
		<comments>http://www.dbms2.com/2010/10/22/notes-and-links-october-22-2010/#comments</comments>
		<pubDate>Fri, 22 Oct 2010 06:47:05 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[In-memory DBMS]]></category>
		<category><![CDATA[Liberty and privacy]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[ParAccel]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[SAS Institute]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[VoltDB and H-Store]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=3346</guid>
		<description><![CDATA[A number of recent posts have had good comments. This time, I won&#8217;t call them out individually. Evidently Mike Olson of Cloudera is still telling the machine-generated data story, exactly as he should be. The Information Arbitrage/IA Ventures folks said something similar, focusing specifically on &#8220;sensor data&#8221; &#8230; &#8230; and, even better, went on to [...]]]></description>
			<content:encoded><![CDATA[<p>A number of recent posts have had good comments. This time, I won&#8217;t call them out individually.</p>
<p>Evidently <a href="http://www.cscyphers.com/blog/2010/10/12/hadoop-world-2010/">Mike Olson of Cloudera is still telling the machine-generated data story</a>, exactly as he should be. The <a href="http://informationarbitrage.com/post/1359525958/big-ideas-around-big-problems-in-big-data">Information Arbitrage/IA Ventures</a> folks said something similar, focusing specifically on &#8220;sensor data&#8221; &#8230;</p>
<p>&#8230; and, even better, went on to say:  <span id="more-3346"></span></p>
<blockquote><p><strong>Privacy is dead</strong>.<br />
What do we consider to be the boundaries of privacy, especially with respect to items like medical data? In a data privacy-free world, should we be regulating data usage instead? How do we deal with asymmetric access to our personal data, e.g., how is it that insurance companies claim the right to our personal information?</p></blockquote>
<p>Obviously, <a href="http://www.dbms2.com/2010/04/04/privacy-liberty-continued/">my answer to the second question is Yes!!!!</a></p>
<p>Also from Hadoop World &#8212; Dave Menninger, now an analyst, reports on <a href="http://www.ventanaresearch.com/blog/commentblog.aspx?id=4003">some Hadoop metrics</a>:</p>
<blockquote><p><span id="Contentblock1"><span>How big is “big data”? In his opening remarks, Mike shared some statistics from a survey of attendees. The average Hadoop cluster among respondents was 66 nodes and 114 terabytes of data. However there is quite a range. The largest in the survey responses was a cluster of 1,300 nodes and more than 2 petabytes of data. (Presenters from eBay blew this away, describing their production cluster of 8,500 nodes and 16 petabytes of storage.) Over 60 percent of respondents had 10 terabytes or less, and half were running 10 nodes or less.</span></span></p></blockquote>
<p><a href="http://www.dbms2.com/2010/10/06/ebay-followup-greenplum-out-teradata-10-petabytes-hadoop-has-some-value-and-more/">That eBay comment was particularly interesting</a>. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>A while back, Doug Henschen noted that Netezza flagship reference Catalina Marketing is now at <a href="http://intelligent-enterprise.informationweek.com/blog/archives/2010/07/big_data_the_ea.html#more">2.5 petabytes</a>. Most of that is in one 600 billion row table. Oddly, the article talks of the Netezza/SAS partnership accelerating model-building via in-database scoring (not modeling) technology. Doug also wrote of a lot of <a href="http://intelligent-enterprise.informationweek.com/blog/archives/2010/08/whats_at_stake.html#more">analytic DBMS replacements</a>, including:</p>
<ul>
<li>Microsoft by ParAccel</li>
<li>Oracle by Aster Data, IBM, Oracle Exadata, probably Netezza, and probably Hadoop</li>
<li>Netezza by Greenplum</li>
<li>IBM by Teradata</li>
</ul>
<p>Carl Olofson pointed out on Twitter that <a href="http://www.oracle.com/us/corporate/Acquisitions/datascaler/index.html">DataScaler was an in-memory database technology just bought by Oracle</a>. This inspired me to google on them, and I found a sparse <a href="http://www.svadventure.com/">DataScaler CEO blog</a>. I link it because of an amusing juxtaposition &#8212; the second-to-last post says, in effect, &#8220;We make appliances and we recommend all these awesome technology design partners who helped us design the hardware,&#8221; while the very last post says &#8220;Designing our own hardware was a mistake.&#8221; <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p><a href="http://www.dbms2.com/2010/07/23/some-interesting-links/">Fred Holahan</a> is now VP of Marketing at <a href="http://www.dbms2.com/2010/05/25/voltdb-finally-launches/">VoltDB</a>, which is a lesson to me about giving free consulting &#8230; Anyhow, Fred tells me that VoltDB has about a dozen users on their way to production, some of whom are headed to being VoltDB paying customers, some of whom are not.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/10/22/notes-and-links-october-22-2010/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>eBay followup &#8212; Greenplum out, Teradata &gt; 10 petabytes, Hadoop has some value, and more</title>
		<link>http://www.dbms2.com/2010/10/06/ebay-followup-greenplum-out-teradata-10-petabytes-hadoop-has-some-value-and-more/</link>
		<comments>http://www.dbms2.com/2010/10/06/ebay-followup-greenplum-out-teradata-10-petabytes-hadoop-has-some-value-and-more/#comments</comments>
		<pubDate>Wed, 06 Oct 2010 13:21:00 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=3116</guid>
		<description><![CDATA[I chatted with Oliver Ratzesberger of eBay around a Stanford picnic table yesterday (the XLDB 4 conference is being held at Jacek Becla’s home base of SLAC, which used to stand for “Stanford Linear Accelerator Center”). Todd Walter of Teradata also sat in on the latter part of the conversation. Things I learned included:  eBay [...]]]></description>
			<content:encoded><![CDATA[<p>I chatted with Oliver Ratzesberger of eBay around a Stanford picnic table yesterday (the XLDB 4 conference is being held at Jacek Becla’s home base of SLAC, which used to stand for “Stanford Linear Accelerator Center”). Todd Walter of Teradata also sat in on the latter part of the conversation. Things I learned included:  <span id="more-3116"></span></p>
<ul>
<li><strong>eBay has thrown out Greenplum. </strong><em>(Edit: As per the comments below, eBay wouldn&#8217;t endorse that wording itself.) </em>eBay’s <a href="http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/">6 ½ petabyte Greenplum database</a> has turned into<strong> a &gt;10 petabyte Teradata database, which will grow 2 1/2x further in size soon.</strong>
<ul>
<li>Specifically, Oliver told me there are 8 petabytes of spinning disk, with 80% compression. So that’s 40 petabytes (8 &#247; 0.2) before you multiply by a reducing factor to cover mirroring, temp space, and so on. My low end for that factor would be 25-28%, which gives roughly 10 to 11 petabytes; my high end would be 35-40%, which gives 14 to 16 petabytes; either way, we’re talking about &gt;10 petabytes of true user data.</li>
<li>The 8 petabytes of spinning disk are headed to 20 petabytes next year.</li>
<li>Oliver gave the impression that Greenplum got thrown out more for reliability reasons than performance. (While eBay saw a major performance difference between Teradata and Greenplum, Oliver previously indicated he was inclined to attribute this more to <a href="http://www.dbms2.com/2009/04/28/data-warehouse-storage-options-cheap-expensive-or-solid-state-disk-drives/">specific Sun Thumper hardware/storage choices</a> than to software.)</li>
</ul>
</li>
<li>That database, called “Singularity,” has some interesting aspects – notably, <strong>a character field that’s a string of name-value pairs</strong> – on which you can do views and so on for virtual tables &#8212; in a table that otherwise has dozens of conventional relational columns.
<ul>
<li>The system ingests log data in the form of lots and lots of name-value pairs.</li>
<li>The most commonly found ones go into columns in the usual way.</li>
<li>The rest are strung together into, well, a character string.</li>
<li>Teradata has developed some features for eBay that make it easier to index, query, etc. on that character string of name-value pairs.</li>
</ul>
</li>
<li>eBay’s more EDW-like (Enterprise Data Warehouse) multi-petabyte Teradata database continues to grow, with the main system apparently up to <strong>4 ½ petabytes</strong> from the previous 2 ½.</li>
<li>I took the opportunity to ask <a href="http://www.dbms2.com/2009/06/08/the-future-of-data-marts/">what kinds of data marts (virtual or otherwise) were spun out</a> in practice.
<ul>
<li>In Oliver’s ranking,
<ul>
<li>#1 was derived data based on other data already in the data warehouse.</li>
<li>#2 was other data within eBay that had never been put into the data warehouse in the first place.</li>
<li>#3 was data truly from outside eBay.</li>
</ul>
</li>
<li>Todd Walter chimed in to point out that at other Teradata customers who perhaps didn’t have as fully fleshed out an EDW, #1 and #2 could be reversed.</li>
</ul>
</li>
<li>eBay sees Hadoop as an interesting tool for certain special purposes.
<ul>
<li>eBay likes Hadoop for certain tasks such as image analysis.<em> (Edit: And <a href="http://www.dbms2.com/2010/06/30/cloudera-enterprise-hadoop-evolution/">analysis of search results</a>.)</em></li>
<li>eBay doesn&#8217;t like Hadoop for anything that requires data movement, such as a join.</li>
<li>Similarly, eBay doesn&#8217;t like HBase.</li>
</ul>
</li>
<li>eBay is enamored of the idea of doing <strong>“social networking around analytics.”</strong>
<ul>
<li>This is something that has been built but not rolled out yet.</li>
<li>It seems more focused on actual business intelligence than on the underlying data, unlike <a href="http://www.dbms2.com/2010/04/12/greenplumchorus/">Greenplum Chorus</a>, which seems more focused on the databases themselves.</li>
<li>Since it hasn’t been rolled out yet, we don’t know which (if any) of activity streams, forums, or whatever will actually get significant adoption.</li>
</ul>
</li>
</ul>
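<p>The Singularity layout described above (common names promoted to columns, everything else packed into one character string) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names, the separators, and both helper functions are hypothetical, and none of this is Teradata&#8217;s or eBay&#8217;s actual code. It also assumes that names and values never contain the separator characters.</p>

```python
# Assumption: these stand in for the "most commonly found" names that get real columns.
COMMON_COLUMNS = ("page_id", "user_id", "test_id")


def split_event(pairs, common=COMMON_COLUMNS, pair_sep=";", kv_sep="="):
    """Return (fixed columns, packed string) for one event's name-value pairs."""
    # Common names become conventional columns; absent ones are stored as None.
    columns = {name: pairs.get(name) for name in common}
    # Everything else is strung together into a single character field.
    rest = sorted((k, v) for k, v in pairs.items() if k not in common)
    packed = pair_sep.join(f"{k}{kv_sep}{v}" for k, v in rest)
    return columns, packed


def lookup(packed, name, pair_sep=";", kv_sep="="):
    """Pull one value back out of the packed string, or None if it is absent."""
    for pair in packed.split(pair_sep):
        key, sep, value = pair.partition(kv_sep)
        if sep and key == name:
            return value
    return None


columns, packed = split_event(
    {"page_id": "p1", "user_id": "u9", "ad_slot": "top", "rank_3": "item77"}
)
print(columns)                   # {'page_id': 'p1', 'user_id': 'u9', 'test_id': None}
print(packed)                    # ad_slot=top;rank_3=item77
print(lookup(packed, "rank_3"))  # item77
```

<p>The linear scan in <code>lookup</code> hints at why indexing and query support on that string would matter: without it, every query on a rare attribute has to parse the whole field for every row.</p>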
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/10/06/ebay-followup-greenplum-out-teradata-10-petabytes-hadoop-has-some-value-and-more/feed/</wfw:commentRss>
		<slash:comments>28</slash:comments>
		</item>
		<item>
		<title>Nested data structures keep coming up, especially for log files</title>
		<link>http://www.dbms2.com/2010/07/31/nested-data-structures-keep-coming-up-especially-for-log-files/</link>
		<comments>http://www.dbms2.com/2010/07/31/nested-data-structures-keep-coming-up-especially-for-log-files/#comments</comments>
		<pubDate>Sat, 31 Jul 2010 10:42:06 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Theory and architecture]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2723</guid>
		<description><![CDATA[Nested data structures have come up several times now, almost always in the context of log files. Google has published about a project called Dremel. Per Tasso Argyros, one of Dremel&#8217;s key concepts is nested data structures. Those arrays that the XLDB/SciDB folks keep talking about are meant to be nested data structures. Scientific data [...]]]></description>
			<content:encoded><![CDATA[<p>Nested data structures have come up several times now, almost always in the context of log files.</p>
<ul>
<li>Google has published about a project called <a href="http://www.asterdata.com/blog/index.php/2010/07/19/google%E2%80%99s-dremel-%E2%80%93-or-can-mapreduce-itself-handle-fast-interactive-querying/">Dremel</a>. Per Tasso Argyros, one of Dremel&#8217;s key concepts is nested data structures.</li>
<li>Those <a href="http://www.dbms2.com/2009/10/03/issues-in-scientific-data-management/">arrays</a> that the XLDB/SciDB folks keep talking about are meant to be nested data structures. Scientific data is of course log-oriented. <a href="http://www.dbms2.com/2010/05/22/scidb-and-scientific-database-management/">eBay was very interested in that project too</a>.</li>
<li>Facebook&#8217;s log files have a big nested data structure flavor.</li>
</ul>
<p>I don&#8217;t have a grasp yet on what exactly is happening here, but it&#8217;s something.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/07/31/nested-data-structures-keep-coming-up-especially-for-log-files/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Cloudera Enterprise and Hadoop evolution</title>
		<link>http://www.dbms2.com/2010/06/30/cloudera-enterprise-hadoop-evolution/</link>
		<comments>http://www.dbms2.com/2010/06/30/cloudera-enterprise-hadoop-evolution/#comments</comments>
		<pubDate>Wed, 30 Jun 2010 17:22:27 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Data integration and middleware]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Pricing]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2440</guid>
		<description><![CDATA[I talked with Cloudera a couple of weeks ago in connection with the impending release of Cloudera Enterprise. I&#8217;d say:  If you are or want to be a serious MapReduce user – and you&#8217;re past the “play around over the weekend” stage &#8212; you probably should have either: A serious non-DBMS MapReduce distribution. MapReduce integrated [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">I talked with Cloudera a couple of weeks ago in connection with the impending release of Cloudera Enterprise. I&#8217;d say:  <span id="more-2440"></span></p>
<ul>
<li>If you are or want to be a serious MapReduce user – and you&#8217;re past the “play around over the weekend” stage – you probably should have either:
<ul>
<li>A serious non-DBMS MapReduce 	distribution.</li>
<li>MapReduce integrated into your 	analytic DBMS.</li>
<li>Both.</li>
</ul>
</li>
<li>The obvious choice for non-DBMS 	MapReduce is Hadoop.</li>
<li>The obvious choice for a Hadoop 	distribution is <strong>Cloudera Enterprise.</strong></li>
<li>Cloudera Enterprise has three main 	aspects, in an inseparable bundle:
<ul>
<li>Distributions for a double-digit 	number of open source projects. It&#8217;s nice having all that in one 	package – unless, of course, you like playing with Tinkertoys.</li>
<li>Proprietary Cloudera code.</li>
<li>Cloudera support.</li>
</ul>
</li>
<li>Cloudera says its proprietary code is, and is planned to remain, concentrated – at least in large part – on integrating open source technology with closed source products. This has the virtue of being targeted directly at the segment of the market that has proven it&#8217;s actually willing to pay money for software.</li>
<li>Cloudera Enterprise areas of 	focus, now and in the presumed future, include:
<ul>
<li><strong>Core Hadoop engine,</strong> which 	Cloudera says is quite predictably and appropriately evolving more 	slowly than the tools around it.</li>
</ul>
<ul>
<li><strong>Development, management and 	administrative tools,</strong> including:
<ul>
<li><strong>Pig</strong> and <strong>Hive</strong>. Cloudera says &gt;70% 	of Facebook Hadoop jobs are initiated through Hive, and the same is 	true of Yahoo and Pig.</li>
<li>Connectivity to commercial tools.</li>
<li>The product formerly known as 	“Cloudera Desktop.”</li>
</ul>
</li>
<li><strong>Workflow</strong>, which in this context 	refers to letting you create a Hadoop application as a sequence of 	small steps, rather than forcing you to kluge it into being one 	unwieldy thing. At the moment, this is much less widely adopted than 	Pig and Hive, but Cloudera has high hopes for it, because of its 	obvious benefits in modularity and manageability.</li>
<li><strong>Quasi-DBMS technology.</strong> Besides Hive and Pig, this includes <strong>HBase.</strong> Cloudera says there has 	been considerable demand for HBase, and it is pleased that the project 	is now mature enough to ship. Cloudera stresses that it intends 	HBase not for OLTP, but as an adjunct to analytic processing. E.g., 	Cloudera suggests HBase would be a fine vehicle for replicating 	dimension tables across each node of a cluster.</li>
<li><strong>Data connectivity, </strong><span style="font-weight: normal;">e.g. 	to MySQL or to sensor log files.</span></li>
</ul>
</li>
<li>Cloudera Enterprise pricing is 	well below DBMS prices – not by a full order of magnitude, if I&#8217;m 	right about everybody&#8217;s quantity discount policies, but even so by a 	lot. Details are NDA.</li>
</ul>
<p style="margin-bottom: 0in;">Cloudera sometimes sends confusing signals about its beliefs and strategies. For example, one can get different stories depending on whether one talks to:</p>
<ul>
<li>Somebody at Cloudera who comes 	primarily from the user and open source communities.</li>
<li>Somebody at Cloudera who has 	actually worked at a software company before.</li>
</ul>
<p style="margin-bottom: 0in;">But I predict that Cloudera will now stick for a while with more or less the strategy outlined above.</p>
<p style="margin-bottom: 0in;">Naturally, we also talked about Hadoop adoption. Highlights of that part – no doubt somewhat biased towards Cloudera&#8217;s own customer base – included:</p>
<ul>
<li>Notwithstanding <a href="http://www.dbms2.com/2009/04/14/ebay-thinks-mpp-dbms-clobber-mapreduce/">eBay&#8217;s prior 	skepticism about MapReduce</a>, it is quoted saying nice things in a Cloudera press release, 	and has apparently become quite a large Hadoop user, starting out 	with a search-quality use case.</li>
<li>Typical Hadoop deployment sizes 	are 10 nodes or so when experimenting, 80-500+ in production.</li>
<li>10 terabytes/node – I&#8217;m pretty sure Cloudera meant of user data – is not inconceivable, so a cost-conscious 500-node user could have 5 petabytes of data managed by Hadoop.</li>
<li>Cloudera has half a dozen 	customers at the 75+ node production level.</li>
<li>Web and financial services are the 	two vertical markets moving most aggressively into Hadoop 	production. The government is also in significant Hadoop production, 	but the details of that are classified.</li>
<li>Web uses for Hadoop include:
<ul>
<li>Clickstream – sessionization, 	etc. – that&#8217;s a super-mainstream use.</li>
<li>Search – analyzing search 	attempts in conjunction with structured data.</li>
<li>Machine learning (for ad serving, 	etc.).</li>
</ul>
</li>
<li>Financial services uses for Hadoop 	include:
<ul>
<li>Internal trading rule 	enforcement/fraud detection.</li>
<li>Complex ETL.</li>
<li>Portfolio risk assessment 	(typically overnight).</li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;">None of this is inconsistent with previous surveys of <a href="http://www.dbms2.com/2009/10/10/enterprises-using-hadoo/">Hadoop use cases</a>.</p>
<p style="margin-bottom: 0in; font-style: normal;">Various users talked at the Hadoop Summit this week. I wasn&#8217;t there, and won&#8217;t write about their stories for now. That said, <a href="http://www.slideshare.net/kevinweil/hadoop-at-twitter-hadoop-summit-2010">Twitter&#8217;s slide deck</a> from the Summit has some interesting stuff, including:</p>
<ul>
<li><span style="font-style: normal;">7 	TB/day ETLed from MySQL.</span></li>
<li><span style="font-style: normal;">Petabytes of stored data are accordingly coming soon.</span></li>
<li><span style="font-style: normal;">Open 	sourcing their ETL tool Crane.</span></li>
<li><span style="font-style: normal;">3-4X 	LZO compression at little CPU cost.</span></li>
<li><span style="font-style: normal;">HBase is more usable for them than HDFS, which isn&#8217;t mutable enough.</span></li>
<li><span style="font-style: normal;">Pig takes 5% of the code and coding effort of vanilla Hadoop, at a performance hit of 30% or less.</span></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/06/30/cloudera-enterprise-hadoop-evolution/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Notes on SciDB and scientific data management</title>
		<link>http://www.dbms2.com/2010/05/22/scidb-and-scientific-database-management/</link>
		<comments>http://www.dbms2.com/2010/05/22/scidb-and-scientific-database-management/#comments</comments>
		<pubDate>Sat, 22 May 2010 08:04:24 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[GIS and geospatial]]></category>
		<category><![CDATA[Microsoft and SQL*Server]]></category>
		<category><![CDATA[SciDB]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2178</guid>
		<description><![CDATA[I firmly believe that, as a community, we should look for ways to support scientific data management and related analytics. That&#8217;s why, for example, I went to XLDB3 in Lyon, France at my own expense. Eight months ago, I wrote about issues in scientific data management. Here&#8217;s some of what has transpired since then. The [...]]]></description>
			<content:encoded><![CDATA[<p>I firmly believe that, as a community, we should look for ways to support scientific data management and related analytics. That&#8217;s why, for example, I went to XLDB3 in Lyon, France at my own expense. Eight months ago, I wrote about <a href="http://www.dbms2.com/2009/10/03/issues-in-scientific-data-management/">issues in scientific data management</a>. Here&#8217;s some of what has transpired since then.</p>
<p>The main new activity I know of has been in the open source <a href="http://www.scidb.org/">SciDB</a> project.   <span id="more-2178"></span></p>
<ul>
<li>A company called Zetics has been started to commercialize SciDB. As of now, the entire staff seems to be CEO Marilyn Matz, techie Paul Brown, and part of Mike Stonebraker. Marilyn says Zetics has some venture capital, but even under NDA didn&#8217;t tell me who it was from. Zetics does not have its own web site.</li>
<li>Marilyn tells me there are 20-25 contributors to SciDB, led by Paul Brown and Mike Stonebraker. Brown is full-time. Persistent Systems has been donating the efforts of a few of its employees. Some <a href="http://www.lsst.org/lsst">LSST</a> folks have been doing SciDB work backed by grant money. Most or all of the rest seem to be purer volunteers. Some Russians have been particularly active.</li>
<li>Release 0.5 of SciDB is expected in June. Release 1.0 is expected in September. This is a rewrite; prior demo code has been scrapped. Perhaps not coincidentally, it&#8217;s also a small slip from prior project plans.</li>
<li>The array data model is an example of what&#8217;s being implemented first. (Duh &#8212; you can&#8217;t have a DBMS without a data model.) Support for uncertainty is an example of what&#8217;s been deferred until later.</li>
<li>As has been clear since XLDB3 last August, one major target market for SciDB is genomic research.</li>
<li>It&#8217;s obvious that the oil and gas industry, with all its geospatial data, should be interested in SciDB. But there&#8217;s not much activity in that regard; outreach is evidently needed. If you can think of somebody in that sector (or anywhere else) who should be alerted to SciDB, please ping them.</li>
<li>Interest from web analytics users in SciDB seems to have receded a bit from the days when eBay almost funded the project.</li>
</ul>
<p>In other scientific data management news,</p>
<ul>
<li>Microsoft put out a book called <a href="http://research.microsoft.com/en-us/collaboration/fourthparadigm/">The Fourth Paradigm</a> on scientific database management. The whole thing can be downloaded, very officially, as a giant PDF. I think it&#8217;s worth skimming. I don&#8217;t think it&#8217;s worth actually reading. (I did read it.)</li>
<li><a href="http://www-conf.slac.stanford.edu/xldb/">XLDB4</a> will be at Stanford October 5-7. Unlike prior XLDBs, it will have an open (i.e., no invitation required) part.</li>
</ul>
<p>Finally, you surely are aware of the whole &#8220;Climategate&#8221; mess, in which major climate researchers&#8217; email was hacked and many unkind conclusions were drawn. Well, one of the most technical parts of the disclosure was in a long series of Read Me files, in which an unfortunate programmer lamented <a href="http://di2.nu/foia/HARRY_READ_ME-20.html">the difficulty of reconstructing published results from files at hand</a>. These turned out to illustrate a classic problem that SciDB or alternatives are meant to solve:</p>
<ul>
<li>Raw data was impossible to use without various adjustments to regularize it (the word &#8220;regridding&#8221; comes up a lot, for example). Massaging was needed before analytics could be done on it.</li>
<li>The raw data was thrown out or lost, and could not be reconstructed (why they couldn&#8217;t have asked the suppliers of the data to give it to them again was unclear in this case, since it wasn&#8217;t original experimental data).</li>
<li>It was thus impossible to massage the data in any new or improved way.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/05/22/scidb-and-scientific-database-management/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Teradata&#8217;s nebulous cloud strategy</title>
		<link>http://www.dbms2.com/2009/10/27/teradatas-nebulous-cloud-strategy/</link>
		<comments>http://www.dbms2.com/2009/10/27/teradatas-nebulous-cloud-strategy/#comments</comments>
		<pubDate>Tue, 27 Oct 2009 19:41:47 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Data integration and middleware]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[Theory and architecture]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1180</guid>
		<description><![CDATA[As the pun goes, Teradata&#8217;s cloud strategy is – well, it&#8217;s somewhat nebulous. More precisely, for the foreseeable future, Teradata&#8217;s cloud strategy is a collection of rather disjointed parts, including: What Teradata calls the Teradata Agile Analytics Cloud, which is a combination of previously existing technology plus one new portlet called the Teradata Elastic Mart(s) [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">As the pun goes, Teradata&#8217;s cloud strategy is – well, it&#8217;s somewhat nebulous. More precisely, for the foreseeable future, Teradata&#8217;s cloud strategy is a collection of rather disjointed parts, including:</p>
<ul>
<li>What Teradata calls the <em>Teradata Agile Analytics Cloud,</em> which is a combination of previously existing technology plus one new portlet called the <em>Teradata Elastic Mart(s) Builder.</em> (Teradata&#8217;s <em>Elastic Mart(s) Builder Viewpoint</em> portlet is available for download from <a href="../2009/05/26/teradata-developer-exchange-devx-begins-to-emerge/">Teradata&#8217;s Developer Exchange</a>.)</li>
<li><em>Teradata Data Mover 2.0,</em> coming “Soon”, which will ease copying (ETL without any 	significant “T”) from one Teradata system to another.</li>
<li><em>Teradata Express</em> DBMS 	crippleware (1 terabyte only, no production use), now available on 	Amazon EC2 and VMware. (I don&#8217;t see where this has much connection to the rest of Teradata&#8217;s cloud strategy, except insofar as it serves to fill out a slide.)</li>
<li>Unannounced (and so far as I can 	tell largely undesigned) future products.</li>
</ul>
<p style="margin-bottom: 0in;">Teradata openly admits that its direction is heavily influenced by Oliver Ratzesberger at <a href="../2009/04/30/ebays-two-enormous-data-warehouses/">eBay</a>. Like Teradata, Oliver and eBay favor virtual data marts over physical ones. That is, Oliver and eBay believe that the ideal scenario is that every piece of data is only stored once, in an integrated Teradata warehouse. But eBay believes and Teradata increasingly agrees that users need a great deal of control over their use of this data, including the ability to import additional data into private sandboxes, and join it to the warehouse data already there.<span id="more-1180"></span></p>
<p style="margin-bottom: 0in;">The <em>Teradata Elastic Mart(s) Builder Viewpoint</em> portlet automates the inclusion of outside data. If you&#8217;re already an authorized Teradata data warehouse user, you can fill in a very short form (three or so fields) and add authorization to import outside data, e.g. from a .CSV file. No fuss, little bother. Trivial as that sounds, when you combine it with Teradata&#8217;s pre-existing robust workload management tools, it creates a pretty good <em>virtual data mart</em> story.</p>
<p style="margin-bottom: 0in;">Spinning out and maintaining consistency with physical data marts is a different matter. Teradata doesn&#8217;t seem too sure it believes in those. And while Teradata is obviously planning to increase its capability in that regard anyway, I didn&#8217;t get a lot of detail beyond the reference to Data Mover 2.0.</p>
<p style="margin-bottom: 0in;"><em><strong>Related links</strong></em></p>
<ul>
<li>My Greenplum-inspired post on <a href="../2009/06/08/the-future-of-data-marts/">the 	future of data marts</a>, outlining issues in “private cloud” 	data warehousing.</li>
<li>eBay&#8217;s “<a href="http://www.xlmpp.com/articles/16-articles/39-analytics-as-a-service">Analytics 	as a Service</a>” pitch (about 1 ½ years old)</li>
<li><a href="http://developer.teradata.com/database/articles/what-is-the-teradata-agile-analytics-cloud">A post by Teradata&#8217;s Dan Graham</a> explaining the <em>Teradata Agile Analytics Cloud</em> and <em>Elastic Mart(s) Builder Viewpoint</em> portlet</li>
<li>Home page and complete screen shot 	for the <a href="http://developer.teradata.com/download/viewpoint/elastic-marts-builder"><em>Teradata 	Elastic Mart(s) Builder Viewpoint</em> portlet</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/27/teradatas-nebulous-cloud-strategy/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Introduction to the XLDB and SciDB projects</title>
		<link>http://www.dbms2.com/2009/09/12/xldb-scid/</link>
		<comments>http://www.dbms2.com/2009/09/12/xldb-scid/#comments</comments>
		<pubDate>Sat, 12 Sep 2009 19:54:51 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[Michael Stonebraker]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Theory and architecture]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=883</guid>
		<description><![CDATA[Before I write anything else about the overlapping efforts known as XLDB and SciDB, I probably should explain and disambiguate what they are as best I can. XLDB was organized and still is run by guys who want to solve a scientific problem in eXtremely Large DataBase Management, most especially Jacek Becla of SLAC (the [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Before I write anything else about the overlapping efforts known as <em>XLDB</em> and <em>SciDB</em>, I probably should explain and disambiguate what they are as best I can.  XLDB was organized and still is run by guys who want to solve a scientific problem in eXtremely Large DataBase Management, most especially Jacek Becla of SLAC (the organization previously known as Stanford Linear Accelerator Center). Becla&#8217;s original motivation was that he needs a DBMS to manage what will be 55 petabytes of raw image data and 100 petabytes of astronomical data total for <a href="http://www.lsst.org/lsst">LSST</a> (Large Synoptic Survey Telescope).<span id="more-883"></span></p>
<p style="margin-bottom: 0in;">XLDB more or less comprises:</p>
<ul>
<li>A series of what have now been three workshops: <span style="font-style: normal;"><a href="http://www-conf.slac.stanford.edu/xldb07/">XLDB1 in 2007</a>, <a href="http://www-conf.slac.stanford.edu/xldb08/">XLDB2 in 2008</a>, and <a href="http://www-conf.slac.stanford.edu/xldb09/default.htm">XLDB3 in 2009</a></span> (the closest thing to a master link is probably the <a href="http://www-conf.slac.stanford.edu/xldb09/links.htm">XLDB3 site&#8217;s related link page</a>). Participants have included, among others:
<ul>
<li>A lot of big-name 	database-oriented computer science researchers &#8212; Mike Stonebraker, 	Dave DeWitt, Martin Kersten, and numerous others</li>
<li>Academics responsible for 	scientific database management, especially but not only in the 	astronomy area</li>
<li>Some vendors (although vendor 	participation was cut back after XLDB1) &#8212; at XLDB3, which is the 	one I went to, the three vendor folks who actually talked were 	Stephen Brobst of Teradata, Luke Lonergan of Greenplum (who worked 	in scientific high performance computing earlier in his career), and 	Jeff Hammerbacher of Cloudera.</li>
<li>eBay and to some extent other 	large web companies</li>
<li>A European Union funding 	bureaucrat</li>
<li>Me</li>
</ul>
</li>
<li>An attempt to kick start a broader 	movement, perhaps comprising (it&#8217;s not totally clear yet):
<ul>
<li>Computer science researchers 	interested in database issues</li>
<li>Database technology vendors</li>
<li>Scientific researchers (academic) 	who have very large or otherwise difficult database management 	problems</li>
<li>Scientific researchers 	(commercial) who have very large or otherwise difficult database 	management problems</li>
<li>Other commercial users who have 	very large database management problems</li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;">The first result or spin-out from the XLDB effort seems to have been the <a href="http://www.scidb.org/">SciDB</a> project. This is an effort to build an open source DBMS called SciDB that will address <strong>some</strong> of the needs the XLDB effort is uncovering. (More on that in other posts.) Somewhat confusingly, <strong>all</strong> the use cases the XLDB group is collecting are currently being posted on SciDB&#8217;s website, apparently because it&#8217;s glitzier and healthier than, say, the excessively sparse XLDB wiki. Some SciDB development has happened, but no large sugar daddy has yet been found. (It&#8217;s a fairly open secret that eBay looked seriously and favorably at funding SciDB before the economic downturn hit.)</p>
<p style="margin-bottom: 0in;">Numerous big-name computer scientists are associated with SciDB, indeed more closely (it would seem) than with XLDB. That said, I&#8217;m guessing Dave DeWitt&#8217;s involvement in the open-source SciDB isn&#8217;t what it would be if he hadn&#8217;t gone to Microsoft. DeWitt actually skipped XLDB3, although he was in town for VLDB. (XLDB3 was back-to-back with VLDB 2009 in Lyon, France in late August.) Stonebraker just didn&#8217;t make the flight for either conference, due to the double-knee &#8220;upgrade&#8221; he had back in March.</p>
<p style="margin-bottom: 0in;">There&#8217;s a lot more to be said about the cross-discipline or science-specific requirements that researchers place on data management, but I&#8217;ll leave that for later and just get this posted as a start &#8212; assuming, of course, that <a href="http://www.dbms2.com/2009/09/12/availability-nightmares-continue/">blog outages</a> permit. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_sad.gif' alt=':(' class='wp-smiley' /> </p>
<p style="margin-bottom: 0in;"><em><strong>Related links</strong></em></p>
<ul>
<li><a href="http://www-db.cs.wisc.edu/cidr/cidr2009/Paper_26.pdf">Paper 	laying out the SciDB project</a></li>
<li><a href="http://database.cs.brown.edu/projects/scidb/">One 	version of a SciDB overview page</a>, with links to academic papers</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/09/12/xldb-scid/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

