<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; Scientific research</title>
	<atom:link href="http://www.dbms2.com/category/applications/scientific-research/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Thu, 09 Feb 2012 09:21:51 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>MarkLogic 5, and why you might care</title>
		<link>http://www.dbms2.com/2011/11/01/marklogic-version-5/</link>
		<comments>http://www.dbms2.com/2011/11/01/marklogic-version-5/#comments</comments>
		<pubDate>Tue, 01 Nov 2011 04:03:59 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Structured documents]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5560</guid>
		<description><![CDATA[MarkLogic is releasing MarkLogic 5. Key elements of the announcement are: More-of-the-same in line with MarkLogic’s core positioning. A new bi-directional Hadoop connector. A free MarkLogic Express edition, limited in license terms more than in actual features, as per Slide 27 of the deck MarkLogic graciously supplied for me to post. Also, MarkLogic is early [...]]]></description>
			<content:encoded><![CDATA[<p>MarkLogic is releasing MarkLogic 5. Key elements of the announcement are:</p>
<ul>
<li>More-of-the-same in line with MarkLogic’s core positioning.</li>
<li>A new bi-directional Hadoop connector.</li>
<li>A free MarkLogic Express edition, limited in license terms more than in actual features, as per Slide 27 of <a href="http://www.monash.com/uploads/MarkLogic-5-Deck.pptx">the deck MarkLogic graciously supplied for me to post</a>.</li>
</ul>
<p>Also, MarkLogic is early with a feature that most serious DBMS vendors will soon have – support for tiered storage, with writes going first to solid-state storage, then being flushed to disk via a caching-style algorithm.* And as befits a sometime search-engine-substitute, MarkLogic has finally licensed a large set of document filters, from an Australian company called <a href="http://www.isys-search.com/index.html">Isys</a>. Apparently, the special virtue of the Isys filters is that they’re good at extracting not only text, but metadata as well.</p>
<p><em>*If there’s a caching algorithm that doesn’t contain a major element of LRU (Least Recently Used), I don’t recall ever hearing about it.</em></p>
<p>MarkLogic seems to have settled on a positioning that, although distressingly buzzword-heavy, is at least partly based upon reality. The real part includes:</p>
<ul>
<li>MarkLogic is a serious, enterprise-class DBMS (see for example Slide 12 of <a href="http://www.monash.com/uploads/MarkLogic-5-Deck.pptx">the MarkLogic deck</a>) …</li>
<li>… which has been optimized from the get-go for <a href="../../../../../2011/05/17/poly-structured-database/">poly-structured data</a>.</li>
<li>MarkLogic can and does scale out to handle large amounts of data.</li>
<li>MarkLogic is a general-purpose DBMS, suitable for <a href="../../../../../2011/03/30/short-request-and-analytic-processing/">both short-request and analytic tasks</a>.</li>
<li>MarkLogic is particularly well suited for analyses with long chains of “progressive enhancement” (MarkLogic’s favorite term when talking about <a href="../../../../../2011/05/30/another-category-of-derived-data/">derived data</a>).</li>
<li><a href="http://blogs.avalonconsult.com/blog/search/is-marklogic-a-search-engine/">MarkLogic often plays the role of a content assembler and/or search engine</a>, and the people who use MarkLogic in those ways are commonly doing things that can be described as research and analysis.</li>
</ul>
<p>Based on that reality, MarkLogic talks a lot about Volume, Velocity, Variety, Big Data, unstructured data, semi-structured data, and big data analytics.</p>
<p><span id="more-5560"></span><em>My <a href="../../../../../2010/11/29/marklogic-and-its-document-dbms/">November, 2010 overview of MarkLogic technology</a> remains pretty relevant. One correction, however: Node heterogeneity configurations, in which “data” and “evaluation” nodes reside on separate servers, are the exception rather than the rule.</em></p>
<p>Like <a href="../../../../../2011/10/18/vertica-community-edition/">Vertica</a>, MarkLogic has laudably said that true academic researchers can get MarkLogic for free without the severe license restrictions. Free MarkLogic should be of particular interest to researchers who:</p>
<ul>
<li>Are studying natural networks or graphs, such as social networks or biological pathways. (This might be a fit in the social or biological sciences.)</li>
<li>Are managing metadata for, say, a variety of disparate kinds of experimental files. (This might be a fit anywhere in the natural sciences.)</li>
<li>Are managing actual documents, images, videos, etc., or data about such things. (This might be a fit in the humanities or social sciences.)</li>
</ul>
<p>MarkLogic provided some disclosable financial substance by email, which I shall quote verbatim:</p>
<ul>
<li><em>MarkLogic      has 45% revenue growth and 55-60% license growth year over year.</em></li>
<li><em>We      expect to finish this year with over $85 million in revenue, up from $55      million last year.</em></li>
</ul>
<p>Arithmetical purists might note that 85/55 is about 155%, which implies growth nearer 55% than 45%, but I’m just going to settle for the information I got and move on.</p>
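<p><em>For anyone who wants to check that arithmetic, here is a minimal Python sketch. The function name is made up for illustration, and the only inputs are the two revenue figures quoted above. Since the email said “over $85 million”, the result is best read as a floor rather than an exact growth rate.</em></p>
<pre>
```python
def growth_percent(previous, current):
    """Year-over-year growth, as a percentage of the previous figure."""
    return (current - previous) / previous * 100


# Revenue figures from the quoted email, in millions of dollars.
last_year = 55
this_year = 85  # stated as "over $85 million", so treat this as a minimum

ratio = this_year / last_year * 100
growth = growth_percent(last_year, this_year)
print(f"ratio: {ratio:.1f}%, implied growth: {growth:.1f}%")
```
</pre>
<p><em>On those inputs the ratio comes out at about 154.5% and the implied growth at about 54.5%, which sits closer to the quoted license growth than to the quoted 45% revenue growth.</em></p>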
<p><em>Edit: I posted separately about the <a href="http://www.dbms2.com/2011/11/03/marklogic-hadoop-connector/">MarkLogic Hadoop connector.</a></em> <span style="text-decoration: line-through;">As for that Hadoop connector – stay tuned for a short follow-up post, as writing about it now would not be convenient. (My backup discipline isn’t what it should be, and the only copy of my notes about that product is on a heavy tower computer in a house that doesn’t have working power.)</span></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/01/marklogic-version-5/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Commercial software for academic use</title>
		<link>http://www.dbms2.com/2011/10/14/commercial-software-for-academic-use/</link>
		<comments>http://www.dbms2.com/2011/10/14/commercial-software-for-academic-use/#comments</comments>
		<pubDate>Fri, 14 Oct 2011 06:21:21 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Infobright]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Scientific research]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5483</guid>
		<description><![CDATA[As Jacek Becla explained: Academic scientists like their software to be open source, for reasons that include both free-like-speech and free-like-beer. What&#8217;s more, they like their software to be dead-simple to administer and use, since they often lack the dedicated human resources for anything else. Even so, I think that academic researchers, in the natural [...]]]></description>
			<content:encoded><![CDATA[<p>As <a href="../../../../../2009/10/04/jacek-becla-on-issues-in-scientific-data-management/">Jacek Becla</a> explained:</p>
<ul>
<li>Academic scientists like their software to be open source, for reasons that include both free-like-speech and free-like-beer.</li>
<li>What&#8217;s more, they like their software to be dead-simple to administer and use, since they often lack the dedicated human resources for anything else.</li>
</ul>
<p>Even so, I think that <strong>academic researchers,</strong> in the natural and social sciences alike, <strong>commonly overlook the wealth of commercial software</strong> that could help them in their efforts.</p>
<p>I further think that <strong>the commercial software industry could do a better job of exposing its work to academics,</strong> where by &#8220;expose&#8221; I mean:</p>
<ul>
<li>Give your stuff to academics for <strong>free.</strong></li>
<li>Call their attention to your free offering.</li>
</ul>
<p>Reasons to do so include:</p>
<ul>
<li><strong>Public benefit.</strong> Scientific research is important.</li>
<li><strong>Training future customers.</strong> There&#8217;s huge academic/commercial crossover, especially as students join the for-profit workforce.</li>
</ul>
<p><span id="more-5483"></span>The biggest issue is probably <strong>large-scale database management.</strong> There&#8217;s a feeling, permeating for example parts of the <a href="../../../../../2011/09/20/xldb-the-one-conference-i-like-to-go-to/">XLDB conference</a> and the associated SciDB project, that data stores suitable for holding large amounts of data are either:</p>
<ul>
<li>Hadoop or</li>
<li>Forbiddingly expensive.</li>
</ul>
<p>I think that&#8217;s overstated. In particular:</p>
<ul>
<li>You can put &gt;10 terabytes of machine-generated data (or any other kind) into Infobright and have it well taken care of; Infobright is open source.</li>
<li>You can put &gt;1 petabyte into [name redacted],* among others; [name redacted]* should be out soon with a generously free offering for academic users. <em>Edit: That would be <a href="http://www.dbms2.com/2011/10/18/vertica-community-edition/">Vertica</a>.</em></li>
<li>Conventional relational queries, graph analysis, statistical analysis preparation and more can all be much faster in a good analytic DBMS than in alternative kinds of data stores.</li>
<li>Integration between SQL and other analytic languages is ever improving, as analytic DBMS evolve into &#8220;<a href="../../../../../2011/02/24/analytic-platforms/">analytic platforms</a>&#8221;.</li>
</ul>
<p><em>*My permission to use the name was yanked after this post was largely drafted. I&#8217;m sufficiently pleased with the forthcoming offering itself that I can&#8217;t get upset about the procedural confusion.</em></p>
<p>With a couple of exceptions, the <strong>statistics/predictive analytics</strong> situation seems more reasonable. Industry leaders such as SAS Institute and SPSS (now an IBM company) have engaged in varying degrees of academic outreach. R is in the process of crossing over from academia to business.</p>
<p><em>Unfortunately, I know next to nothing about Stata or, elsewhere in the technical languages area, Mathworks/Matlab. (Who knew that Mathworks was a <a href="http://www.mathworks.com/company/aboutus/">$600 million company</a>, local to my geographical area?)</em></p>
<p>One statistical tool that should perhaps be more present in academia is KXEN. KXEN seems to have some nice differentiation in not making you understand in advance which of your variables are most important. Econometricians and others with large numbers of independent variables might wish to take note.</p>
<p><em>If you think the true situation is nonlinear, and you&#8217;re trying to approximate it with linear models, you almost always have a large number of variables to consider. True, monomials in independent variables aren&#8217;t actually independent, but it might be interesting to pretend that they are and see if any insights fall out that could help in more rigorous analysis.</em></p>
<p>I&#8217;d further argue that, as part of neglecting commercial analytic DBMS, the scientific community in particular neglects the potential of <strong>integrated analytic platforms. </strong>Admittedly, the early leaders in that area &#8212; Aster Data, perhaps followed by Netezza (now an IBM company) &#8212; aren&#8217;t exactly priced in an academic-friendly way. But Vertica, EMC Greenplum, et al. are playing catch-up with analogous technology, and they&#8217;re more likely to offer appealing academic pricing.</p>
<p>There&#8217;s also the <a href="../../../../../2011/03/03/investigative-analytics/">investigative analytics</a> side of business intelligence, especially in the area of visualization/discovery. While Spotfire (now a TIBCO company) got much of its start in research-oriented areas, the otherwise more visible &#8212; no pun intended &#8212; QlikTech and Tableau don&#8217;t seem to have done much in academia. Datameer and yet-younger Hadoop-oriented business intelligence startups don&#8217;t seem to be doing much on the academic front either, more&#8217;s the pity.</p>
<p>Frankly, <strong>I think that most scientific analytic technology needs are also found in the business world.*</strong> That convergence will only get closer as businesses focus more on <a href="../../../../../2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a>. Commercial software companies should pay more attention to scientists, and scientists should gaze out more often from their ramshackle, budget-constrained ivory towers.</p>
<p><em>*The converse isn&#8217;t as true. Businesses have issues not well reflected in science, derived (for example) from the complexity of their transactional schemas, or from office-politics considerations around &#8220;one version of the truth&#8221;.</em></p>
<p><strong><em>Edit: Some links that seem relevant to this year&#8217;s XLDB program</em></strong></p>
<ul>
<li><a href="http://www.dbms2.com/2011/09/05/zynga-linkedin-data-warehous/">Zynga and LinkedIn</a></li>
<li><a href="http://www.dbms2.com/2010/06/19/objectivity-infinite-graph/">Objectivity Infinite Graph</a></li>
<li><a href="http://www.dbms2.com/2010/10/06/ebay-followup-greenplum-out-teradata-10-petabytes-hadoop-has-some-value-and-more/">eBay as of last year&#8217;s XLDB</a> (the most expensive blog post I ever wrote, in light of Greenplum&#8217;s subsequent response)</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/10/14/commercial-software-for-academic-use/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>IBM is buying parallelization expert Platform Computing</title>
		<link>http://www.dbms2.com/2011/10/11/ibm-is-buying-parallelization-expert-platform-computing/</link>
		<comments>http://www.dbms2.com/2011/10/11/ibm-is-buying-parallelization-expert-platform-computing/#comments</comments>
		<pubDate>Tue, 11 Oct 2011 16:13:05 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Scientific research]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5473</guid>
		<description><![CDATA[IBM is acquiring Platform Computing, a company with which I had one briefing, last August. Quick background includes:  Platform Computing started ~20 years ago. Platform Computing claimed close to $100 million in revenue and &#62;500 people. (This is Platform Computing&#8217;s most famous splash to date.) Platform Computing technology underlies SAS Institute&#8217;s preferred method of parallelization, [...]]]></description>
			<content:encoded><![CDATA[<p>IBM is acquiring Platform Computing, a company with which I had one briefing, last August. Quick background includes:  <span id="more-5473"></span></p>
<ul>
<li>Platform Computing started ~20 years ago.</li>
<li>Platform Computing claimed close to $100 million in revenue and &gt;500 people.</li>
<li><strong>(This is Platform Computing&#8217;s most famous splash to date.)</strong> Platform Computing technology underlies SAS Institute&#8217;s preferred method of parallelization, which may variously be called:
<ul>
<li>SAS Grid Manager (the more or less official brand name).</li>
<li><a href="../../../../../2011/04/21/sas-hpa-does-make-sense-after-all/">SAS HPA</a> (High Performance Analytics), sort of an alternate brand name.</li>
<li>MPI (Message Passing Interface), the industry&#8217;s name for the underlying semantics/syntax/API.</li>
</ul>
</li>
<li>Platform Computing&#8217;s original business was scientific grid computing.</li>
<li>Platform Computing&#8217;s second major business was its &#8220;Symphony&#8221; product line. According to Platform Computing, Symphony:
<ul>
<li>Debuted 6-7 years ago.</li>
<li>Is more commercially oriented.</li>
<li>Is what supports SAS HPA.</li>
<li>SAS aside, has been sold to Wall Street and so on.</li>
<li>Is sometimes used in conjunction with <a href="../../../../../2011/08/25/renaming-cep-or-not/">CEP/streaming</a>, mainly for backtesting.</li>
<li>Can be used to build global (parallel) persistent memory for R.</li>
</ul>
</li>
<li><strong>(This is probably why IBM is buying Platform Computing.)</strong> Platform Computing has a new MapReduce offering that:
<ul>
<li>Is based on Symphony.</li>
<li>Shipped last July, except that early access was a couple of months before that.</li>
<li>Is focused on:
<ul>
<li>Lowering the latency of MapReduce.</li>
<li>Consolidating multiple MapReduce use cases into one high(er)-utilization cluster.</li>
<li>Offering workload management in support of those goals.</li>
<li>Reliability, availability, predictability, puppies, kittens, and apple pie.</li>
</ul>
</li>
<li>Is most specifically a MapReduce run-time engine, with other stuff beyond that.</li>
</ul>
</li>
</ul>
<p>Unfortunately, I&#8217;m not clear on exactly how tied this offering is to Hadoop, although using it with Hadoop is at least the base case. Platform Computing did say:</p>
<ul>
<li>It can support multiple virtual Hadoop clusters, which can be grown or shrunk at will.</li>
<li>Non-Hadoop workloads can be mixed in.</li>
</ul>
<p>Platform Computing said that key technical benefits of this offering included:</p>
<ul>
<li><strong>1-3 seconds to start a job, vs. 40-50 in generic Hadoop.</strong></li>
<li>Automatic recovery of JobTracker nodes.</li>
<li>Failover for NameNodes.</li>
<li>Workload management that:
<ul>
<li>Manages all of CPU, I/O, and RAM (this is quickly becoming an industry standard level of capability, although I&#8217;m judging more by the standards of the analytic DBMS world).</li>
<li>Monitors but doesn&#8217;t actively manage network resources.</li>
<li>Can reprioritize jobs that are in flight. (Also an industry-standard capability.)</li>
</ul>
</li>
</ul>
<p>This conflation of scientific, commercial analytic, streaming, and MapReduce is right in IBM&#8217;s philosophical wheelhouse. I base that comment on, among other factors:</p>
<ul>
<li>How IBM positions &#8220;Big Insights&#8221;.</li>
<li>IBM&#8217;s &#8220;smart consolidation&#8221; picture/pitch (which I really should get around to posting).</li>
<li>The fuss IBM makes about Watson, Blue Gene, and so on.</li>
</ul>
<p>The IBM acquisition probably obviates a lot of Platform Computing&#8217;s previous business comments, but at the time they included:</p>
<ul>
<li>POCs (Proofs of Concept):
<ul>
<li>Mainly in financial services, government, and telecom.</li>
<li>At both existing customers and new prospects.</li>
<li>Typically running 30-50 nodes, 2-50 terabytes.* The smallest databases evidently tended to be at financial services firms.</li>
</ul>
</li>
<li>Pricing that was starting out:
<ul>
<li>Perpetual license: $3450/server, 21% annual maintenance after the first year.</li>
<li>Subscription: $2070/server annually, or $3070 with HDFS support bundled in.</li>
</ul>
</li>
</ul>
<p><em><strong>*1 terabyte or less per node</strong> is probably the lowest data-per-node figure I&#8217;ve heard for anything Hadoop-like &#8212; even below Hadapt, and well below what <a href="../../../../../2011/07/06/hadoop-hardware-and-compression/">Cloudera and Hortonworks</a> usually see.</em></p>
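<p><em>To see how the two pricing options quoted above compare over time, here is a rough Python sketch. It assumes list prices with no discounting, maintenance charged as a flat 21% of the license price in every year after the first, and the $2070 subscription without HDFS support. None of those assumptions was confirmed by Platform Computing, and the function names are invented for illustration.</em></p>
<pre>
```python
PERPETUAL_LICENSE = 3450  # dollars per server, one-time
MAINTENANCE_RATE = 0.21   # assumed: share of license price, each year after the first
SUBSCRIPTION = 2070       # dollars per server per year, without HDFS support


def perpetual_total(years):
    """Cumulative per-server cost of the perpetual license over `years` years."""
    maintenance_years = max(years - 1, 0)
    return PERPETUAL_LICENSE * (1 + MAINTENANCE_RATE * maintenance_years)


def subscription_total(years, annual_fee=SUBSCRIPTION):
    """Cumulative per-server cost of the subscription over `years` years."""
    return annual_fee * years


def first_year_perpetual_wins(annual_fee=SUBSCRIPTION, horizon=10):
    """First year in which the perpetual license is strictly cheaper, or None."""
    for year in range(1, horizon + 1):
        costs = {
            "perpetual": perpetual_total(year),
            "subscription": subscription_total(year, annual_fee),
        }
        tied = costs["perpetual"] == costs["subscription"]
        if not tied and min(costs, key=costs.get) == "perpetual":
            return year
    return None


for year in range(1, 5):
    print(year, round(perpetual_total(year), 2), subscription_total(year))
print("perpetual becomes cheaper in year", first_year_perpetual_wins())
```
</pre>
<p><em>Under those assumptions the subscription is still slightly cheaper after two years ($4140 vs. $4174.50 per server), and the perpetual license pulls ahead in year three.</em></p>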
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/10/11/ibm-is-buying-parallelization-expert-platform-computing/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>XLDB: The one conference I like to attend</title>
		<link>http://www.dbms2.com/2011/09/20/xldb-the-one-conference-i-like-to-go-to/</link>
		<comments>http://www.dbms2.com/2011/09/20/xldb-the-one-conference-i-like-to-go-to/#comments</comments>
		<pubDate>Tue, 20 Sep 2011 05:32:03 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Scientific research]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5272</guid>
		<description><![CDATA[I&#8217;m not a big fan of conferences, but I really like XLDB. Last year I got a lot out of XLDB, even though I couldn&#8217;t stay long (my elder care issues were in full swing). The year before I attended the whole thing &#8212; in Lyon, France, no less &#8212; and learned a lot more. [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m not a big fan of conferences, but I really like XLDB. <a href="../../../../../2010/10/10/xldb4-xldb/">Last year</a> I got a lot out of XLDB, even though I couldn&#8217;t stay long (my <a href="http://www.softwarememories.com/2010/11/09/for-those-who-cared-about-the-late-peter-and-anita-monash/">elder care</a> issues were in full swing). The year before I attended the whole thing &#8212; in Lyon, France, no less &#8212; and learned <a href="../../../../../2009/10/03/issues-in-scientific-data-management/">a lot more</a>. This year&#8217;s XLDB conference is at SLAC &#8212; the organization formerly known as the Stanford Linear Accelerator Center &#8212; on Sand Hill Road in Menlo Park, October 18-19. As of right now, I plan to be there, at least on the first day. XLDB&#8217;s agenda and registration details (inexpensive) can be found on the <a href="http://www-conf.slac.stanford.edu/xldb/Events.asp">XLDB conference website</a>.</p>
<p><em>The only reason I wouldn&#8217;t go is if that turned out to be a lousy week for me to travel to California.</em></p>
<p>The people who go to XLDB tend to be really smart &#8212; either research scientists, hardcore database technologists, or others who can hold their own with those folks. Audience participation can be intense; the most talkative members I can recall were Mike Stonebraker, Martin Kersten, Michael McIntire, and myself. Even the vendor folks tend to be smart &#8212; past examples include Stephen Brobst, Jeff Hammerbacher, Luke Lonergan, and IBM Fellow Laura Haas. When we had a <a href="../../../../../2011/09/05/zynga-linkedin-data-warehous/">datageek bash</a> on my last trip to the SF area, several guys said they were planning to attend XLDB as well.</p>
<p>XLDB stands for eXtremely Large DataBases, and those are indeed what gets talked about there. <span id="more-5272"></span>The lead organizer, Jacek Becla, seems to have started XLDB because he has 100 petabytes of astronomical data to plan for. XLDB is where I learned about how CERN manages what must by now be most of the particle physics data in the world. (Even the metadata for the experiment logs is over 10 terabytes.) Facebook, eBay, and Zynga are all on this year&#8217;s program.</p>
<p>XLDB&#8217;s focus is expanding a bit from data-management-only to analytic techniques as well; I tried to run a panel last year on analytics-DBMS integration before Jeff got hold of the idea and deleted the &#8220;DBMS integration&#8221; part. But in any case, I&#8217;d expect topics discussed at XLDB to be what even I might willingly label <a href="http://www.dbms2.com/2011/09/11/big-data-has-jumped-the-shark/">&#8220;big data&#8221;</a>. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/09/20/xldb-the-one-conference-i-like-to-go-to/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Hadoop notes</title>
		<link>http://www.dbms2.com/2011/09/12/hadoop-notes/</link>
		<comments>http://www.dbms2.com/2011/09/12/hadoop-notes/#comments</comments>
		<pubDate>Mon, 12 Sep 2011 09:03:52 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Health care]]></category>
		<category><![CDATA[Hortonworks]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapR]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5218</guid>
		<description><![CDATA[I visited California recently, and chatted with numerous companies involved in Hadoop &#8212; Cloudera, Hortonworks, MapR, DataStax, Datameer, and more. I&#8217;ll defer further Hadoop technical discussions for now &#8212; my target to restart them is later this month &#8212; but that still leaves some other issues to discuss, namely adoption and partnering. The total number [...]]]></description>
			<content:encoded><![CDATA[<p>I visited California recently, and chatted with numerous companies involved in Hadoop &#8212; Cloudera, Hortonworks, MapR, DataStax, Datameer, and more. I&#8217;ll defer further <a href="../../../../../2011/08/21/hadoop-evolution/">Hadoop technical discussions</a> for now &#8212; my target to restart them is later this month &#8212; but that still leaves some other issues to discuss, namely adoption and partnering.</p>
<p>The total number of enterprises in the world paying subscription and license fees that they would regard as being for &#8220;Hadoop or something Hadoop-related&#8221; probably is not much over 100 right now, but I&#8217;d expect to see pretty rapid growth. Beyond that, let&#8217;s divide customers into three groups:</p>
<ul>
<li>Internet businesses.</li>
<li>Traditional enterprises&#8217; internet operations.</li>
<li>Traditional enterprises&#8217; other operations.</li>
</ul>
<p>Hadoop vendors, in different mixes, claim to be doing well in all three segments. Even so, almost all use cases involve some kind of <a href="../../../../../2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a>, with one exception being a credit card vendor crunching a large database of transaction details. Multiple kinds of machine-generated data come into play &#8212; web/network/mobile device logs, financial trade data, scientific/experimental data, and more. In particular, pharmaceutical research got some mentions, which makes sense, in that it&#8217;s one area of scientific research that actually enjoys fat for-profit research budgets.</p>
<p><span id="more-5218"></span>On the partnering side, I heard things about a Hortonworks conference call that do not seem to have been contradicted by my visit to Hortonworks. Namely, Hortonworks promised prospective partners, such as analytic DBMS vendors, hardware vendors, or large system integrators, that it wouldn&#8217;t compete with them, in that Hortonworks pledges not to introduce its own products for at least two years. This is presumably targeted most directly at <a href="../../../../../2010/10/10/partnering-with-cloudera/">Cloudera</a>, which has lots of partners, but also some <a href="../../../../../2010/06/30/cloudera-enterprise-hadoop-evolution/">proprietary code</a> of its own. MapR, I&#8217;d think, would be the #2 target, but that&#8217;s just speculation.</p>
<p>The other big part of <a href="../../../../../2011/07/10/cloudera-and-hortonworks/">Hortonworks&#8217; story</a> is the claim that it holds the axe in Apache Hadoop development. Nobody doubts that a large fraction of the work on Hadoop&#8217;s core projects was done by Yahoo employees. Many of those indeed moved to Hortonworks; others left Yahoo earlier; Hadoop creator Doug Cutting is actually at Cloudera. So just how dominant Hortonworks really is in core Hadoop development is a bit unclear. Meanwhile, Cloudera people seem to be leading a number of Hadoop companion or sub-projects, including the first two I can think of that relate to Hadoop integration or connectivity, namely Sqoop and Flume. So I&#8217;m not persuaded that the &#8220;we know this stuff better&#8221; part of the Hortonworks partnering story really holds up.</p>
<p>What I am persuaded of is that the Hadoop platform competition is a good thing. Whichever vendors and projects win will be healthier from having had to outcompete worthy alternatives.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/09/12/hadoop-notes/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Petabyte-scale Hadoop clusters (dozens of them)</title>
		<link>http://www.dbms2.com/2011/07/06/petabyte-hadoop-clusters/</link>
		<comments>http://www.dbms2.com/2011/07/06/petabyte-hadoop-clusters/#comments</comments>
		<pubDate>Wed, 06 Jul 2011 05:15:21 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4886</guid>
		<description><![CDATA[I recently learned that there are 7 Vertica clusters with a petabyte (or more) each of user data. So I asked around about other petabyte-scale clusters. It turns out that there are several dozen such clusters (at least) running Hadoop. Cloudera can identify 22 CDH (Cloudera Distribution [of] Hadoop) clusters holding one petabyte or more [...]]]></description>
			<content:encoded><![CDATA[<p>I recently learned that there are <a href="../../../../../2011/06/20/columnar-dbms-vendor-customer-metrics/">7 Vertica clusters with a petabyte</a> (or more) each of user data. So I asked around about other petabyte-scale clusters. It turns out that there are several dozen such clusters (at least) running Hadoop.</p>
<p>Cloudera can identify 22 CDH (Cloudera Distribution [of] Hadoop) clusters holding one petabyte or more of user data each, at 16 different organizations. This does not count Facebook or Yahoo, who are huge Hadoop users but not, I gather, running CDH. Meanwhile, Eric Baldeschwieler of Hortonworks tells me that Yahoo&#8217;s latest stated figures are:</p>
<ul>
<li>42,000 Hadoop nodes &#8230;</li>
<li>&#8230; holding 180-200 petabytes of data.</li>
</ul>
<p><span id="more-4886"></span>That works out near the low end of the range I came up with for Yahoo&#8217;s newest gear, namely <a href="http://www.dbms2.com/2011/07/06/hadoop-hardware-and-compression/">36-90 TB/node</a>. Yahoo&#8217;s biggest clusters are a little over 4,000 nodes (a limitation that&#8217;s getting worked on), and Yahoo has over 20 clusters in total.</p>
<p>Based on those numbers, it would seem that 10 or more of Yahoo&#8217;s Hadoop clusters are probably in the petabyte range. Facebook no doubt has a few petabyte-scale Hadoop clusters as well. So we&#8217;re probably over 3 dozen petabyte+ Hadoop clusters, just counting Yahoo, Facebook, and CDH users. There surely are others too, running Apache Hadoop without Cloudera&#8217;s help.</p>
<p>We also have some more information about the scale of Hadoop usage, and the markets it is being used in, because Omer Trajman of Cloudera kindly wrote the following &#8212; lightly edited as usual &#8212; for quotation:</p>
<blockquote><p>The number of Petabyte+ Hadoop clusters expanded dramatically over the past year, with our recent count reaching 22 in production (in addition to the well-known clusters at Yahoo! and Facebook). Just as our poll back at Hadoop World 2010 showed the average cluster size at just over 60 nodes, today it tops 200. While mean is not the same as median (most clusters are under 30 nodes), there are some beefy ones pulling up that average. Outside of the well-known large clusters at Yahoo and Facebook, we count today 16 organizations running PB+ clusters running CDH across a diverse number of industries including online advertising, retail, government, financial services, online publishing, web analytics and academic research. We expect to see many more in the coming years, as Hadoop gets easier to use and more accessible to a wide variety of enterprise organizations.</p></blockquote>
<p>Omer went on to add:</p>
<blockquote><p>The biggest number of PB clusters are in the advertising space. I often tell people that every ad you see on the internet touched at least one Hadoop cluster (or the Google equivalent).</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/06/petabyte-hadoop-clusters/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Eight kinds of analytic database (Part 2)</title>
		<link>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/</link>
		<comments>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/#comments</comments>
		<pubDate>Tue, 05 Jul 2011 08:18:18 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Archiving and information preservation]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Buying processes]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Complex event processing (CEP)]]></category>
		<category><![CDATA[Data mart outsourcing]]></category>
		<category><![CDATA[Data types]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MOLAP]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Rainstor]]></category>
		<category><![CDATA[SAND Technology]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[SenSage]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4867</guid>
		<description><![CDATA[In Part 1 of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I&#8217;ll cover four more kinds of analytic database &#8212; even newer, for the most part, with a use case/product short list [...]]]></description>
			<content:encoded><![CDATA[<p>In <a href="http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/">Part 1</a> of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I&#8217;ll cover four more kinds of analytic database &#8212; even newer, for the most part, with a use case/product short list match that is even less clear.  <span id="more-4867"></span></p>
<p><strong><em>Bit bucket</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included: </em>Logs, other technical/external</li>
<li><em>Likely use styles:</em> Staging/ETL, investigative</li>
<li><em>Canonical example: </em>Log files in a Hadoop cluster</li>
<li><em>Stresses:</em> TCO, scale-out, transform/big-query performance, ETL functionality</li>
</ul>
<p>With the explosion of <a href="../../../../../2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a> has come the need for a place to put it all, sometimes called the <a href="../../../../../2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a>. This is like the investigative data mart for big databases, but more <a href="../../../../../2011/05/17/poly-structured-database/">poly-structured</a>. In some cases it is focused on data staging and transformation; but it can also be used for analysis in place.</p>
<p>The list of candidate technologies to run your bit bucket starts with Hadoop and Splunk.</p>
<p><strong><em>Archival data store</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included: </em>Operational, CDR (call detail record), security log</li>
<li><em>Likely use styles:</em> Archival, reporting (for compliance), possibly also investigative</li>
<li><em>Examples:</em> Any long-term detailed historical store</li>
<li><em>Stresses: </em>TCO, compression, scale-out, performance (if multi-use)</li>
</ul>
<p>Analytic DBMS vendors have been insulting each other with the claim &#8220;that&#8217;s just an archival data store,&#8221; dating back at least to the first time Greenplum was deployed on an underpowered Sun Thumper system. Perhaps only <a href="../../../../../2010/06/11/rainstor-update/">Rainstor</a> truly embraces the archival positioning, and I&#8217;ve become pretty dubious about their technical claims and their company alike.</p>
<p>Still, there&#8217;s a legitimate need for data stores, especially relational analytic DBMS, that:</p>
<ul>
<li>Store data cheaply, with high rates of compression.</li>
<li>Have decent performance if you do want to query the data.</li>
<li>May have archiving/compliance-specific features as well.</li>
</ul>
<p>Along with Rainstor, SAND and SenSage have at least partially targeted that use case. In addition, appliance vendors such as Teradata and Netezza try to have an archive-oriented product version in their lineups.</p>
<p><strong><em>Outsourced data mart</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All</li>
<li><em>Likely use styles:</em> Traditional BI, investigative analytics, staging/ETL</li>
<li><em>Examples:</em> Advertising tracking, SaaS CRM</li>
<li><em>Stresses:</em> Performance, TCO, reliability, concurrency</li>
</ul>
<p>Much of what happens in analytic database management can also be outsourced. Some applications that run via SaaS (Software as a Service) are analytic. I&#8217;ve had three different clients whose main business is picking marketing targets in various vertical segments; others who wanted to add analytics to what were historically OLTP applications; and others yet who just offered online business intelligence. Also, if your fundamental business is gathering data and reselling it to a variety of user organizations, that&#8217;s an analytic data management challenge. The possibilities expand from there.</p>
<p>Data outsourcers are in the IT business, and so their IT development is &#8212; hopefully! &#8212; more serious and less politically encumbered than at many conventional enterprises. Thus, legacy systems and master data management issues are commonly less prevalent, or at least more aggressively disposed of. The same, up to a point, goes for vendor politics.* <a href="../../../../../2011/06/26/what-to-think-about-before-you-make-a-technology-decision/">Multitenancy</a> is commonly an issue, as is running in the cloud.</p>
<p><em>*Even so, there&#8217;s often That Guy who doesn&#8217;t want to migrate away from Oracle, no matter what.</em></p>
<p>Vertica gets the nod in a number of these cases; it&#8217;s cloud-friendly, and often the problem is naturally columnar. Other columnar products can be good choices too, with added brownie points for Infobright if the shop is MySQL-oriented anyway. Running Netezza or other appliances makes sense mainly if you&#8217;re pretty sure you want to keep operating your own data centers, but some data outsourcers are just fine with that assumption.</p>
<p><strong><em>Operational analytic(s) server</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> Customer-centric, log, financial trade</li>
<li><em>Likely use styles:</em> Advanced operational analytics</li>
<li><em>Examples:</em>
<ul>
<li>Lower latency: Web or call-center personalization, anti-fraud</li>
<li>Higher latency: Customer profiling, Basel 3 risk analysis</li>
</ul>
</li>
<li><em>Stresses:</em> Performance, reliability, analytic functionality, perhaps concurrency</li>
</ul>
<p>Even with eight different choices, I need a &#8220;catch-all&#8221; category; this is it.</p>
<p>Suppose you want to do reasonably sophisticated analytics, then use the results in operations. This is the classical challenge in <a href="../../../../../2011/03/30/short-request-and-analytic-processing/">integrating short-request and analytic processing</a>. There are multiple ways to tackle it, embodying different trade-offs in cost, convenience, or analytic accuracy. If the platform on which you want to run your investigative analytics also has the reliability and concurrency appropriate for mission-critical operations, you&#8217;re set. Otherwise, you may want to pipe <a href="../../../../../2010/11/29/data-that-is-derived-augmented-enhanced-adjusted-or-cooked/">derived data</a> into a more &#8220;industrial-strength&#8221; DBMS, ideally the one that runs your operational apps anyway.</p>
<p>Another option is to integrate a limited amount of analytics immediately into your short-request processing system. For example, as bad as they are at the kinds of queries that require joins, NoSQL systems are often fast at simple aggregations. As MapReduce/NoSQL integrations mature, that option may not require pumping the data anywhere else for deeper analytics; even if it does, at least you&#8217;re starting out with the data in a convenient bit bucket.</p>
<p>Streaming/CEP-centric architectures could come into play as well. And it goes on from there. The possibilities in this last category are just too varied to generalize about.</p>
<p><em>So did I get them all? Or are there yet other analytic data management use cases that don&#8217;t fit into my eight categories?</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Eight kinds of analytic database (Part 1)</title>
		<link>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/</link>
		<comments>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/#comments</comments>
		<pubDate>Tue, 05 Jul 2011 08:17:44 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Benchmarks and POCs]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Buying processes]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[Infobright]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MOLAP]]></category>
		<category><![CDATA[Microsoft and SQL*Server]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[ParAccel]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Pricing]]></category>
		<category><![CDATA[QlikTech and QlikView]]></category>
		<category><![CDATA[SAND Technology]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Sybase]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[Workload management]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4868</guid>
		<description><![CDATA[Analytic data management technology has blossomed, leading to many questions along the lines of &#8220;So which products should I use for which category of problem?&#8221; The old EDW/data mart dichotomy is hopelessly outdated for that purpose, and adding a third category for &#8220;big data&#8221; is little help. Let&#8217;s try eight categories instead. While no categorization [...]]]></description>
			<content:encoded><![CDATA[<p>Analytic data management technology has blossomed, leading to many questions along the lines of &#8220;So which products should I use for which category of problem?&#8221; The old EDW/data mart dichotomy is hopelessly outdated for that purpose, and adding a third category for &#8220;big data&#8221; is little help.</p>
<p>Let&#8217;s try eight categories instead. While <a href="http://www.strategicmessaging.com/no-market-categorization-is-ever-precise/2011/03/01/">no categorization is ever perfect</a>, these each have at least some degree of technical homogeneity. Figuring out which types of analytic database you have or need &#8212; and in most cases you&#8217;ll need several &#8212; is a great early step in your analytic technology planning.  <span id="more-4868"></span></p>
<p><strong><em>Enterprise data warehouse</em></strong> (Full or partial)</p>
<ul>
<li><em>Kinds of data likely to be included:</em> All, but especially operational</li>
<li><em>Likely use styles:</em> All</li>
<li><em>Canonical example:</em> Central EDW for a big enterprise</li>
<li><em>Stresses:</em> Concurrency, reliability, workload management</li>
</ul>
<p>The enterprise data warehouse (EDW) ideal says that you copy all your data into one place, and drive all decision-making from there. <a href="../../../../../2011/06/21/its-official-the-grand-central-edw-will-never-happen/">Full EDWs are pipedreams</a>. Still, a partial EDW makes sense for most large enterprises, and many indeed already have one. The first product lines to consider for classical EDWs are Teradata, DB2, Exadata, and maybe Microsoft SQL Server, especially if you&#8217;re going to stress concurrency and/or operational use cases.</p>
<p><strong><em>Traditional data mart</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All</li>
<li><em>Likely use styles:</em> Business intelligence, budgeting/consolidation, investigative</li>
<li><em>Examples:</em> Reporting servers, planning/consolidation servers, anything MOLAP, etc.</li>
<li><em>Stresses:</em> Performance, concurrency, TCO</li>
</ul>
<p>Whether or not you have something like an enterprise data warehouse, it&#8217;s common to have lighter-weight data marts as well. A traditional data mart might drive reports and dashboards. Or it might be specialized for budgeting, planning, and/or consolidation.  Some <a href="../../../../../2011/03/03/investigative-analytics/">investigative analytics</a> may be in the mix as well.</p>
<p>Any DBMS that can support an EDW can also support a data mart, but it may not be the most cost-effective way to do so. Columnar DBMS might have more attractive performance and TCO (Total Cost of Ownership); the same goes for Netezza. Some of them &#8212; e.g. Sybase IQ and <a href="../../../../../2011/06/20/vertica-release-5/">Vertica</a> &#8212; have excellent track records in concurrent usage as well. <a href="../../../../../2011/05/29/when-to-use-relational-database-management-system/">Ted Codd</a> pushed what amounts to MOLAP (Multidimensional OnLine Analytic Processing) systems for these use cases. But relational DBMS commonly do a better job, which is one reason most major MOLAP products have wound up at RDBMS companies.</p>
<p><strong><em>Investigative data mart &#8212; agile</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All, especially customer-centric</li>
<li><em>Likely use styles</em>: Investigative</li>
<li><em>Canonical example:</em> A few analysts getting a few TB to examine</li>
<li><em>Stresses:</em> Ease of setup/load, ease of admin, price/performance</li>
</ul>
<p>Besides the traditional data mart, there are at least two other kinds. Both are focused on investigative analytics, but they&#8217;re differentiated by database size.</p>
<p>If you have just a few analysts,* looking at no more than a few terabytes of data (perhaps even just some gigabytes) &#8212; and if that data is &#8220;single-subject&#8221; and fairly homogeneous &#8212; your watchwords should be &#8220;cheap&#8221;, &#8220;easy&#8221;, and &#8220;fast&#8221;. You don&#8217;t need to invest in much hardware, in expensive software, or in much administrative effort (the analysts can be their own DBAs), nor should you endure much set-up time. Just grab a product, grab some data, and start running queries (or extracts into the statistical tool of your choice).</p>
<p><em>*If you have dozens or even hundreds of analysts hitting the same database, you&#8217;re probably back to the more concurrency-oriented scenarios outlined above.</em></p>
<p>Infobright is often cost-effective among columnar analytic DBMS. Other vendors might cut you a price break as well. If you have multiple terabytes of data, don&#8217;t rule out Netezza&#8217;s lowest-end products (even if they&#8217;d really rather sell you something bigger). Or, if you&#8217;re in the sub-terabyte range, maybe you can get by with an in-memory BI tool such as QlikView, and not do anything special on the DBMS side at all.</p>
<p><strong><em>Investigative data mart &#8212; big</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All, especially customer-centric, logs, financial trade, scientific</li>
<li><em>Likely use styles</em>: Investigative</li>
<li><em>Canonical example:</em> Single-subject 20 TB &#8211; 20 PB relational database</li>
<li><em>Stresses:</em> Performance, scale-out, analytic functionality</li>
</ul>
<p>But if you&#8217;re looking at tens of terabytes of relational data, or even more, you really do have a &#8220;big data&#8221; problem. Performance and scalability are major challenges, usually best addressed by MPP (Massively Parallel Processing) systems, such as Netezza, Vertica, Aster Data, ParAccel, Teradata, or Greenplum. Performance POCs (Proofs Of Concept) are a big part of the buying process. Vendor price negotiations are crucial too.</p>
<p><em>Actually, in the low tens of terabytes you might be able to get away with a shared-disk system that has excellent compression &#8212; e.g., columnar products like Sybase IQ, Infobright, or SAND, rather than just Vertica and ParAccel.</em></p>
<p>Assuming you have affordable, scalable query performance, the competitive differentiator can switch to additional analytic functionality. Aster, Netezza, ParAccel, Vertica, and Greenplum either offer full <a href="../../../../../2011/02/24/analytic-platforms/">analytic platforms</a>, or seem to be on the path to doing so. Teradata, which now owns Aster Data, offers substantial built-in analytic capability in its traditional products as well, and the same goes for Sybase IQ.</p>
<p><em>Continued in <a href="http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/">Part 2</a>,</em><em> where we cover some of the more difficult use cases.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>A few notes from XLDB 4</title>
		<link>http://www.dbms2.com/2010/10/10/xldb4-xldb/</link>
		<comments>http://www.dbms2.com/2010/10/10/xldb4-xldb/#comments</comments>
		<pubDate>Sun, 10 Oct 2010 17:49:03 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Health care]]></category>
		<category><![CDATA[Liberty and privacy]]></category>
		<category><![CDATA[Michael Stonebraker]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Scientific research]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=3132</guid>
		<description><![CDATA[As much as I believe in the XLDB conferences, I only found time to go to (a big) part of one day of XLDB 4 myself. In general:  XLDB 4 had a good crowd, including Phil Bernstein (quiet), Mike Stonebraker (not quiet), Martin Kersten (ditto), Luke Lonergan (ditto), Todd Walter (almost unrecognizable without his usual [...]]]></description>
			<content:encoded><![CDATA[<p>As much as <a href="http://www.dbms2.com/2010/07/01/why-you-should-go-to-xldb4/">I believe in the XLDB conferences</a>, I only found time to go to (a big) part of one day of XLDB 4 myself. In general:  <span id="more-3132"></span></p>
<ul>
<li>XLDB 4 had a good crowd, including Phil Bernstein (quiet), Mike Stonebraker (not quiet), Martin Kersten (ditto), Luke Lonergan (ditto), Todd Walter (almost unrecognizable without his usual cowboy gear), <a href="http://www.dbms2.com/2010/10/06/ebay-followup-greenplum-out-teradata-10-petabytes-hadoop-has-some-value-and-more/">Oliver Ratzesberger</a>, and a bunch of actual science types.</li>
<li>XLDB 4 had one weakness &#8212; panels with lots of participants, but only a single microphone among them. That tends to make for serial declamations more than true interactive discussion, at least until the audience starts chiming in, which thankfully it tends to eventually do. (I had the same problem in spades while moderating the <a href="http://www.dbms2.com/2009/11/23/boston-big-data-summit-keynote-outline/">Boston Big Data Summit panel</a> last year; at least at XLDB 4 nobody was TRYING to filibuster.)</li>
</ul>
<p>My notes have unfortunately disappeared, but from memory:</p>
<ul>
<li>Mike Stonebraker asserted that SciDB outperforms sharded MySQL by two orders of magnitude for some classes of scientific application. One of the big reasons was that SciDB lets you overlap partitions, so that for any feature you want to extract, you can be confident there&#8217;s at least one partition that actually contains it.</li>
<li>I chatted with Peter Breunig of Chevron about analytic issues in the oil &amp; gas industry. I got the impression:
<ul>
<li>Refineries are generally well-instrumented with sensors.</li>
<li>Oil wells may not be, especially the less valuable/lower producing ones.</li>
<li>He&#8217;d love to scatter passive sensors all around, waiting for natural tremors &#8212; as opposed to just geologist-set explosions &#8212; to provide more insight into what&#8217;s under the ground.</li>
<li>50-100 TB geological data sets are common. Processing them takes 2-3 weeks. As the technology gets better, so do the results (rather than the time being shortened).</li>
<li>All this suggests that there&#8217;s a huge need for better technology in reservoir analysis.</li>
<li>His other big unmet analytic desire is refinery simulation.</li>
</ul>
</li>
<li>Kevin Winsen talked about the proposed radio astronomy project <a href="http://www.atnf.csiro.au/SKA/">ASKAP</a>, which will have raw data volumes that make the <a href="http://www.dbms2.com/2009/09/12/xldb-scid/">LSST&#8217;s</a> look small. (More precisely, ASKAP is the name proposed by Australia, one of the two finalists for the location; South Africa presumably has a different name for it.) A rate of 8 petabytes/day was mentioned, although most of this will be rapidly discarded. That could be the largest unclassified data acquisition rate out there, although it&#8217;s known that there&#8217;s a classified one at &gt;10 PB/day (image data).</li>
<li>Health care researchers repeatedly complained that <strong>privacy</strong> regulations get in the way of their using clinical data for medical research. Just more grist for my &#8220;HIPAA must die so that people can live&#8221; mill.</li>
<li>Mike Stonebraker is pushing the idea of a &#8220;science benchmark.&#8221; (A <a href="http://www-conf.slac.stanford.edu/xldb10/docs/SSDB_benchmark.pdf">paper</a> on same has been posted.) The idea is that the existence of said benchmark should provide a spur for DBMS vendors to make their products run faster for scientific purposes, in line with the supposed salutary effects of <a href="http://www.softwarememories.com/2009/07/02/historical-significance-of-tpc-benchmarks/">TPC-A, TPC-B</a>, and <a href="http://itmarketstrategy.com/2010/10/08/database-benchmarks-the-gift-that-keeps-on-giving/">TPC-C</a>. Notwithstanding that attendees included Oracle, Microsoft, EMC/Greenplum, Teradata, and Aster Data &#8212; with Greenplum, IBM, and Aster Data also being sponsors &#8212; I am skeptical because:
<ul>
<li>Leaders of the XLDB effort seem convinced that <a href="http://www.dbms2.com/2009/10/04/jacek-becla-on-issues-in-scientific-data-management/">only open source DBMS can meet their needs</a>.</li>
<li>They further characterize scientific DBMS as <a href="http://www.scidb.org/about/history.php">a zero billion dollars/year market</a>.</li>
</ul>
</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/10/10/xldb4-xldb/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Nested data structures keep coming up, especially for log files</title>
		<link>http://www.dbms2.com/2010/07/31/nested-data-structures-keep-coming-up-especially-for-log-files/</link>
		<comments>http://www.dbms2.com/2010/07/31/nested-data-structures-keep-coming-up-especially-for-log-files/#comments</comments>
		<pubDate>Sat, 31 Jul 2010 10:42:06 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Theory and architecture]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2723</guid>
		<description><![CDATA[Nested data structures have come up several times now, almost always in the context of log files. Google has published about a project called Dremel. Per Tasso Argyros, one of Dremel&#8217;s key concepts is nested data structures. Those arrays that the XLDB/SciDB folks keep talking about are meant to be nested data structures. Scientific data [...]]]></description>
			<content:encoded><![CDATA[<p>Nested data structures have come up several times now, almost always in the context of log files.</p>
<ul>
<li>Google has published about a project called <a href="http://www.asterdata.com/blog/index.php/2010/07/19/google%E2%80%99s-dremel-%E2%80%93-or-can-mapreduce-itself-handle-fast-interactive-querying/">Dremel</a>. Per Tasso Argyros, one of Dremel&#8217;s key concepts is nested data structures.</li>
<li>Those <a href="http://www.dbms2.com/2009/10/03/issues-in-scientific-data-management/">arrays</a> that the XLDB/SciDB folks keep talking about are meant to be nested data structures. Scientific data is of course log-oriented. <a href="http://www.dbms2.com/2010/05/22/scidb-and-scientific-database-management/">eBay was very interested in that project too</a>.</li>
<li>Facebook&#8217;s log files have a big nested data structure flavor.</li>
</ul>
<p>I don&#8217;t have a grasp yet on what exactly is happening here, but it&#8217;s something.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/07/31/nested-data-structures-keep-coming-up-especially-for-log-files/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
	</channel>
</rss>

