<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; Petabyte-scale data management</title>
	<atom:link href="http://www.dbms2.com/category/database-theory-practice/petabyte-database-theory-practice/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Thu, 09 Feb 2012 09:21:51 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>Commercial software for academic use</title>
		<link>http://www.dbms2.com/2011/10/14/commercial-software-for-academic-use/</link>
		<comments>http://www.dbms2.com/2011/10/14/commercial-software-for-academic-use/#comments</comments>
		<pubDate>Fri, 14 Oct 2011 06:21:21 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Infobright]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Scientific research]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5483</guid>
		<description><![CDATA[As Jacek Becla explained: Academic scientists like their software to be open source, for reasons that include both free-like-speech and free-like-beer. What&#8217;s more, they like their software to be dead-simple to administer and use, since they often lack the dedicated human resources for anything else. Even so, I think that academic researchers, in the natural [...]]]></description>
			<content:encoded><![CDATA[<p>As <a href="../../../../../2009/10/04/jacek-becla-on-issues-in-scientific-data-management/">Jacek Becla</a> explained:</p>
<ul>
<li>Academic scientists like their software to be open source, for reasons that include both free-like-speech and free-like-beer.</li>
<li>What&#8217;s more, they like their software to be dead-simple to administer and use, since they often lack the dedicated human resources for anything else.</li>
</ul>
<p>Even so, I think that <strong>academic researchers,</strong> in the natural and social sciences alike, <strong>commonly overlook the wealth of commercial software</strong> that could help them in their efforts.</p>
<p>I further think that <strong>the commercial software industry could do a better job of exposing its work to academics,</strong> where by &#8220;expose&#8221; I mean:</p>
<ul>
<li>Give your stuff to academics for <strong>free.</strong></li>
<li>Call their attention to your free offering.</li>
</ul>
<p>Reasons to do so include:</p>
<ul>
<li><strong>Public benefit.</strong> Scientific research is important.</li>
<li><strong>Training future customers.</strong> There&#8217;s huge academic/commercial crossover, especially as students join the for-profit workforce.</li>
</ul>
<p><span id="more-5483"></span>The biggest issue is probably <strong>large-scale database management.</strong> There&#8217;s a feeling, permeating for example parts of the <a href="../../../../../2011/09/20/xldb-the-one-conference-i-like-to-go-to/">XLDB conference</a> and the associated SciDB project, that data stores suitable for holding large amounts of data are either:</p>
<ul>
<li>Hadoop or</li>
<li>Forbiddingly expensive.</li>
</ul>
<p>I think that&#8217;s overstated. In particular:</p>
<ul>
<li>You can put &gt;10 terabytes of machine-generated data (or any other kind) into Infobright and have it well taken care of; Infobright is open source.</li>
<li>You can put &gt;1 petabyte into [name redacted],* among others; [name redacted]* should be out soon with a generously free offering for academic users. <em>Edit: That would be <a href="http://www.dbms2.com/2011/10/18/vertica-community-edition/">Vertica</a>.</em></li>
<li>Conventional relational queries, graph analysis, statistical analysis preparation and more can all be much faster in a good analytic DBMS than in alternative kinds of data stores.</li>
<li>Integration between SQL and other analytic languages is ever improving, as analytic DBMS evolve into &#8220;<a href="../../../../../2011/02/24/analytic-platforms/">analytic platforms</a>&#8221;.</li>
</ul>
<p><em>*My permission to use the name was yanked after this post was largely drafted. I&#8217;m sufficiently pleased with the forthcoming offering itself that I can&#8217;t get upset about the procedural confusion.</em></p>
<p>With a couple of exceptions, the <strong>statistics/predictive analytics</strong> situation seems more reasonable. Industry leaders such as SAS Institute and SPSS (now an IBM company) have engaged in varying degrees of academic outreach. R is in the process of crossing over from academia to business.</p>
<p><em>Unfortunately, I know next to nothing about Stata or, elsewhere in the technical languages area, MathWorks/MATLAB. (Who knew that MathWorks was a <a href="http://www.mathworks.com/company/aboutus/">$600 million company</a>, local to my geographical area?)</em></p>
<p>One statistical tool that should perhaps be more present in academia is KXEN. KXEN seems to have some nice differentiation in not making you understand in advance which of your variables are most important. Econometricians and others with large numbers of independent variables might wish to take note.</p>
<p><em>If you think the true situation is nonlinear, and you&#8217;re trying to approximate it with linear models, you almost always have a large number of variables to consider. True, monomials in independent variables aren&#8217;t actually independent, but it might be interesting to pretend that they are and see if any insights fall out that could help in more rigorous analysis.</em></p>
<p>I&#8217;d further argue that, as part of neglecting commercial analytic DBMS, the scientific community in particular neglects the potential of <strong>integrated analytic platforms. </strong>Admittedly, the early leaders in that area &#8212; Aster Data, perhaps followed by Netezza (now an IBM company) &#8212; aren&#8217;t exactly priced in an academic-friendly way. But Vertica, EMC Greenplum, et al. are playing catch-up with analogous technology, and they&#8217;re more likely to offer appealing academic pricing.</p>
<p>There&#8217;s also the <a href="../../../../../2011/03/03/investigative-analytics/">investigative analytics</a> side of business intelligence, especially in the area of visualization/discovery. While Spotfire (now a TIBCO company) got much of its start in research-oriented areas, the otherwise more visible &#8212; no pun intended &#8212; QlikTech and Tableau don&#8217;t seem to have done much in academia. Datameer and yet-younger Hadoop-oriented business intelligence startups don&#8217;t seem to be doing much on the academic front either, more&#8217;s the pity.</p>
<p>Frankly, <strong>I think that most scientific analytic technology needs are also found in the business world.*</strong> That convergence will only get closer as businesses focus more on <a href="../../../../../2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a>. Commercial software companies should pay more attention to scientists, and scientists should gaze out more often from their ramshackle, budget-constrained ivory towers.</p>
<p><em>*The converse isn&#8217;t as true. Businesses have issues not well reflected in science, derived (for example) from the complexity of their transactional schemas, or from office-politics considerations around &#8220;one version of the truth&#8221;.</em></p>
<p><strong><em>Edit: Some links that seem relevant to this year&#8217;s XLDB program</em></strong></p>
<ul>
<li><a href="http://www.dbms2.com/2011/09/05/zynga-linkedin-data-warehous/">Zynga and LinkedIn</a></li>
<li><a href="http://www.dbms2.com/2010/06/19/objectivity-infinite-graph/">Objectivity Infinite Graph</a></li>
<li><a href="http://www.dbms2.com/2010/10/06/ebay-followup-greenplum-out-teradata-10-petabytes-hadoop-has-some-value-and-more/">eBay as of last year&#8217;s XLDB</a> (the most expensive blog post I ever wrote, in light of Greenplum&#8217;s subsequent response)</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/10/14/commercial-software-for-academic-use/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Defining NoSQL</title>
		<link>http://www.dbms2.com/2011/10/02/defining-nosql/</link>
		<comments>http://www.dbms2.com/2011/10/02/defining-nosql/#comments</comments>
		<pubDate>Mon, 03 Oct 2011 00:32:02 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Object]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Schooner Information Technology]]></category>
		<category><![CDATA[dbShards and CodeFutures]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5394</guid>
		<description><![CDATA[A reporter tweeted: &#8220;Is there a simple plain English definition for NoSQL?&#8221; After reminding him of my cynical yet accurate Third Law of Commercial Semantics, I gave it a serious try, and came up with the following. More precisely, I tweeted the bolded parts of what&#8217;s below; the rest is commentary added for this post. NoSQL [...]]]></description>
			<content:encoded><![CDATA[<p>A reporter tweeted: &#8220;Is there a simple plain English definition for NoSQL?&#8221; After reminding him of my cynical yet accurate <a href="http://www.strategicmessaging.com/no-market-categorization-is-ever-precise/2011/03/01/">Third Law of Commercial Semantics</a>, I gave it a serious try, and came up with the following. More precisely, I tweeted the bolded parts of what&#8217;s below; the rest is commentary added for this post.</p>
<p><strong>NoSQL is most easily defined by what it excludes: SQL, joins, strong analytic alternatives to those, and some forms of database integrity. If you leave all four out, and you have a strong scale-out story, you&#8217;re in the NoSQL mainstream.</strong>   <span id="more-5394"></span></p>
<ul>
<li>Thus, I&#8217;d say Cassandra, HBase, MongoDB, and Couchbase are prime examples, in no particular order. Riak as well.</li>
<li>I might have phrased that better if I&#8217;d used a different word than simply &#8220;strong&#8221; &#8212; but hey, there was a 140-character limit, and he was on deadline.</li>
</ul>
<p><strong>Using NoSQL can make sense when at least one of two things is paramount: low-cost scale-out or dynamic schemas.</strong></p>
<ul>
<li>There are some seriously sensible use cases for <a href="../../../../../2011/07/31/dynamic-fixed-schema-databases/">dynamic schemas</a>.</li>
<li>&#8220;Low-cost&#8221; generally boils down to:
<ul>
<li>Performance.</li>
<li>Open source free-like-beer.</li>
<li>Not a lot of database administration.</li>
</ul>
</li>
</ul>
<p>I&#8217;ve generally given object-oriented DBMS vendors and also MarkLogic hard times whenever they consider saying they&#8217;re &#8220;NoSQL&#8221;. Reasons include:</p>
<ul>
<li>Closed source.</li>
<li>Database administration overhead (even if you get good stuff for incurring that overhead, like MarkLogic&#8217;s comprehensive indexing).</li>
</ul>
<p>Also, NoSQL started out being ACID-unfriendly.</p>
<p><strong>What you give up are the query flexibility and the easily automatic data integrity of SQL-based systems.</strong> I should have added something about a mature ecosystem.</p>
<p>In the most recent live example, I influenced a <a href="../../../../../2011/09/19/oltp-disk-solid-state/">client</a> away from Cassandra and toward scale-out MySQL (dbShards and/or Schooner flavors, most likely). Part of the reason was the ability to do joins, which are useful in their application. Another part is that their development practices obviated any significant benefit from dynamic schemas. But perhaps the most important &#8212; or at least resonant &#8212; reason of all was that they really, really cared about .NET support.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/10/02/defining-nosql/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Confusion about Teradata&#8217;s big customers</title>
		<link>http://www.dbms2.com/2011/09/24/confusion-about-teradatas-big-customers/</link>
		<comments>http://www.dbms2.com/2011/09/24/confusion-about-teradatas-big-customers/#comments</comments>
		<pubDate>Sun, 25 Sep 2011 03:50:02 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Teradata]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5349</guid>
		<description><![CDATA[Evidently further attempts to get information on this subject would be fruitless, but anyhow: Teradata emailed me a couple of months ago saying something like that at that point they could count 16 petabyte-level customers. In response to my repeated requests for clarification, Teradata has explicitly refused to identify the metric used in reaching that [...]]]></description>
			<content:encoded><![CDATA[<p>Evidently further attempts to get information on this subject would be fruitless, but anyhow:</p>
<ul>
<li>Teradata emailed me a couple of months ago saying, in effect, that at that point they could count 16 petabyte-level customers. In response to my repeated requests for clarification, Teradata has explicitly refused to identify the metric used in reaching that conclusion.</li>
<li>At some point Teradata did something &#8212; as per a tweet of his &#8212; to convince Neil Raden that they have 20 petabyte-class users.</li>
<li>That tweet was made around the time that Teradata apparently showed a slide naming big users at the Strata conference (last week).</li>
<li>If Teradata is counting <a href="http://www.dbms2.com/2008/10/15/teradatas-petabyte-power-players/">the way they did three years ago</a>, that count of 16 or 20 or whatever is probably inflated compared to, say, <a href="http://www.dbms2.com/2011/06/20/columnar-dbms-vendor-customer-metrics/">Vertica&#8217;s figure of 7</a> a few months back.</li>
<li>Even so, it&#8217;s obvious &#8212; and not just from the <a href="http://www.dbms2.com/2010/10/06/ebay-followup-greenplum-out-teradata-10-petabytes-hadoop-has-some-value-and-more/">eBay</a> example &#8212; that Teradata has one of the most scalable analytic DBMS offerings around.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/09/24/confusion-about-teradatas-big-customers/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>&#8220;Big data&#8221; has jumped the shark</title>
		<link>http://www.dbms2.com/2011/09/11/big-data-has-jumped-the-shark/</link>
		<comments>http://www.dbms2.com/2011/09/11/big-data-has-jumped-the-shark/#comments</comments>
		<pubDate>Sun, 11 Sep 2011 13:23:51 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5212</guid>
		<description><![CDATA[I frequently observe that no market categorization is ever precise and, in particular, that bad jargon drives out good. But when it comes to &#8220;big data&#8221; or &#8220;big data analytics&#8221;, matters are worse yet. The definitive shark-jumping moment may be Forrester Research&#8217;s Brian Hopkins&#8217; claim that: &#8230; typical data warehouse appliances, even if they are [...]]]></description>
			<content:encoded><![CDATA[<p>I frequently observe that <a href="http://www.strategicmessaging.com/no-market-categorization-is-ever-precise/2011/03/01/">no market categorization is ever precise</a> and, in particular, that <a href="http://www.strategicmessaging.com/monashs-first-law-of-commercial-semantics-explained/2009/01/09/">bad jargon drives out good</a>. But when it comes to <strong>&#8220;big data&#8221;</strong> or <strong>&#8220;big data analytics&#8221;,</strong> matters are worse yet. The definitive shark-jumping moment may be <a href="http://blogs.forrester.com/brian_hopkins/11-08-29-big_data_brewer_and_a_couple_of_webinars">Forrester Research&#8217;s Brian Hopkins&#8217; claim</a> that:</p>
<blockquote><p>&#8230; typical data warehouse appliances, even if they are petascale and parallel, [are] NOT big data solutions.</p></blockquote>
<p>Nonsense almost as bad can be found in other venues.</p>
<p>Forrester seems to claim that &#8220;big data&#8221; is characterized by Volume, Velocity, Variety, and Variability. Others, less alliteratively-inclined, might put Complexity in the mix. So far, so good; after all, much of what people call &#8220;big data&#8221; is collections of disparate data streams, all collected somewhere in a <a href="../../../../../2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a>. But when people start <strong>defining</strong> &#8220;big data&#8221; to include Variety and/or Variability, they&#8217;ve gone too far.</p>
<p><span id="more-5212"></span><em>Up to that point, Hopkins &#8212; while wrong &#8212; is far from alone. The less common part of his error is to further claim that for data to be &#8220;big&#8221;, it must be stored in a way that violates the C in the CAP Theorem. Yes, the bigger the data set, the more likely that each datum has low individual value, with immediate consistency not being strictly necessary. But there are plenty of big data use cases in which data accuracy turns out to be a good idea.</em></p>
<p>It actually is reasonable to say that Volume and Velocity of data go together. If you&#8217;re storing 5 terabytes of data per day, you have a &#8220;big data&#8221; kind of problem, whether you then keep it for 30 days or 3000. It also is reasonable to say that Variety and Variability go together; indeed, I&#8217;d guess that what Forrester means by those terms corresponds to <a href="../../../../../2011/05/17/poly-structured-database/">multi-structured and poly-structured</a> respectively, and using one of those terms is generally plenty.</p>
<p>But while we can whittle four concepts down to two, the reduction should stop there. I say this because any of four combinations is possible (and not just in edge cases):</p>
<ul>
<li><em>Data can be both <strong>big</strong> and <strong>poly-structured.</strong></em> For example, consider the classic Hadoop log-collection use case, or the bigger of MarkLogic&#8217;s databases, or of Splunk&#8217;s, or even the dynamic-schema parts of relational data warehouses built by <a href="../../../../../2011/09/05/zynga-linkedin-data-warehous/">Zynga</a> and <a href="../../../../../2010/10/06/ebay-followup-greenplum-out-teradata-10-petabytes-hadoop-has-some-value-and-more/">eBay</a>. And yes, also consider some of the NoSQL-based <a href="../../../../../2011/03/30/short-request-and-analytic-processing/">short-request</a> systems Hopkins was surely thinking of as well.</li>
<li><em>Data can be both <strong>big</strong> and <strong>simply-structured.</strong></em> I think most of Teradata&#8217;s and <a href="../../../../../2011/06/20/columnar-dbms-vendor-customer-metrics/">Vertica&#8217;s</a> petabyte-scale installations would fit that description, the partial counterexamples at eBay and Zynga notwithstanding.</li>
<li><em>Data can be <strong>not-so-big</strong> and <strong>poly-structured.</strong></em> Consider, for example, a typical user of <a href="../../../../../2010/01/15/intersystems-cache-highlights/">InterSystems Cach&#233;</a>.</li>
<li><em>Data can be <strong>not-so-big</strong> and <strong>simply-structured.</strong></em> Consider, for example, most of the traditional RDBMS world.</li>
</ul>
<p>To pretend that those four possibilities are only two &#8212; &#8220;big data&#8221; and otherwise &#8212; is a travesty.</p>
<p>If the term &#8220;big data&#8221; has become useless, then what? Gartner may have switched over to <strong>&#8220;extreme data,&#8221; </strong><a href="http://www.sand.com/extreme-data/">as reported by my clients at SAND</a>, in honor of the multi-V stuff. That would be an improvement. Better yet would be to stop pretending that a matrix with two dimensions has only one. If what you mean is &#8220;huge, poly-structured databases&#8221;, then that&#8217;s what you should say, or something like it.</p>
<p>If things are bad for &#8220;big data&#8221;, they&#8217;re even worse for &#8220;big data analytics&#8221;, a term that starts out by inheriting all of big data&#8217;s problems and adds more of its own. &#8220;Big data analytics&#8221; surely means &#8220;analytics done on big data&#8221; &#8212; but nobody&#8217;s quite sure what &#8220;analytics&#8221; are. For example:</p>
<ul>
<li>I&#8217;m OK with &#8220;analytic processing&#8221; incorporating all of what might be called business intelligence, visualization (which sometimes now is just the new term for BI), data mining, machine learning, predictive analytics (which for some years has been the term for data mining and machine learning), planning, and yet more. However, &#8230;</li>
<li>&#8230; others don&#8217;t agree, and contrast &#8220;analytics&#8221; to &#8220;OLAP&#8221; and/or to &#8220;visualization&#8221;, and seem to equate &#8220;analytics&#8221; to &#8220;predictive analytics&#8221; or something similar.</li>
<li>The latter is what most people have in mind when they say &#8220;big data analytics&#8221;, but &#8230;</li>
<li>&#8230; vendors who can only lay claim to the &#8220;analytics&#8221; term in its most expansive sense claim to be doing &#8220;big data analytics&#8221; as well.</li>
</ul>
<p><a href="http://soa.sys-con.com/node/1968472">Nonsense even worse than Forrester&#8217;s</a> ensues.</p>
<p>So here&#8217;s what I propose.</p>
<ul>
<li>Nobody should ever again say that &#8220;big data&#8221; doesn&#8217;t include big relational data warehouses.</li>
<li>If your definition of &#8220;big data&#8221; goes beyond Volume and perhaps Velocity to include Variety, Variability, or Complexity &#8212; please call it something else instead. &#8220;Extreme data&#8221; sounds like a snowboarding competition or something, but at least it&#8217;s not as totally erroneous as &#8220;big&#8221;.</li>
<li>Never, ever use the phrase &#8220;big data analytics&#8221; unless you have modifiers near it, to show what kind of big data analytics you&#8217;re talking about, or at least to describe the special value you think you bring to the big data analytics process.</li>
</ul>
<p><em>Edit: <a href="http://twitter.com/#!/merv/status/113078204364890112">Merv Adrian of Gartner Group</a> has a more reasonable &#8212; and wittier! &#8212; take than Forrester&#8217;s:</em></p>
<blockquote><p><em>You won&#8217;t see us telling people &#8220;That&#8217;s not <a title="#bigdata" rel="nofollow" href="http://twitter.com/#%21/search?q=%23bigdata">#<strong>bigdata</strong></a>. This is big data.&#8221; That&#8217;s Crocodile Dundee&#8217;s job.</em></p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/09/11/big-data-has-jumped-the-shark/feed/</wfw:commentRss>
		<slash:comments>30</slash:comments>
		</item>
		<item>
		<title>Data management at Zynga and LinkedIn</title>
		<link>http://www.dbms2.com/2011/09/05/zynga-linkedin-data-warehous/</link>
		<comments>http://www.dbms2.com/2011/09/05/zynga-linkedin-data-warehous/#comments</comments>
		<pubDate>Mon, 05 Sep 2011 08:49:04 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Couchbase]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Games and virtual worlds]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Zynga]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5159</guid>
		<description><![CDATA[Mike Driscoll and his Metamarkets colleagues organized a bit of a bash Thursday night. Among the many folks I chatted with were Ken Rudin of Zynga, Sam Shah of LinkedIn, and D. J. Patil, late of LinkedIn. I now know more about analytic data management at Zynga and LinkedIn, plus some bonus stuff on LinkedIn&#8217;s [...]]]></description>
			<content:encoded><![CDATA[<p>Mike Driscoll and his <a href="http://www.metamarketsgroup.com/">Metamarkets</a> colleagues organized a bit of a <a href="http://yfrog.com/h8msmkqj">bash</a> Thursday night. Among the many folks I chatted with were Ken Rudin of Zynga, Sam Shah of LinkedIn, and D. J. Patil, late of LinkedIn. I now know more about analytic data management at Zynga and LinkedIn, plus some bonus stuff on LinkedIn&#8217;s People You May Know application. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>It&#8217;s blindingly obvious that Zynga is one of <a href="../../../../../2011/06/20/columnar-dbms-vendor-customer-metrics/">Vertica&#8217;s petabyte-scale customers</a>, given that Zynga sends 5 TB/day of data into Vertica, and keeps that data for about a year. (Zynga may retain even more data going forward; in particular, Zynga regrets ever having thrown out the first month of data for any game it&#8217;s tried to launch.) This is game actions, for the most part, rather than log files; true logs generally go into Splunk.</p>
<p><em>I don&#8217;t know whether the missing data is completely thrown away, or just stashed on inaccessible tapes somewhere.</em></p>
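<p>As a rough check on the petabyte-scale claim, the two figures above (5 TB/day, kept for about a year) can be multiplied out. The sketch below is only back-of-the-envelope arithmetic. The function name, the use of decimal units (1 PB = 1,000 TB), and the 365-day window are my assumptions, and it ignores compression and replication entirely.</p>

```python
# Back-of-the-envelope retained volume from a daily ingest rate.
# Assumptions (not from Zynga): decimal units (1 PB = 1,000 TB), a 365-day
# retention window, and no allowance for compression or replication.

TB_PER_PB = 1000


def retained_petabytes(tb_per_day, retention_days):
    """Return the volume held at steady state, in petabytes."""
    return tb_per_day * retention_days / TB_PER_PB


if __name__ == "__main__":
    # 5 TB/day kept for roughly one year
    print(retained_petabytes(5, 365))  # 1.825
```

<p>On those assumptions the steady-state figure is a bit under 2 petabytes of raw input, which is consistent with calling Zynga a petabyte-scale account.</p>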
<p>I found two aspects of the Zynga story particularly interesting. First, those 5 TB/day are going straight into Vertica (from, I presume, <a href="http://www.dbms2.com/2010/08/18/nosql-hvsp-adoption/">memcached/Membase/Couchbase</a>), as Zynga decided that sending the data to some kind of log first was more trouble than it&#8217;s worth. Second, there&#8217;s Zynga&#8217;s approach to analytic database design. Highlights of that include: <span id="more-5159"></span></p>
<ul>
<li>Data is divided into two parts. One part has a pretty ordinary schema; the other is just stored as a huge list of name-value pairs. (This is much like <a href="../../../../../2010/10/06/ebay-followup-greenplum-out-teradata-10-petabytes-hadoop-has-some-value-and-more/">eBay</a>&#8217;s approach with its Teradata-based Singularity, except that eBay puts the name-value pairs into long character strings.) About half the data is in each part, but I don&#8217;t think that&#8217;s by deliberate choice.</li>
<li>Zynga adds data into the real schema when it&#8217;s clear it will be needed for a while. This isn&#8217;t a matter of query volumes, for the most part; rather, it&#8217;s when Zynga&#8217;s tests (e.g. of new games?) have determined that the data will keep being collected and used for a while.</li>
<li>Zynga only adds columns to its analytic database; it never goes through the more complex process of deleting them.</li>
</ul>
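<p>The general shape of that design can be sketched in a few lines. To be clear, this is not Zynga&#8217;s actual schema, which I haven&#8217;t seen. The table names, column names, and sample rows are all hypothetical, and SQLite (via Python&#8217;s standard <code>sqlite3</code> module) merely stands in for Vertica so the example is runnable anywhere.</p>

```python
import sqlite3


def build_demo():
    """Create a toy two-part store: a fixed schema plus name-value pairs."""
    conn = sqlite3.connect(":memory:")
    # Part 1: the "pretty ordinary" schema (hypothetical columns).
    conn.execute(
        "CREATE TABLE game_event ("
        "event_id INTEGER PRIMARY KEY, player_id INTEGER, game TEXT)"
    )
    # Part 2: everything else, kept as a long list of name-value pairs.
    conn.execute(
        "CREATE TABLE event_attr (event_id INTEGER, name TEXT, value TEXT)"
    )
    conn.execute("INSERT INTO game_event VALUES (1, 42, 'farm')")
    conn.executemany(
        "INSERT INTO event_attr VALUES (?, ?, ?)",
        [(1, "crop", "corn"), (1, "coins", "7")],
    )
    return conn


def promote(conn, name):
    """Move one attribute into the real schema by adding a column.

    Columns are only ever added, never dropped. The attribute name is
    interpolated into the SQL, so it must come from trusted code.
    """
    conn.execute(f'ALTER TABLE game_event ADD COLUMN "{name}" TEXT')
    conn.execute(
        f'UPDATE game_event SET "{name}" = ('
        "SELECT value FROM event_attr a "
        "WHERE a.event_id = game_event.event_id AND a.name = ?)",
        (name,),
    )


if __name__ == "__main__":
    conn = build_demo()
    promote(conn, "crop")
    print(conn.execute("SELECT event_id, crop FROM game_event").fetchall())
```

<p>The point of the sketch is the asymmetry: adding a column is a single cheap statement, while removing one would mean deciding what to do with history, which is presumably part of why Zynga never bothers.</p>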
<p>Just as Zynga is one of Vertica&#8217;s flagship accounts, LinkedIn is one of Aster Data&#8217;s. Specifically, before leaving LinkedIn for Aster, Jonathan Goldman built LinkedIn&#8217;s People You May Know feature in Aster nCluster. This was long ago, and I&#8217;m not sure how sophisticated his use of <a href="../../../../../2009/03/07/three-greenplum-customers-applications-of-mapreduce/">SQL and MapReduce</a> would be in today&#8217;s terms; for example, I was told he didn&#8217;t use &#8220;nPath or anything like that.&#8221; <em>(Edit: See the comments below for clarifications from Jonathan.) </em>Anyhow, LinkedIn has replaced Aster for PYMK with Hadoop, and in my opinion is getting much better results.</p>
<p>That, from an Aster standpoint, is the bad news. The good news is that LinkedIn is happily using Aster nCluster for several other applications; LinkedIn folks don&#8217;t seem to regret throwing out* Greenplum for Aster; and they also seem to have a very high opinion of Jonathan and his work while he was there.</p>
<p><em>*And <a href="http://www.dbms2.com/2010/10/06/ebay-followup-greenplum-out-teradata-10-petabytes-hadoop-has-some-value-and-more/">this time</a> that is indeed the phrase that was used. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </em></p>
<p>One thing that astonished me is that LinkedIn PYMK is based only on data innate to LinkedIn (as opposed to imported email addresses, the results of web crawls, and so on). Given that, I am at a loss to explain how it suggested a couple of old friends, to whom I have no discernible chain of connection. Yes, we were at Harvard at the same time, but if that&#8217;s all it was, there would be a huge number of false positives I&#8217;m not actually seeing.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/09/05/zynga-linkedin-data-warehous/feed/</wfw:commentRss>
		<slash:comments>24</slash:comments>
		</item>
		<item>
		<title>Petabyte-scale Hadoop clusters (dozens of them)</title>
		<link>http://www.dbms2.com/2011/07/06/petabyte-hadoop-clusters/</link>
		<comments>http://www.dbms2.com/2011/07/06/petabyte-hadoop-clusters/#comments</comments>
		<pubDate>Wed, 06 Jul 2011 05:15:21 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4886</guid>
		<description><![CDATA[I recently learned that there are 7 Vertica clusters with a petabyte (or more) each of user data. So I asked around about other petabyte-scale clusters. It turns out that there are several dozen such clusters (at least) running Hadoop. Cloudera can identify 22 CDH (Cloudera Distribution [of] Hadoop) clusters holding one petabyte or more [...]]]></description>
			<content:encoded><![CDATA[<p>I recently learned that there are <a href="../../../../../2011/06/20/columnar-dbms-vendor-customer-metrics/">7 Vertica clusters with a petabyte</a> (or more) each of user data. So I asked around about other petabyte-scale clusters. It turns out that there are several dozen such clusters (at least) running Hadoop.</p>
<p>Cloudera can identify 22 CDH (Cloudera Distribution [of] Hadoop) clusters holding one petabyte or more of user data each, at 16 different organizations. This does not count Facebook or Yahoo, who are huge Hadoop users but not, I gather, running CDH. Meanwhile, Eric Baldeschwieler of Hortonworks tells me that Yahoo&#8217;s latest stated figures are:</p>
<ul>
<li>42,000 Hadoop nodes &#8230;</li>
<li>&#8230; holding 180-200 petabytes of data.</li>
</ul>
<p><span id="more-4886"></span>That works out near the low end of the range I came up with for Yahoo&#8217;s newest gear, namely <a href="http://www.dbms2.com/2011/07/06/hadoop-hardware-and-compression/">36-90 TB/node</a>. Yahoo&#8217;s biggest clusters are a little over 4,000 nodes (a limitation that&#8217;s getting worked on), and Yahoo has over 20 clusters in total.</p>
<p>Based on those numbers, it would seem that 10 or more of Yahoo&#8217;s Hadoop clusters are probably in the petabyte range. Facebook no doubt has a few petabyte-scale Hadoop clusters as well. So we&#8217;re probably over 3 dozen petabyte+ Hadoop clusters, just counting Yahoo, Facebook, and CDH users. There surely are others too, running Apache Hadoop without Cloudera&#8217;s help.</p>
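<p>The reasoning behind that estimate can be sketched as back-of-the-envelope arithmetic. The Python below is only an illustration: it assumes data is spread evenly across nodes and treats &#8220;over 20 clusters&#8221; as exactly 20, neither of which Yahoo has confirmed, and the function names are made up for the sketch.</p>

```python
# Figures quoted above for Yahoo's Hadoop estate.
TOTAL_NODES = 42_000
TOTAL_PB_RANGE = (180, 200)   # stated range, in petabytes
CLUSTER_COUNT = 20            # assumption: "over 20" treated as exactly 20


def nodes_for_one_pb(total_pb: int, nodes: int = TOTAL_NODES) -> int:
    """Smallest whole number of average nodes that together hold 1 PB.

    Assumes data is spread evenly, so each node holds total_pb / nodes PB.
    """
    return -(-nodes // total_pb)  # integer ceiling of nodes / total_pb


def mean_pb_per_cluster(total_pb: int, clusters: int = CLUSTER_COUNT) -> float:
    """Average petabytes per cluster under the assumed cluster count."""
    return total_pb / clusters


if __name__ == "__main__":
    print(f"mean cluster size: {TOTAL_NODES // CLUSTER_COUNT} nodes")
    for total_pb in TOTAL_PB_RANGE:
        print(
            f"{total_pb} PB total: "
            f"{mean_pb_per_cluster(total_pb):.1f} PB per cluster on average, "
            f"1 PB reached at about {nodes_for_one_pb(total_pb)} nodes"
        )
```

<p>On these assumptions a cluster of a few hundred average nodes already holds a petabyte, while the mean cluster would have about 2,100 nodes, so a guess of 10 or more petabyte-range clusters does not look aggressive. The implied per-node average is a plain division of the stated totals, and may not be measured on the same basis (raw versus compressed, replicated or not) as the per-node range linked above.</p>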
<p>We also have some more information about the scale of Hadoop usage, and the markets it is being used in, because Omer Trajman of Cloudera kindly wrote the following &#8212; lightly edited as usual &#8212; for quotation:</p>
<blockquote><p>The number of Petabyte+ Hadoop clusters expanded dramatically over the past year, with our recent count reaching 22 in production (in addition to the well-known clusters at Yahoo! and Facebook). Just as our poll back at Hadoop World 2010 showed the average cluster size at just over 60 nodes, today it tops 200. While mean is not the same as median (most clusters are under 30 nodes), there are some beefy ones pulling up that average. Outside of the well-known large clusters at Yahoo and Facebook, we count today 16 organizations running PB+ clusters running CDH across a diverse number of industries including online advertising, retail, government, financial services, online publishing, web analytics and academic research. We expect to see many more in the coming years, as Hadoop gets easier to use and more accessible to a wide variety of enterprise organizations.</p></blockquote>
<p>Omer went on to add:</p>
<blockquote><p>The biggest number of PB clusters are in the advertising space. I often tell people that every ad you see on the internet touched at least one Hadoop cluster (or the Google equivalent).</p></blockquote>
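<p>Omer&#8217;s mean-versus-median remark in the first quote is easy to see with a toy example. The node counts below are invented purely for illustration (they are not Cloudera&#8217;s survey data); the sketch uses Python&#8217;s standard <code>statistics</code> module.</p>

```python
from statistics import mean, median

# Hypothetical node counts for ten clusters; invented for illustration,
# not survey data from Cloudera or anyone else.
cluster_sizes = [10, 12, 15, 20, 20, 25, 28, 30, 40, 1800]

average_size = mean(cluster_sizes)    # pulled up by the one large cluster
typical_size = median(cluster_sizes)  # middle of the sorted list

if __name__ == "__main__":
    print(f"mean:   {average_size} nodes")
    print(f"median: {typical_size} nodes")
```

<p>One large cluster is enough to put the mean at 200 nodes while the median stays at 22.5, which is the shape of distribution the quote describes.</p>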
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/06/petabyte-hadoop-clusters/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Eight kinds of analytic database (Part 2)</title>
		<link>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/</link>
		<comments>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/#comments</comments>
		<pubDate>Tue, 05 Jul 2011 08:18:18 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Archiving and information preservation]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Buying processes]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Complex event processing (CEP)]]></category>
		<category><![CDATA[Data mart outsourcing]]></category>
		<category><![CDATA[Data types]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MOLAP]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Rainstor]]></category>
		<category><![CDATA[SAND Technology]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[SenSage]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4867</guid>
		<description><![CDATA[In Part 1 of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I&#8217;ll cover four more kinds of analytic database &#8212; even newer, for the most part, with a use case/product short list [...]]]></description>
			<content:encoded><![CDATA[<p>In <a href="http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/">Part 1</a> of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I&#8217;ll cover four more kinds of analytic database &#8212; even newer, for the most part, with a use case/product short list match that is even less clear.  <span id="more-4867"></span></p>
<p><strong><em>Bit bucket</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included: </em>Logs, other technical/external</li>
<li><em>Likely use styles:</em> Staging/ETL, investigative</li>
<li><em>Canonical example: </em>Log files in a Hadoop cluster</li>
<li><em>Stresses:</em> TCO, scale-out, transform/big-query performance, ETL functionality</li>
</ul>
<p>With the explosion of <a href="../../../../../2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a> has come the need for a place to put it all, sometimes called the <a href="../../../../../2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a>. This is like the investigative data mart for big databases, but more <a href="../../../../../2011/05/17/poly-structured-database/">poly-structured</a>. In some cases it is focused on data staging and transformation; but it can also be used for analysis in place.</p>
<p>The list of candidate technologies to run your bit bucket starts with Hadoop and Splunk.</p>
<p><strong><em>Archival data store</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included: </em>Operational, CDR (call detail record), security log</li>
<li><em>Likely use styles:</em> Archival, reporting (for compliance), possibly also investigative</li>
<li><em>Examples:</em> Any long-term detailed historical store</li>
<li><em>Stresses: </em>TCO, compression, scale-out, performance (if multi-use)</li>
</ul>
<p>Analytic DBMS vendors have been insulting each other with the claim &#8220;that&#8217;s just an archival data store,&#8221; dating back at least to the first time Greenplum was deployed on an underpowered Sun Thumper system. Perhaps only <a href="../../../../../2010/06/11/rainstor-update/">Rainstor</a> truly embraces the archival positioning, and I&#8217;ve become pretty dubious about their technical claims and their company alike.</p>
<p>Still, there&#8217;s a legitimate need for data stores (especially relational analytic DBMS) that:</p>
<ul>
<li>Store data cheaply, with high rates of compression.</li>
<li>Have decent performance if you do want to query the data.</li>
<li>May have archiving/compliance-specific features as well.</li>
</ul>
<p>Along with Rainstor, SAND and SenSage have at least partially targeted that use case. In addition, appliance vendors such as Teradata and Netezza try to have an archive-oriented product version in their lineups.</p>
<p><strong><em>Outsourced data mart</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All</li>
<li><em>Likely use styles:</em> Traditional BI, investigative analytics, staging/ETL</li>
<li><em>Examples:</em> Advertising tracking, SaaS CRM</li>
<li><em>Stresses:</em> Performance, TCO, reliability, concurrency</li>
</ul>
<p>Much of what happens in analytic database management can also be outsourced. Some applications that run via SaaS (Software as a Service) are analytic. I&#8217;ve had three different clients whose main business is picking marketing targets in various vertical segments; others who wanted to add analytics to what were historically OLTP applications; and others yet who just offered online business intelligence. Also, if your fundamental business is gathering data and reselling it to a variety of user organizations, that&#8217;s an analytic data management challenge. The possibilities expand from there.</p>
<p>Data outsourcers are in the IT business, and so their IT development is &#8212; hopefully! &#8212; more serious and less politically encumbered than at many conventional enterprises. Thus, legacy systems and master data management issues are commonly less prevalent, or at least more aggressively disposed of. The same, up to a point, goes for vendor politics.*  <a href="../../../../../2011/06/26/what-to-think-about-before-you-make-a-technology-decision/">Multitenancy</a> is commonly an issue, as is running in the cloud.<em> </em></p>
<p><em>*Even so, there&#8217;s often That Guy who doesn&#8217;t want to migrate away from Oracle, no matter what.</em></p>
<p>Vertica gets the nod in a number of these cases; it&#8217;s cloud-friendly, and often the problem is naturally columnar. Other columnar products can be good choices too, with added brownie points for Infobright if the shop is MySQL-oriented anyway. Running Netezza or other appliances makes sense mainly if you&#8217;re pretty sure you want to keep operating your own data centers, but some data outsourcers are just fine with that assumption.</p>
<p><strong><em>Operational analytic(s) server</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> Customer-centric, log, financial trade</li>
<li><em>Likely use styles:</em> Advanced operational analytics</li>
<li><em>Examples:</em>
<ul>
<li>Lower latency: Web or call-center personalization, anti-fraud</li>
<li>Higher latency: Customer profiling, Basel 3 risk analysis</li>
</ul>
</li>
<li><em>Stresses:</em> Performance, reliability, analytic functionality, perhaps concurrency</li>
</ul>
<p>Even with eight different choices, I need a &#8220;catch-all&#8221; category; this is it.</p>
<p>Suppose you want to do reasonably sophisticated analytics, then use the results in operations. This is the classical challenge in <a href="../../../../../2011/03/30/short-request-and-analytic-processing/">integrating short-request and analytic processing</a>. There are multiple ways to tackle it, embodying different trade-offs in cost, convenience, or analytic accuracy. If the platform on which you want to run your investigative analytics also has the reliability and concurrency appropriate for mission-critical operations, you&#8217;re set. Otherwise, you may want to pipe <a href="../../../../../2010/11/29/data-that-is-derived-augmented-enhanced-adjusted-or-cooked/">derived data</a> into a more &#8220;industrial-strength&#8221; DBMS, ideally the one that runs your operational apps anyway.</p>
<p>Another option is to integrate a limited amount of analytics immediately into your short-request processing system. For example, as bad as they are at the kinds of queries that require joins, NoSQL systems are often fast at simple aggregations. As MapReduce/NoSQL integrations mature, that option may not require pumping the data anywhere else for deeper analytics; even if it does, at least you&#8217;re starting out with the data in a convenient bit bucket.</p>
<p>Streaming/CEP-centric architectures could come into play as well. And it goes on from there. The possibilities in this last category are just too varied to generalize about.</p>
<p><em>So did I get them all? Or are there yet other analytic data management use cases that don&#8217;t fit into my eight categories?</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Eight kinds of analytic database (Part 1)</title>
		<link>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/</link>
		<comments>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/#comments</comments>
		<pubDate>Tue, 05 Jul 2011 08:17:44 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Benchmarks and POCs]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Buying processes]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[Infobright]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MOLAP]]></category>
		<category><![CDATA[Microsoft and SQL*Server]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[ParAccel]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Pricing]]></category>
		<category><![CDATA[QlikTech and QlikView]]></category>
		<category><![CDATA[SAND Technology]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Sybase]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[Workload management]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4868</guid>
		<description><![CDATA[Analytic data management technology has blossomed, leading to many questions along the lines of &#8220;So which products should I use for which category of problem?&#8221; The old EDW/data mart dichotomy is hopelessly outdated for that purpose, and adding a third category for &#8220;big data&#8221; is little help. Let&#8217;s try eight categories instead. While no categorization [...]]]></description>
			<content:encoded><![CDATA[<p>Analytic data management technology has blossomed, leading to many questions along the lines of &#8220;So which products should I use for which category of problem?&#8221; The old EDW/data mart dichotomy is hopelessly outdated for that purpose, and adding a third category for &#8220;big data&#8221; is little help.</p>
<p>Let&#8217;s try eight categories instead. While <a href="http://www.strategicmessaging.com/no-market-categorization-is-ever-precise/2011/03/01/">no categorization is ever perfect</a>, these each have at least some degree of technical homogeneity. Figuring out which types of analytic database you have or need &#8212; and in most cases you&#8217;ll need several &#8212; is a great early step in your analytic technology planning.  <span id="more-4868"></span></p>
<p><strong><em>Enterprise data warehouse</em></strong> (Full or partial)</p>
<ul>
<li><em>Kinds of data likely to be included:</em> All, but especially operational</li>
<li><em>Likely use styles:</em> All</li>
<li><em>Canonical example:</em> Central EDW for a big enterprise</li>
<li><em>Stresses:</em> Concurrency, reliability, workload management</li>
</ul>
<p>The enterprise data warehouse (EDW) ideal says that you copy all your data into one place, and drive all decision-making from there. <a href="../../../../../2011/06/21/its-official-the-grand-central-edw-will-never-happen/">Full EDWs are pipe dreams</a>. Still, a partial EDW makes sense for most large enterprises, and many indeed already have one. The first product lines to consider for classical EDWs are Teradata, DB2, Exadata, and maybe Microsoft SQL Server, especially if you&#8217;re going to stress concurrency and/or operational use cases.</p>
<p><strong><em>Traditional data mart</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All</li>
<li><em>Likely use styles:</em> Business intelligence, budgeting/consolidation, investigative</li>
<li><em>Examples:</em> Reporting servers, planning/consolidation servers, anything MOLAP, etc.</li>
<li><em>Stresses:</em> Performance, concurrency, TCO</li>
</ul>
<p>Whether or not you have something like an enterprise data warehouse, it&#8217;s common to have lighter-weight data marts as well. A traditional data mart might drive reports and dashboards. Or it might be specialized for budgeting, planning, and/or consolidation.  Some <a href="../../../../../2011/03/03/investigative-analytics/">investigative analytics</a> may be in the mix as well.</p>
<p>Any DBMS that can support an EDW can also support a data mart, but it may not be the most cost-effective way to do so. Columnar DBMS might have more attractive performance and TCO (Total Cost of Ownership); the same goes for Netezza. Some of them &#8212; e.g. Sybase IQ and <a href="../../../../../2011/06/20/vertica-release-5/">Vertica</a> &#8212; have excellent track records in concurrent usage as well. <a href="../../../../../2011/05/29/when-to-use-relational-database-management-system/">Ted Codd</a> pushed what amounts to MOLAP (Multidimensional OnLine Analytic Processing) systems for these use cases. But relational DBMS commonly do a better job, which is one reason most major MOLAP products have wound up at RDBMS companies.</p>
<p><strong><em>Investigative data mart &#8212; agile</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All, especially customer-centric</li>
<li><em>Likely use styles</em>: Investigative</li>
<li><em>Canonical example:</em> A few analysts getting a few TB to examine</li>
<li><em>Stresses:</em> Ease of setup/load, ease of admin, price/performance</li>
</ul>
<p>Besides the traditional data mart, there are at least two other kinds. Both are focused on investigative analytics, but they&#8217;re differentiated by database size.</p>
<p>If you have just a few analysts,* looking at no more than a few terabytes of data (perhaps even just some gigabytes), and if that data is &#8220;single-subject&#8221; and fairly homogeneous, your watchwords should be &#8220;cheap&#8221;, &#8220;easy&#8221;, and &#8220;fast&#8221;. You don&#8217;t need to invest in much hardware, in expensive software, or in much administrative effort (the analysts can be their own DBAs), nor should you endure much set-up time. Just grab a product, grab some data, and start running queries (or extracts into the statistical tool of your choice).</p>
<p><em>*If you have dozens or even hundreds of analysts hitting the same database, you&#8217;re probably back to the more concurrency-oriented scenarios outlined above.</em></p>
<p>Infobright is often cost-effective among columnar analytic DBMS. Other vendors might cut you a price break as well. If you have multiple terabytes of data, don&#8217;t rule out Netezza&#8217;s lowest-end products (even if they&#8217;d really rather sell you something bigger). Or, if you&#8217;re in the sub-terabyte range, maybe you can get by with an in-memory BI tool such as QlikView, and not do anything special on the DBMS side at all.</p>
<p><strong><em>Investigative data mart &#8212; big</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All, especially customer-centric, logs, financial trade, scientific</li>
<li><em>Likely use styles</em>: Investigative</li>
<li><em>Canonical example:</em> Single-subject 20 TB &#8211; 20 PB relational database</li>
<li><em>Stresses:</em> Performance, scale-out, analytic functionality</li>
</ul>
<p>But if you&#8217;re looking at tens of terabytes of relational data, or even more, you really do have a &#8220;big data&#8221; problem. Performance and scalability are major challenges, usually best addressed by MPP (Massively Parallel Processing) systems, such as Netezza, Vertica, Aster Data, ParAccel, Teradata, or Greenplum. Performance POCs (Proofs Of Concept) are a big part of the buying process. Vendor price negotiations are crucial too.</p>
<p><em>Actually, in the low tens of terabytes you might be able to get away with a shared-disk system that has excellent compression &#8212; e.g., columnar products like Sybase IQ, Infobright, or SAND, rather than just Vertica and ParAccel.</em></p>
<p>Assuming you have affordable, scalable query performance, the competitive differentiator can switch to additional analytic functionality. Aster, Netezza, ParAccel, Vertica, and Greenplum either offer full <a href="../../../../../2011/02/24/analytic-platforms/">analytic platforms</a>, or seem to be on the path to doing so. Teradata, which now owns Aster Data, offers substantial built-in analytic capability in its traditional products as well, and the same goes for Sybase IQ.</p>
<p><em>Continued in <a href="http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/">Part 2</a>,</em><em> where we cover some of the more difficult use cases.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Columnar DBMS vendor customer metrics</title>
		<link>http://www.dbms2.com/2011/06/20/columnar-dbms-vendor-customer-metrics/</link>
		<comments>http://www.dbms2.com/2011/06/20/columnar-dbms-vendor-customer-metrics/#comments</comments>
		<pubDate>Mon, 20 Jun 2011 05:41:54 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Games and virtual worlds]]></category>
		<category><![CDATA[Infobright]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[ParAccel]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[SAND Technology]]></category>
		<category><![CDATA[Sybase]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4742</guid>
		<description><![CDATA[Last April, I asked some columnar DBMS vendors to share customer metrics. They answered, but it took until now to iron out a couple of details. Overall, the answers are pretty impressive.  Sybase said that Sybase IQ had &#62; 2000 direct customers and &#62;500 indirect customers (i.e., end customers of OEMs). That&#8217;s counting by customers; [...]]]></description>
			<content:encoded><![CDATA[<p>Last April, I asked some columnar DBMS vendors to share customer metrics. They answered, but it took until now to iron out a couple of details. Overall, the answers are pretty impressive.  <span id="more-4742"></span></p>
<p>Sybase said that <strong>Sybase IQ</strong> had <strong>&gt;2000 direct customers</strong> and <strong>&gt;500 indirect customers</strong> (i.e., end customers of OEMs). That&#8217;s counting by customers; I know from prior discussions that Sybase IQ is running at close to two installations per customer. I also believe that Sybase counts different divisions of the same large enterprise as separate customers.</p>
<p><strong>Vertica</strong> cited a figure of <strong>500 customers</strong> as of April (end Q1?), which is close to <strong>600</strong> now, about <strong>40% or a little more direct.</strong> The difference between this and a <a href="http://www.dbms2.com/2011/02/14/now-we-know-why-vertica-has-been-so-weirdly-evasive/">2010 year-end figure of 328</a> is not only new sales, but also slow reporting by OEMs.  One cool figure &#8212; a single OEM reported 82 end sales in a single (quarterly?) report. And a number of those direct customers are substantial; Vertica&#8217;s <a href="http://www.vertica.com/customers/">customer logo</a> page features lots of telcos, lots of internet companies, and the national operation of Blue Cross/Blue Shield.</p>
<p><em>Pay no attention to small inconsistencies in the number of Vertica direct customers (250 at year-end, no more than that now); Colin Mahony just estimates these numbers for me from memory, and minor inaccuracies are quite excusable.</em></p>
<p>Even cooler &#8212; <strong>Vertica </strong>reports <strong>7 customers with a petabyte or more of user data each.</strong> About 5 of the 7 are obvious-suspect big-name firms; but unsurprisingly, those big names are NDA. I did secure permission to say that there are 2 telecom companies, a mobile gaming vendor, another internet company, and 3 financial services outfits of various kinds.</p>
<p><strong>SAND Technology </strong>reported <strong>&gt;600 total customers,</strong> including<strong> &gt;100 direct. </strong>Since SAND has been around since the 1990s, those aren&#8217;t great average annual figures, but they&#8217;re probably more than many people (including me) thought.</p>
<p><strong>Infobright</strong> reported around <strong>200 total paying customers, 130 direct.</strong> There are surely a lot more users of open source Infobright, but precise numbers are of course hard to come by.</p>
<p>If I asked <strong>ParAccel</strong> in the April go-round, I&#8217;ve misplaced their answer, but back in October the figure was &gt;30 customers, 2 of them over 100 terabytes. I&#8217;ve seen published figures of 40+ for ParAccel since.</p>
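<p>To put the direct-versus-indirect splits above on a common footing, here is a small Python sketch. It treats the &#8220;greater than&#8221; and &#8220;around&#8221; figures as exact numbers, which they are not, so the outputs are rough approximations; the helper names are invented for the example.</p>

```python
def direct_share_pct(direct: int, total: int) -> float:
    """Direct customers as a percentage of all customers, to one decimal."""
    return round(100 * direct / total, 1)


def direct_count(total: int, share_pct: int) -> int:
    """Direct customer count implied by a total and a percentage share."""
    return total * share_pct // 100


# Vendor figures as quoted above; ">" and "around" values are used as-is,
# so every result is an approximation.
sybase_iq_share = direct_share_pct(2000, 2000 + 500)
sand_share = direct_share_pct(100, 600)
infobright_share = direct_share_pct(130, 200)
vertica_direct = direct_count(600, 40)  # "about 40%" of "close to 600"

if __name__ == "__main__":
    print(f"Sybase IQ direct share:  ~{sybase_iq_share}%")
    print(f"SAND direct share:       ~{sand_share}%")
    print(f"Infobright direct share: ~{infobright_share}%")
    print(f"Vertica direct customers: ~{vertica_direct}")
```

<p>The Vertica result of roughly 240 direct customers sits just under the 250 cited for year-end, which is the kind of small inconsistency flagged in the note above.</p>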
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/06/20/columnar-dbms-vendor-customer-metrics/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Notes and links October 22, 2010</title>
		<link>http://www.dbms2.com/2010/10/22/notes-and-links-october-22-2010/</link>
		<comments>http://www.dbms2.com/2010/10/22/notes-and-links-october-22-2010/#comments</comments>
		<pubDate>Fri, 22 Oct 2010 06:47:05 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[In-memory DBMS]]></category>
		<category><![CDATA[Liberty and privacy]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[ParAccel]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[SAS Institute]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[VoltDB and H-Store]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=3346</guid>
		<description><![CDATA[A number of recent posts have had good comments. This time, I won&#8217;t call them out individually. Evidently Mike Olson of Cloudera is still telling the machine-generated data story, exactly as he should be. The Information Arbitrage/IA Ventures folks said something similar, focusing specifically on &#8220;sensor data&#8221; &#8230; &#8230; and, even better, went on to [...]]]></description>
			<content:encoded><![CDATA[<p>A number of recent posts have had good comments. This time, I won&#8217;t call them out individually.</p>
<p>Evidently <a href="http://www.cscyphers.com/blog/2010/10/12/hadoop-world-2010/">Mike Olson of Cloudera is still telling the machine-generated data story</a>, exactly as he should be. The <a href="http://informationarbitrage.com/post/1359525958/big-ideas-around-big-problems-in-big-data">Information Arbitrage/IA Ventures</a> folks said something similar, focusing specifically on &#8220;sensor data&#8221; &#8230;</p>
<p>&#8230; and, even better, went on to say:  <span id="more-3346"></span></p>
<blockquote><p><strong>Privacy is dead</strong>.<br />
What do we consider to be the boundaries of privacy, especially with respect to items like medical data? In a data privacy-free world, should we be regulating data usage instead? How do we deal with asymmetric access to our personal data, e.g., how is it that insurance companies claim the right to our personal information?</p></blockquote>
<p>Obviously, <a href="http://www.dbms2.com/2010/04/04/privacy-liberty-continued/">my answer to the second question is Yes!!!!</a></p>
<p>Also from Hadoop World &#8212; Dave Menninger, now an analyst, reports on <a href="http://www.ventanaresearch.com/blog/commentblog.aspx?id=4003">some Hadoop metrics</a>:</p>
<blockquote><p>How big is “big data”? In his opening remarks, Mike shared some statistics from a survey of attendees. The average Hadoop cluster among respondents was 66 nodes and 114 terabytes of data. However there is quite a range. The largest in the survey responses was a cluster of 1,300 nodes and more than 2 petabytes of data. (Presenters from eBay blew this away, describing their production cluster of 8,500 nodes and 16 petabytes of storage.) Over 60 percent of respondents had 10 terabytes or less, and half were running 10 nodes or less.</p></blockquote>
<p><a href="http://www.dbms2.com/2010/10/06/ebay-followup-greenplum-out-teradata-10-petabytes-hadoop-has-some-value-and-more/">That eBay comment was particularly interesting</a>. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
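<p>For a rough sense of data density, the figures in the quote can be reduced to terabytes per node. The Python sketch below assumes 1 PB = 1,000 TB and takes &#8220;more than 2 petabytes&#8221; as exactly 2, so it is a floor for that cluster; the names are illustrative only.</p>

```python
TB_PER_PB = 1_000  # assumption: decimal units


def tb_per_node(terabytes: float, nodes: int) -> float:
    """Terabytes per node, rounded to two decimals."""
    return round(terabytes / nodes, 2)


# (terabytes, nodes) taken from the quoted survey summary.
quoted_figures = {
    "survey average": (114, 66),
    "largest survey response": (2 * TB_PER_PB, 1_300),  # "more than 2 PB": a floor
    "eBay production cluster": (16 * TB_PER_PB, 8_500),
}

density = {
    name: tb_per_node(tb, nodes) for name, (tb, nodes) in quoted_figures.items()
}

if __name__ == "__main__":
    for name, value in density.items():
        print(f"{name}: ~{value} TB/node")
```

<p>All three land between about 1.5 and 1.9 TB per node, with two caveats: dividing an average data volume by an average node count is not the same as averaging per-cluster densities, and the eBay number is described as storage, which may not mean the same thing as data held.</p>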
<p>A while back, Doug Henschen noted that Netezza flagship reference Catalina Marketing is now at <a href="http://intelligent-enterprise.informationweek.com/blog/archives/2010/07/big_data_the_ea.html#more">2.5 petabytes</a>. Most of that is in one 600 billion row table. Oddly, the article talks of the Netezza/SAS partnership accelerating model-building via in-database scoring (not modeling) technology. Doug also wrote of a lot of <a href="http://intelligent-enterprise.informationweek.com/blog/archives/2010/08/whats_at_stake.html#more">analytic DBMS replacements</a>, including:</p>
<ul>
<li>Microsoft by ParAccel</li>
<li>Oracle by Aster Data, IBM, Oracle Exadata, probably Netezza, and probably Hadoop</li>
<li>Netezza by Greenplum</li>
<li>IBM by Teradata</li>
</ul>
<p>Carl Olofson pointed out on Twitter that <a href="http://www.oracle.com/us/corporate/Acquisitions/datascaler/index.html">DataScaler was an in-memory database technology just bought by Oracle</a>. This inspired me to google on them, and I found a sparse <a href="http://www.svadventure.com/">DataScaler CEO blog</a>. I link it because of an amusing juxtaposition &#8212; the second-to-last post says, in effect, &#8220;We make appliances and we recommend all these awesome technology design partners who helped us design the hardware,&#8221; while the very last post says &#8220;Designing our own hardware was a mistake.&#8221; <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p><a href="http://www.dbms2.com/2010/07/23/some-interesting-links/">Fred Holahan</a> is now VP of Marketing at <a href="http://www.dbms2.com/2010/05/25/voltdb-finally-launches/">VoltDB</a>, which is a lesson to me about giving free consulting &#8230; Anyhow, Fred tells me that VoltDB has about a dozen users on their way to production, some of whom are headed to being VoltDB paying customers, some of whom are not.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/10/22/notes-and-links-october-22-2010/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

