<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; Greenplum</title>
	<atom:link href="http://www.dbms2.com/category/products-and-vendors/greenplum/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Wed, 08 Feb 2012 12:22:57 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>Comments on the 2012 Forrester Wave: Enterprise Hadoop Solutions</title>
		<link>http://www.dbms2.com/2012/02/06/comments-on-the-2012-forrester-wave-enterprise-hadoop-solutions/</link>
		<comments>http://www.dbms2.com/2012/02/06/comments-on-the-2012-forrester-wave-enterprise-hadoop-solutions/#comments</comments>
		<pubDate>Mon, 06 Feb 2012 05:16:20 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EMC]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Hortonworks]]></category>
		<category><![CDATA[MapR]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Pentaho]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5886</guid>
		<description><![CDATA[Forrester has released its Q1 2012 Forrester Wave: Enterprise Hadoop Solutions. (Googling turns up a direct link, but in case that doesn&#8217;t prove stable, here also is a registration-required link from IBM&#8217;s Conor O&#8217;Mahony.) My comments include: The Forrester Wave&#8217;s relative vendor rankings are meaningless, in that the document compares apples, peaches, almonds, and peanuts. [...]]]></description>
			<content:encoded><![CDATA[<p>Forrester has released its Q1 2012 Forrester Wave: Enterprise Hadoop Solutions. (Googling turns up a <a href="http://www.forrester.com/rb/go?docid=60755&amp;oid=1-K07LCA&amp;action=5">direct link</a>, but in case that doesn&#8217;t prove stable, here also is <a href="http://database-diary.com/2012/02/02/get-a-free-copy-of-the-forrester-wave-for-enterprise-hadoop-solutions/">a registration-required link from IBM&#8217;s Conor O&#8217;Mahony</a>.) My comments include:</p>
<ul>
<li>The Forrester Wave&#8217;s <strong>relative vendor rankings are meaningless,</strong> in that the document compares apples, peaches, almonds, and peanuts. Apparently, it covers any vendor that includes a distribution of Apache Hadoop MapReduce into something it offers, and that offered at least two (not necessarily full production) references for same.</li>
<li>The Forrester Wave for &#8220;enterprise Hadoop&#8221; contradicts itself on the subject of Hortonworks.
<ul>
<li>The Forrester Wave for &#8220;enterprise Hadoop&#8221; is correct when it says <strong>&#8220;Hortonworks &#8230; has Hadoop training and professional services offerings that are still embryonic.&#8221;</strong></li>
<li>Peculiarly, the Forrester Wave for &#8220;enterprise Hadoop&#8221; also says &#8220;Hortonworks offers an impressive Hadoop professional services portfolio&#8221;. Hortonworks will likely win one or more nice partnership deals with vendors in adjacent fields, but even so its professional services capabilities are &#8230; well, a good word might be &#8220;embryonic&#8221;.</li>
</ul>
</li>
<li><a href="http://www.dbms2.com/2011/02/11/comments-on-the-2011-forrester-wave-for-enterprise-data-warehouse-platforms/">Forrester Waves always seem to have weird implicit definitions of &#8220;data warehousing&#8221;</a>. This one is no exception.</li>
<li>Forrester gave top marks in &#8220;Functionality&#8221; to 11 of 13 &#8220;enterprise Hadoop&#8221; vendors. This seems odd.</li>
<li>I don&#8217;t know why MapR, which doesn&#8217;t like HDFS (Hadoop Distributed File System), got top marks in &#8220;Subproject integration&#8221;.</li>
<li>Forrester gave top marks in &#8220;Storage&#8221; to Datameer. It also gave higher marks to MapR than to EMC Greenplum, even though EMC Greenplum&#8217;s technology is a superset of MapR&#8217;s. Very strange. <em>(Edit: Actually, as per a comment below, there is some uncertainty about the EMC/MapR relationship.)</em></li>
<li>Forrester gave higher marks in &#8220;Acceleration and optimization&#8221; to Hortonworks than to Cloudera and IBM, and higher marks yet to Pentaho. Very odd.</li>
<li>I&#8217;m not sure what Forrester is calling a &#8220;Distributed EDW file store connector&#8221;, but it sounds like something that Cloudera has provided via partnership to a number of analytic DBMS vendors.</li>
<li>Forrester&#8217;s &#8220;Strategy&#8221; rankings seem to correlate to a metric of &#8220;We&#8217;re a large enough vendor to go in N directions at once&#8221;, for various values of N.</li>
<li>Forrester is correct to rank Cloudera&#8217;s &#8220;Adoption&#8221; as being stronger than EMC/Greenplum&#8217;s or MapR&#8217;s. But Hortonworks&#8217; strong mark for &#8220;Adoption&#8221; baffles me.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/02/06/comments-on-the-2012-forrester-wave-enterprise-hadoop-solutions/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Hope for a new PostgreSQL era?</title>
		<link>http://www.dbms2.com/2011/11/23/hope-for-a-new-postgresql-era/</link>
		<comments>http://www.dbms2.com/2011/11/23/hope-for-a-new-postgresql-era/#comments</comments>
		<pubDate>Wed, 23 Nov 2011 14:18:00 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[EnterpriseDB and Postgres Plus]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[salesforce.com]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5728</guid>
		<description><![CDATA[In a comedy of briefing errors, I&#8217;m not too clear on the details of my client salesforce.com&#8217;s new PostgreSQL-as-a-service offering, nor exactly on what my clients at VMware are bringing to the PostgreSQL virtualization/cloud party. That said: PostgreSQL is good technology. MySQL is narrowing the gap, but PostgreSQL is still ahead of MySQL in some [...]]]></description>
			<content:encoded><![CDATA[<p>In a comedy of briefing errors, I&#8217;m not too clear on the details of my client <a href="http://gigaom.com/cloud/heroku-launches-sql-database-as-a-service/">salesforce.com&#8217;s new PostgreSQL-as-a-service offering</a>, nor exactly on what my clients at VMware are bringing to the PostgreSQL virtualization/cloud party. That said:</p>
<ul>
<li>PostgreSQL is good technology.</li>
<li>MySQL is narrowing the gap, but PostgreSQL is still ahead of MySQL in some ways. (Database extensibility if nothing else.)</li>
<li>PostgreSQL has a lot of users. (Many of them in academia and/or Russia.)</li>
<li>Neither EnterpriseDB (which now calls itself &#8220;The enterprise PostgreSQL company&#8221;) nor the PostgreSQL community leadership have covered themselves with stewardship glory.</li>
<li>A significant number of interesting DBMS products can be regarded as PostgreSQL forks (e.g. Greenplum, Aster Data nCluster, Netezza if you squint, and Vertica if you stand on your head*).</li>
<li>PostgreSQL advancement is not dead. For example, <a href="../../../../../2011/11/08/hadapt-is-moving-forward/">Hadapt beta users are running actual PostgreSQL on many nodes each</a>.</li>
<li><a href="../../../../../2009/12/14/oracle-mysql-storage-engine/">There&#8217;s no assurance that Oracle will be a benevolent MySQL steward forever</a>. (Specifically, Oracle&#8217;s &#8220;Play nicely with others&#8221; antitrust commitments expire in 2014.)</li>
</ul>
<p>So I think it would be cool if one or the other big company put significant wood behind the PostgreSQL arrow.</p>
<p><em>*While Vertica was originally released using little or no PostgreSQL code &#8212; reports varied &#8212; it featured high degrees of PostgreSQL compatibility.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/23/hope-for-a-new-postgresql-era/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Analytic trends in 2012: Q&amp;A</title>
		<link>http://www.dbms2.com/2011/11/21/analytic-trends-in-2012-qa/</link>
		<comments>http://www.dbms2.com/2011/11/21/analytic-trends-in-2012-qa/#comments</comments>
		<pubDate>Mon, 21 Nov 2011 11:00:23 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EMC]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[HP and Neoview]]></category>
		<category><![CDATA[QlikTech and QlikView]]></category>
		<category><![CDATA[SAP AG]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[Tableau Software]]></category>
		<category><![CDATA[Vertica Systems]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5692</guid>
		<description><![CDATA[As a new year approaches, it&#8217;s the season for lists, forecasts and general look-ahead. Press interviews of that nature have already begun. And so I&#8217;m working on a trilogy of related posts, all based on an inquiry about hot analytic trends for 2012. This post is a moderately edited form of an actual interview. Two [...]]]></description>
			<content:encoded><![CDATA[<p>As a new year approaches, it&#8217;s the season for lists, forecasts and general look-ahead. Press interviews of that nature have already begun. And so I&#8217;m working on a trilogy of related posts, all based on an inquiry about hot analytic trends for 2012.</p>
<p>This post is a moderately edited form of an actual interview. Two other posts cover analytic trends to watch (planned) and <a href="http://www.dbms2.com/2011/11/21/big-vendor-execution-analytics/">analytic vendor execution challenges to watch</a> (already up).</p>
<p><span id="more-5692"></span><strong>Question</strong>: What do you think will happen next year with the Tableaus of the world?</p>
<p><strong>Answer:</strong></p>
<ul>
<li>I think adoption of flexible-visualization business intelligence tools will continue to be rapid.</li>
<li>I think enterprise-friendly features will be increasingly important as a basis of competition.</li>
</ul>
<p><strong>Question</strong>: What do you mean by &#8220;enterprise-friendly&#8221;?</p>
<p><strong>Answer</strong>: An example would be <a href="http://www.dbms2.com/2011/11/16/qlikview-collaborative-business-intelligence/">QlikTech no longer forcing you to use their native ETL</a>, but rather working with Informatica and soon other third-party products. Also important can be:</p>
<ul>
<li>Database size.</li>
<li>Concurrency.</li>
<li>A full-featured development cycle for analytic applications.</li>
</ul>
<p><strong>Question</strong>: What does HP have to do to be relevant in analytics/data warehousing?</p>
<p><strong>Answer</strong>: Avoid stupidity. HP Vertica is already relevant.</p>
<p><strong>Question</strong>: OK. But what can HP do to build on Vertica?</p>
<p><strong>Answer</strong>: HP &#8212; which botched Exadata 1 hardware &#8212; could do a good job with SAP HANA or other kinds of appliance products.</p>
<p>However:</p>
<ul>
<li>I don&#8217;t think trying to force Vertica beyond its natural growth &#8212; <a href="http://www.dbms2.com/2011/04/16/unpacking-the-emc-greenplum-q1-sales-disaster-rumors/">the way EMC is with Greenplum</a> &#8212; is necessarily a good idea. Natural growth in Vertica&#8217;s case is plenty fast anyway.</li>
<li>Obviously, making good Vertica hardware would be nice. But being hardware-independent is crucial to Vertica, not least because of cloud deployment, an option many buyers want to at least have in their hip pockets.</li>
</ul>
<p><strong>Question</strong>: You expressed some skepticism toward mobile BI/use cases. Why so?</p>
<p><strong>Answer</strong>: The form factor hurts functionality a lot, so it&#8217;s only worthwhile in cases where timeliness is key.</p>
<p>And without more refined alert-setting functionality, it&#8217;s hard to think of that many cases.</p>
<p><em>Note: My views on mobile BI haven&#8217;t changed much since <a href="../../../../../2010/07/15/mobile-business-intelligence/">July, 2010</a>.</em></p>
<p><strong>Question</strong>: What about the idea of an enterprise being able to pay-per-drink to run jobs on an analytic cluster? Do you expect that concept to have any legs in 2012?</p>
<p><strong>Answer</strong>: While other kinds of SaaS (Software as a Service) BI might make sense, remote computing BI that focuses on hardware cost sharing is problematic. Moving data in and out of the cluster is a big part of the overall cost, at least if you plan to process it only occasionally once it gets there. I haven&#8217;t seen a plan yet that gets around that point.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/21/analytic-trends-in-2012-qa/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Highlights of a busy news week</title>
		<link>http://www.dbms2.com/2011/09/26/highlights-of-a-busy-news-week/</link>
		<comments>http://www.dbms2.com/2011/09/26/highlights-of-a-busy-news-week/#comments</comments>
		<pubDate>Mon, 26 Sep 2011 05:50:35 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[DataStax]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Ingres]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[VectorWise]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5372</guid>
		<description><![CDATA[I put up 14 posts over the past week, so perhaps you haven&#8217;t had a chance yet to read them all. Highlights included: My most important post of the week was a general guide to IT vendor strategy. That one has already spawned discussion at many companies, from the tiny to the multi-billion-dollar. The best [...]]]></description>
			<content:encoded><![CDATA[<p>I put up 14 posts over the past week, so perhaps you haven&#8217;t had a chance yet to read them all. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  Highlights included:</p>
<ul>
<li>My most important post of the week was a general <a href="http://www.strategicmessaging.com/strategy-for-it-vendors-a-worksheet/2011/09/18/">guide to IT vendor strategy</a>. That one has already spawned discussion at many companies, from the tiny to the multi-billion-dollar.</li>
<li>The best comment thread of the week was probably on my post about <a href="http://www.dbms2.com/2011/09/19/oltp-disk-solid-state/">scale-out relational OLTP choices</a>, in which people discussed the merits of various particular alternatives.</li>
<li>I recommended that people strongly consider attending <a href="http://www.dbms2.com/2011/09/20/xldb-the-one-conference-i-like-to-go-to/">XLDB 5 in Menlo Park on October 18-19</a>.</li>
</ul>
<p>Most of the posts, however, were reactions to news events. In particular:</p>
<ul>
<li>Teradata announced that <a href="http://www.dbms2.com/2011/09/22/teradata-columnar-compression/">Teradata 14 will be hybrid-columnar</a>, more in Vertica&#8217;s way than in Greenplum&#8217;s or Aster Data&#8217;s. (Pay no attention to the <em>Wall Street Journal&#8217;s</em> apparent belief that <a href="http://www.dbms2.com/2011/09/22/hybrid-columnar-soundbites/">no other analytic DBMS is hybrid-columnar at all</a>.)</li>
<li>Aster announced the unsurprising news that there will be a Teradata Aster appliance. Also, <a href="http://www.dbms2.com/2011/09/22/aster-database-release-5-and-teradata-aster-appliance/">Aster talked about greater analytic flexibility in the forthcoming Aster 5.0</a>.</li>
<li>With Oracle OpenWorld coming up, Oracle decided to get some of its announcing out of the way early. In particular, it announced the <a href="http://www.dbms2.com/2011/09/21/oracle-database-appliance-soundbites/">Oracle Database Appliance</a>, which is small-business-friendly hardware for running the Oracle DBMS. However, the Oracle Database Appliance doesn&#8217;t seem to do much about the complexity of running the Oracle DBMS software.</li>
<li>In <a href="http://www.dbms2.com/2011/09/23/hadoop-appliances/">a catch-all Hadoop post</a>, I noted that:
<ul>
<li>Oracle has now clearly said it has a Hadoop appliance coming, no doubt next week at OpenWorld.</li>
<li>I still can&#8217;t see why Hadoop appliances would succeed, but a lot of smart folks seem to disagree with me.</li>
<li>Greenplum announced what looks like a nice but unimportant little product upgrade.</li>
<li>It&#8217;s a really good thing that previously reported plans to revamp Hadoop are underway.</li>
</ul>
</li>
<li>DataStax announced that <a href="http://www.dbms2.com/2011/09/22/datastax-pivots-back-to-its-original-strategy/">it really is a Cassandra company after all</a>. Pay no attention to previous marketing that seemed to put DataStax in the same Hadoop-alternative category as, say, MapR.</li>
<li><a href="../2011/09/25/ingres-actian/">Ingres has changed its name to Actian</a>. The announcement seems like a confession that Ingres and VectorWise are going nowhere.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/09/26/highlights-of-a-busy-news-week/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Some notes on Hadoop (mainly) and appliances</title>
		<link>http://www.dbms2.com/2011/09/23/hadoop-appliances/</link>
		<comments>http://www.dbms2.com/2011/09/23/hadoop-appliances/#comments</comments>
		<pubDate>Fri, 23 Sep 2011 19:59:42 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[EMC]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapR]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5341</guid>
		<description><![CDATA[1. EMC Greenplum has evolved its appliance product line. As I read that, the latest announcement boils down to saying that you can neatly network together various Greenplum appliances in quarter-rack increments. If you take a quarter rack each of four different things, then Greenplum says &#8220;Hooray! Our appliance is all-in-one!&#8221; Big whoop. 2. That [...]]]></description>
			<content:encoded><![CDATA[<p>1. <a href="http://www.greenplum.com/products/greenplum-dca">EMC Greenplum has evolved its appliance product line</a>. As I read that, the latest announcement boils down to saying that you can neatly network together various Greenplum appliances in quarter-rack increments. If you take a quarter rack each of four different things, then Greenplum says &#8220;Hooray! Our appliance is all-in-one!&#8221; Big whoop.</p>
<p>2. That said, the Hadoop part of EMC&#8217;s story is based on MapR, which so far as I can tell is actually a pretty good Hadoop implementation. More precisely, MapR makes strong claims about performance and so on, and Apache Hadoop folks don&#8217;t reply &#8220;MapR is full of &amp;#$!&#8221; Rather, they say &#8220;We&#8217;re going to close the gap with MapR a lot faster than the MapR folks like to think &#8212; and by the way, guys, thanks for the butt-kick.&#8221; A lot more precision about MapR may be found in this <a href="http://www.slideshare.net/mcsrivas/design-scale-and-performance-of-maprs-distribution-for-hadoop">M. C. Srivas SlideShare</a>.</p>
<p>3. On its latest earnings call, Oracle clearly <a href="http://seekingalpha.com/article/294885-oracle-s-ceo-discusses-q1-2012-results-earnings-call-transcript?part=qanda">said it would introduce a Hadoop appliance</a>, versus just <a href="../../../../../2011/06/24/forthcoming-oracle-appliances/">hinting at a Hadoop appliance</a> the prior quarter. The money quote was:  <span id="more-5341"></span></p>
<blockquote><p>Finally, big data or the searching of large amounts of data using Hadoop. After Hadoop finishes filtering the data, the place you want to put that data is an Oracle Database, and that&#8217;s what a lot of our customers are doing. And we are exploiting the trend, the big data technology and the big data trend, if you prefer, by building a Hadoop appliance that attaches to the Oracle Exadata database or any Oracle Database for that matter. But you don&#8217;t have to buy our Hadoop appliance if you can use whatever servers you want running Hadoop, and we provide the interface between Hadoop and the Oracle Database.</p></blockquote>
<p>In other words, Oracle is saying &#8220;We&#8217;d like to sell you a Hadoop appliance, but you can run Hadoop in some other way and we&#8217;ll coexist with it just fine.&#8221; That makes sense; refusing to coexist with Hadoop is not exactly a realistic option.</p>
<p>4. Back in June, I expressed <a href="../../../../../2011/06/02/why-you-would-want-an-appliance-and-when-you-wouldnt/">great skepticism about the idea of a Hadoop appliance</a>. There was at least partial pushback in the comment thread from both Amr Awadallah and Eric Baldeschwieler. Oops.</p>
<p>Their reasoning seems to be centered around matters of installation, administration, and general packaging.</p>
<p>5. A month ago I noted aggressive near-term plans for <a href="../../../../../2011/08/21/hadoop-evolution/">Apache Hadoop evolution</a>. As noted above, one reason this is needed is competition from folks like MapR. Also, I note that:</p>
<ul>
<li>Three years ago, Oliver Ratzesberger&#8217;s group at eBay complained that <a href="../../../../../2008/10/15/ebay-doesnt-love-mapreduce/">CPU utilization running Hadoop was at 18%</a>.</li>
<li><a href="../../../../../2011/08/21/hadoop-evolution/#comment-241679">Now Oliver uses a figure of 10-15%</a>, and attributes an even lower figure to &#8212; I&#8217;m guessing here &#8212; Yahoo. (Another possibility might be Facebook.)</li>
<li>In between, eBay became one of the biggest and most prominent users of Hadoop.</li>
</ul>
<p>The moral of eBay&#8217;s Hadoop adventures, as I see it, is neither &#8220;Hadoop sucks!&#8221; nor &#8220;Hadoop doesn&#8217;t suck!&#8221;; rather, it&#8217;s that there&#8217;s a lot of scope for Hadoop to operate differently in the future than it does today.</p>
<p><em>Similarly, whatever throughput Yahoo does or doesn&#8217;t get, it clearly has adopted Hadoop at the expense of the <a href="../../../../../2008/05/29/yahoo-scales-web-analytics-database-petabyte/">columnar-in-Postgres</a> system it previously was so proud of.</em></p>
<p>Also, there has been a claim going around that &#8212; notwithstanding NameNode&#8217;s status as a single point of Hadoop failure &#8212;  no Hadoop installation has ever lost data due to a NameNode failure. The folks at MapR beg to differ, and sent over <a href="https://issues.apache.org/jira/browse/HDFS-1539">some</a> <a href="http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201107.mbox/%3CCAFUA3X2R_wH9GGGseUVSXVNVZQ+dBjZKDn0_pmDO8U31C05tMw@mail.gmail.com%3E">links</a> that sure seem to say the opposite.</p>
<p>6. Since we&#8217;ve just established that Hadoop will change, rapidly and pretty fundamentally, what exactly is the benefit of an appliance that is &#8220;balanced&#8221; for Hadoop usage today?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/09/23/hadoop-appliances/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Hybrid-columnar soundbites</title>
		<link>http://www.dbms2.com/2011/09/22/hybrid-columnar-soundbites/</link>
		<comments>http://www.dbms2.com/2011/09/22/hybrid-columnar-soundbites/#comments</comments>
		<pubDate>Thu, 22 Sep 2011 18:06:30 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Teradata]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5326</guid>
		<description><![CDATA[Busy couple of days talking with reporters. A few notes on hybrid-columnar analytic DBMS, all backed up by yesterday&#8217;s post on Teradata columnar: Oracle does not actually offer columnar I/O; the other three systems do. But see the &#8220;I won&#8217;t be surprised&#8221; part in yesterday&#8217;s Teradata post. Aster does not offer columnar compression; the other [...]]]></description>
			<content:encoded><![CDATA[<p>Busy couple of days talking with reporters. A few notes on hybrid-columnar analytic DBMS, all backed up by <a href="http://www.dbms2.com/2011/09/22/teradata-columnar-compression/">yesterday&#8217;s post on Teradata columnar</a>:</p>
<ul>
<li>Oracle does not actually offer columnar I/O; the other three systems do. But see the &#8220;I won&#8217;t be surprised&#8221; part in yesterday&#8217;s Teradata post.</li>
<li>Aster does not offer columnar compression; the other three do.</li>
<li>EMC Greenplum and Teradata offer different ways to mix column and row storage in the same table; each has its advantages.</li>
<li>Teradata generally has a more mature and capable offering than EMC Greenplum, for most purposes, whichever way you choose to organize your tables.</li>
</ul>
<p><em>Edit: The <a href="http://online.wsj.com/article/BT-CO-20110921-715547.html">Wall Street Journal</a> got this wrong, writing that Teradata was the first-ever hybrid columnar system. Specifically, they wrote:</em></p>
<blockquote><p><em>While columnar technology has been around for years, Teradata says its product is unique because it allows users to include both columns and rows in the same database.</em></p></blockquote>
<p><em>Googling on &#8220;Teradata To Unveil New Analytics Product To Speed Business Adoption&#8221; might get you around the paywall to see the offending piece.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/09/22/hybrid-columnar-soundbites/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Data management at Zynga and LinkedIn</title>
		<link>http://www.dbms2.com/2011/09/05/zynga-linkedin-data-warehous/</link>
		<comments>http://www.dbms2.com/2011/09/05/zynga-linkedin-data-warehous/#comments</comments>
		<pubDate>Mon, 05 Sep 2011 08:49:04 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Couchbase]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Games and virtual worlds]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Zynga]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5159</guid>
		<description><![CDATA[Mike Driscoll and his Metamarkets colleagues organized a bit of a bash Thursday night. Among the many folks I chatted with were Ken Rudin of Zynga, Sam Shah of LinkedIn, and D. J. Patil, late of LinkedIn. I now know more about analytic data management at Zynga and LinkedIn, plus some bonus stuff on LinkedIn&#8217;s [...]]]></description>
			<content:encoded><![CDATA[<p>Mike Driscoll and his <a href="http://www.metamarketsgroup.com/">Metamarkets</a> colleagues organized a bit of a <a href="http://yfrog.com/h8msmkqj">bash</a> Thursday night. Among the many folks I chatted with were Ken Rudin of Zynga, Sam Shah of LinkedIn, and D. J. Patil, late of LinkedIn. I now know more about analytic data management at Zynga and LinkedIn, plus some bonus stuff on LinkedIn&#8217;s People You May Know application. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>It&#8217;s blindingly obvious that Zynga is one of <a href="../../../../../2011/06/20/columnar-dbms-vendor-customer-metrics/">Vertica&#8217;s petabyte-scale customers</a>, given that Zynga sends 5 TB/day of data into Vertica, and keeps that data for about a year. (Zynga may retain even more data going forward; in particular, Zynga regrets ever having thrown out the first month of data for any game it&#8217;s tried to launch.) This is game actions, for the most part, rather than log files; true logs generally go into Splunk.</p>
<p><em>I don&#8217;t know whether the missing data is completely thrown away, or just stashed on inaccessible tapes somewhere.</em></p>
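<p>The petabyte-scale inference above is simple arithmetic, sketched below. The 5 TB/day figure comes from the post; reading &#8220;about a year&#8221; as 365 days and using decimal units (1 PB = 1,000 TB) are assumptions made only for illustration. The result is raw ingest volume, so it says nothing about compression or replication inside the database.</p>

```python
# Back-of-envelope check of the petabyte-scale claim.
TB_PER_DAY = 5          # ingest rate stated in the post
RETENTION_DAYS = 365    # assumption: "about a year" read as 365 days
TB_PER_PB = 1000        # assumption: decimal units

total_tb = TB_PER_DAY * RETENTION_DAYS
total_pb = total_tb / TB_PER_PB

print(f"{total_tb} TB retained, roughly {total_pb:.1f} PB of raw data")
```

<p>Under these assumptions the retained raw volume is 1,825 TB, a little under 2 PB.</p>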
<p>I found two aspects of the Zynga story particularly interesting. First, those 5 TB/day are going straight into Vertica (from, I presume, <a href="http://www.dbms2.com/2010/08/18/nosql-hvsp-adoption/">memcached/Membase/Couchbase</a>), as Zynga decided that sending the data to some kind of log first was more trouble than it&#8217;s worth. Second, there&#8217;s Zynga&#8217;s approach to analytic database design. Highlights of that include: <span id="more-5159"></span></p>
<ul>
<li>Data is divided into two parts. One part has a pretty ordinary schema; the other is just stored as a huge list of name-value pairs. (This is much like <a href="../../../../../2010/10/06/ebay-followup-greenplum-out-teradata-10-petabytes-hadoop-has-some-value-and-more/">eBay</a>&#8217;s approach with its Teradata-based Singularity, except that eBay puts the name-value pairs into long character strings.) About half the data is in each part, but I don&#8217;t think that&#8217;s by deliberate choice.</li>
<li>Zynga adds data into the real schema when it&#8217;s clear it will be needed for a while. This isn&#8217;t a matter of query volumes, for the most part; rather, it&#8217;s when Zynga&#8217;s tests (e.g. of new games?) have determined that the data will keep being collected and used for a while.</li>
<li>Zynga only adds columns to its analytic database; it never goes through the more complex process of deleting them.</li>
</ul>
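<p>The design pattern in the list above (a conventional schema alongside a name-value store, with attributes promoted to real columns on an add-only basis) can be sketched roughly as follows. This is a minimal illustration using Python&#8217;s built-in sqlite3 module rather than Vertica, and every table, column, and function name in it is hypothetical; it is not Zynga&#8217;s actual schema or process.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical names throughout: one table with an ordinary schema,
# one table holding everything else as name-value pairs.
conn.executescript("""
    CREATE TABLE events (event_id INTEGER PRIMARY KEY, player_id INTEGER, action TEXT);
    CREATE TABLE event_attrs (event_id INTEGER, name TEXT, value TEXT);
""")
conn.execute("INSERT INTO events VALUES (1, 42, 'plant_crop')")
conn.execute("INSERT INTO events VALUES (2, 43, 'harvest')")
conn.executemany(
    "INSERT INTO event_attrs VALUES (?, ?, ?)",
    [(1, "crop_type", "corn"), (2, "crop_type", "wheat"), (2, "coins", "15")],
)


def promote(conn, name):
    """Promote a name-value attribute to a real column (add-only; nothing is dropped)."""
    # Column names cannot be bound as SQL parameters, so validate before interpolating.
    if not name.isidentifier():
        raise ValueError("unsafe column name")
    conn.execute(f"ALTER TABLE events ADD COLUMN {name} TEXT")
    # Backfill the new column from the existing name-value pairs.
    conn.execute(
        f"UPDATE events SET {name} = (SELECT value FROM event_attrs a "
        "WHERE a.event_id = events.event_id AND a.name = ?)",
        (name,),
    )


# Once it is clear that crop_type will keep being collected, give it a column.
promote(conn, "crop_type")

rows = conn.execute(
    "SELECT event_id, action, crop_type FROM events ORDER BY event_id"
).fetchall()
attr_count = conn.execute("SELECT COUNT(*) FROM event_attrs").fetchone()[0]
print(rows)
```

<p>In this sketch the promoted values stay in the name-value table as well, which mirrors the add-only approach: the schema only grows, so no query that worked before a promotion stops working after it.</p>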
<p>Just as Zynga is one of Vertica&#8217;s flagship accounts, LinkedIn is one of Aster Data&#8217;s. Specifically, before leaving LinkedIn for Aster, Jonathan Goldman built LinkedIn&#8217;s People You May Know feature in Aster nCluster. This was long ago, and I&#8217;m not sure how sophisticated his use of <a href="../../../../../2009/03/07/three-greenplum-customers-applications-of-mapreduce/">SQL and MapReduce</a> would be in today&#8217;s terms; for example, I was told he didn&#8217;t use &#8220;nPath or anything like that.&#8221; <em>(Edit: See the comments below for clarifications from Jonathan.) </em>Anyhow, LinkedIn has replaced Aster for PYMK with Hadoop, and in my opinion is getting much better results.</p>
<p>That, from an Aster standpoint, is the bad news. The good news is that LinkedIn is happily using Aster nCluster for several other applications; LinkedIn folks don&#8217;t seem to regret throwing out* Greenplum for Aster; and they also seem to have a very high opinion of Jonathan and his work while he was there.</p>
<p><em>*And <a href="http://www.dbms2.com/2010/10/06/ebay-followup-greenplum-out-teradata-10-petabytes-hadoop-has-some-value-and-more/">this time</a> that is indeed the phrase that was used. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </em></p>
<p>One thing that astonished me is that LinkedIn PYMK is based only on data innate to LinkedIn (as opposed to imported email addresses, the results of web crawls, and so on). Given that, I am at a loss to explain how it suggested a couple of old friends, to whom I have no discernible chain of connection. Yes, we were at Harvard at the same time, but if that&#8217;s all it was, there would be a huge number of false positives I&#8217;m not actually seeing.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/09/05/zynga-linkedin-data-warehous/feed/</wfw:commentRss>
		<slash:comments>24</slash:comments>
		</item>
		<item>
		<title>Hadoop futures and enhancements</title>
		<link>http://www.dbms2.com/2011/07/10/hadoop-futures-and-enhancements/</link>
		<comments>http://www.dbms2.com/2011/07/10/hadoop-futures-and-enhancements/#comments</comments>
		<pubDate>Mon, 11 Jul 2011 03:14:24 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Hadapt]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapR]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Zettaset]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4944</guid>
		<description><![CDATA[Hadoop is immature technology. As such, it naturally offers much room for improvement in both industrial-strengthness and performance. And since Hadoop is booming, multiple efforts are underway to fill those gaps. For example: Cloudera&#8217;s proprietary code is focused on management, set-up, etc. The &#8220;Phase 1&#8221; plans Hortonworks shared with me for Apache Hadoop are focused [...]]]></description>
			<content:encoded><![CDATA[<p>Hadoop is immature technology. As such, it naturally offers much room for improvement in both <strong>industrial-strengthness</strong> and <strong>performance.</strong> And since Hadoop is booming, multiple efforts are underway to fill those gaps. For example:</p>
<ul>
<li>Cloudera&#8217;s proprietary code is focused on management, set-up, etc.</li>
<li>The &#8220;Phase 1&#8221; plans Hortonworks shared with me for Apache Hadoop are focused on industrial-strengthness, as are significant parts of &#8220;Phase 2&#8221;.*</li>
<li>MapR tells a performance story versus generic Apache Hadoop HDFS and MapReduce. (One aspect of same is just C++ vs. Java.)</li>
<li>So does <a href="../../../../../2011/07/06/hadapt-update/">Hadapt</a>, but mainly vs. Hive.</li>
<li>Cloudera also tells me there&#8217;s a potential 4-5X performance improvement in Hive coming down the pike from what amounts to an optimizer rewrite.</li>
</ul>
<p>(Zettaset belongs in the discussion too, but made an unfortunate choice of embargo date.)</p>
<p><span id="more-4944"></span><em>*Hortonworks, <a href="http://www.dbms2.com/2011/07/10/cloudera-and-hortonworks/">a new Hadoop company spun out of Yahoo</a>, graciously permitted me to post a <a href="http://www.monash.com/uploads/Hortonworks-Apache-Hadoop-July-2011.pptx">slide deck</a> outlining an Apache Hadoop roadmap. Phase 1 refers to stuff that is underway more or less now. Phase 2 is scheduled for alpha in October 2011, with production availability not too late in 2012.</em></p>
<p>You&#8217;ve probably heard some <strong>single point of failure</strong> fuss. Hadoop NameNodes can crash, which wouldn&#8217;t cause data loss, but would shut down the cluster for a little while. It&#8217;s hard to come up with real-life stories in which this has been a problem; still, it&#8217;s something that should be fixed, and everybody (including the Apache Hadoop folks, as part of Phase 2) has a favored solution. A more serious problem is that Hadoop is currently bad for <strong>small updates,</strong> because:</p>
<ul>
<li>Hadoop&#8217;s fundamental paradigm assumes batch processing.</li>
<li>Both major workarounds to allow small updates are broken:
<ul>
<li>HBase is seriously buggy, to the point that it sometimes loses data.</li>
<li>Storing each update in a separate file runs afoul of a practical limit of 70-100 million files.</li>
</ul>
</li>
</ul>
<p><strong>File-count limits</strong> also get blamed for a second problem, in that there may not be enough intermediate files allowed for your Reduce steps, necessitating awkward and perhaps poorly-performing MapReduce workarounds. Anyhow, the Phase 2 Apache Hadoop roadmap features a serious <strong>HBase rewrite.</strong> I&#8217;m less clear as to where things stand with respect to file-count limits.</p>
<p><em>Edits: As per the comments below, I should perhaps have referred to HBase&#8217;s HDFS underpinnings rather than HBase itself. Anyhow, some details are in the slides. Please also see my follow-up post on <a href="http://www.dbms2.com/2011/07/18/hbase-is-not-broken/">how well HBase is indeed doing</a>.<br />
</em></p>
<p>The other big area for Hadoop improvement is <strong>modularity, pluggability, and coexistence</strong>, on both the <strong>storage</strong> and <strong>application execution</strong> tiers. For example:</p>
<ul>
<li>Greenplum/MapR and Hadapt both think you should have HDFS file management and relational DBMS coexisting on the same storage nodes. (I agree.)</li>
<li>Part of what Hortonworks calls &#8220;Phase 2&#8221; sets out to ensure that Hadoop can properly manage <a href="../2010/08/16/vertica-flash-temp-space/">temp space</a> and so on next to HDFS.</li>
<li>Perhaps HBase won&#8217;t always assume HDFS.</li>
<li>DataStax thinks you should <a href="http://www.dbms2.com/2011/03/23/datastax-cassandrafs-hadoop-brisk/">blend HDFS and Cassandra</a>.</li>
</ul>
<p>Meanwhile, Pig and Hive need to come closer together. Often you want to stream data into Hadoop. The argument that <a href="http://www.dbms2.com/2011/04/21/sas-hpa-does-make-sense-after-all/">MPI trumps MapReduce</a> does, in certain use cases, make sense. Apache Hadoop &#8220;Phase 2&#8221; and beyond are charted to accommodate some of those possibilities too.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/10/hadoop-futures-and-enhancements/feed/</wfw:commentRss>
		<slash:comments>20</slash:comments>
		</item>
		<item>
		<title>Eight kinds of analytic database (Part 2)</title>
		<link>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/</link>
		<comments>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/#comments</comments>
		<pubDate>Tue, 05 Jul 2011 08:18:18 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Archiving and information preservation]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Buying processes]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Complex event processing (CEP)]]></category>
		<category><![CDATA[Data mart outsourcing]]></category>
		<category><![CDATA[Data types]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MOLAP]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Rainstor]]></category>
		<category><![CDATA[SAND Technology]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[SenSage]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4867</guid>
		<description><![CDATA[In Part 1 of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I&#8217;ll cover four more kinds of analytic database &#8212; even newer, for the most part, with a use case/product short list [...]]]></description>
			<content:encoded><![CDATA[<p>In <a href="http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/">Part 1</a> of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I&#8217;ll cover four more kinds of analytic database &#8212; even newer, for the most part, with a use case/product short list match that is even less clear.  <span id="more-4867"></span></p>
<p><strong><em>Bit bucket</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included: </em>Logs, other technical/external</li>
<li><em>Likely use styles:</em> Staging/ETL, investigative</li>
<li><em>Canonical example: </em>Log files in a Hadoop cluster</li>
<li><em>Stresses:</em> TCO, scale-out, transform/big-query performance, ETL functionality</li>
</ul>
<p>With the explosion of <a href="../../../../../2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a> has come the need for a place to put it all, sometimes called the <a href="../../../../../2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a>. This is like the investigative data mart for big databases, but more <a href="../../../../../2011/05/17/poly-structured-database/">poly-structured</a>. In some cases it is focused on data staging and transformation; but it can also be used for analysis in place.</p>
<p>The list of candidate technologies to run your bit bucket starts with Hadoop and Splunk.</p>
<p><strong><em>Archival data store</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included: </em>Operational, CDR (call detail record), security log</li>
<li><em>Likely use styles:</em> Archival, reporting (for compliance), possibly also investigative</li>
<li><em>Examples:</em> Any long-term detailed historical store</li>
<li><em>Stresses: </em>TCO, compression, scale-out, performance (if multi-use)</li>
</ul>
<p>Analytic DBMS vendors have been insulting each other with the claim &#8220;that&#8217;s just an archival data store,&#8221; dating back at least to the first time Greenplum was deployed on an underpowered Sun Thumper system. Perhaps only <a href="../../../../../2010/06/11/rainstor-update/">Rainstor</a> truly embraces the archival positioning, and I&#8217;ve become pretty dubious about their technical claims and their company alike.</p>
<p>Still, there&#8217;s a legitimate need for data stores, especially relational analytic DBMS, that:</p>
<ul>
<li>Store data cheaply, with high rates of compression.</li>
<li>Have decent performance if you do want to query the data.</li>
<li>May have archiving/compliance-specific features as well.</li>
</ul>
<p>Along with Rainstor, SAND and SenSage have at least partially targeted that use case. In addition, appliance vendors such as Teradata and Netezza try to have an archive-oriented product version in their lineups.</p>
<p><strong><em>Outsourced data mart</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All</li>
<li><em>Likely use styles:</em> Traditional BI, investigative analytics, staging/ETL</li>
<li><em>Examples:</em> Advertising tracking, SaaS CRM</li>
<li><em>Stresses:</em> Performance, TCO, reliability, concurrency</li>
</ul>
<p>Much of what happens in analytic database management can also be outsourced. Some applications that run via SaaS (Software as a Service) are analytic. I&#8217;ve had three different clients whose main business is picking marketing targets in various vertical segments; others who wanted to add analytics to what were historically OLTP applications; and others yet who just offered online business intelligence. Also, if your fundamental business is gathering data and reselling it to a variety of user organizations, that&#8217;s an analytic data management challenge. The possibilities expand from there.</p>
<p>Data outsourcers are in the IT business, and so their IT development is &#8212; hopefully! &#8212; more serious and less politically encumbered than at many conventional enterprises. Thus, legacy systems and master data management issues are commonly less prevalent, or at least more aggressively disposed of. The same, up to a point, goes for vendor politics.* <a href="../../../../../2011/06/26/what-to-think-about-before-you-make-a-technology-decision/">Multitenancy</a> is commonly an issue, as is running in the cloud.</p>
<p><em>*Even so, there&#8217;s often That Guy who doesn&#8217;t want to migrate away from Oracle, no matter what.</em></p>
<p>Vertica gets the nod in a number of these cases; it&#8217;s cloud-friendly, and often the problem is naturally columnar. Other columnar products can be good choices too, with added brownie points for Infobright if the shop is MySQL-oriented anyway. Running Netezza or other appliances makes sense mainly if you&#8217;re pretty sure you want to keep operating your own data centers, but some data outsourcers are just fine with that assumption.</p>
<p><strong><em>Operational analytic(s) server</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> Customer-centric, log, financial trade</li>
<li><em>Likely use styles:</em> Advanced operational analytics</li>
<li><em>Examples:</em>
<ul>
<li>Lower latency: Web or call-center personalization, anti-fraud</li>
<li>Higher latency: Customer profiling, Basel 3 risk analysis</li>
</ul>
</li>
<li><em>Stresses:</em> Performance, reliability, analytic functionality, perhaps concurrency</li>
</ul>
<p>Even with eight different choices, I need a &#8220;catch-all&#8221; category; this is it.</p>
<p>Suppose you want to do reasonably sophisticated analytics, then use the results in operations. This is the classical challenge in <a href="../../../../../2011/03/30/short-request-and-analytic-processing/">integrating short-request and analytic processing</a>. There are multiple ways to tackle it, embodying different trade-offs in cost, convenience, or analytic accuracy. If the platform on which you want to run your investigative analytics also has the reliability and concurrency appropriate for mission-critical operations, you&#8217;re set. Otherwise, you may want to pipe <a href="../../../../../2010/11/29/data-that-is-derived-augmented-enhanced-adjusted-or-cooked/">derived data</a> into a more &#8220;industrial-strength&#8221; DBMS, ideally the one that runs your operational apps anyway.</p>
<p>Another option is to integrate a limited amount of analytics immediately into your short-request processing system. For example, as bad as they are at the kinds of queries that require joins, NoSQL systems are often fast at simple aggregations. As MapReduce/NoSQL integrations mature, that option may not require pumping the data anywhere else for deeper analytics; even if it does, at least you&#8217;re starting out with the data in a convenient bit bucket.</p>
<p>Streaming/CEP-centric architectures could come into play as well. And it goes on from there. The possibilities in this last category are just too varied to generalize about.</p>
<p><em>So did I get them all? Or are there yet other analytic data management use cases that don&#8217;t fit into my eight categories?</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Eight kinds of analytic database (Part 1)</title>
		<link>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/</link>
		<comments>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/#comments</comments>
		<pubDate>Tue, 05 Jul 2011 08:17:44 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Benchmarks and POCs]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Buying processes]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[Infobright]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MOLAP]]></category>
		<category><![CDATA[Microsoft and SQL*Server]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[ParAccel]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Pricing]]></category>
		<category><![CDATA[QlikTech and QlikView]]></category>
		<category><![CDATA[SAND Technology]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Sybase]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[Workload management]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4868</guid>
		<description><![CDATA[Analytic data management technology has blossomed, leading to many questions along the lines of &#8220;So which products should I use for which category of problem?&#8221; The old EDW/data mart dichotomy is hopelessly outdated for that purpose, and adding a third category for &#8220;big data&#8221; is little help. Let&#8217;s try eight categories instead. While no categorization [...]]]></description>
			<content:encoded><![CDATA[<p>Analytic data management technology has blossomed, leading to many questions along the lines of &#8220;So which products should I use for which category of problem?&#8221; The old EDW/data mart dichotomy is hopelessly outdated for that purpose, and adding a third category for &#8220;big data&#8221; is little help.</p>
<p>Let&#8217;s try eight categories instead. While <a href="http://www.strategicmessaging.com/no-market-categorization-is-ever-precise/2011/03/01/">no categorization is ever perfect</a>, these each have at least some degree of technical homogeneity. Figuring out which types of analytic database you have or need &#8212; and in most cases you&#8217;ll need several &#8212; is a great early step in your analytic technology planning.  <span id="more-4868"></span></p>
<p><strong><em>Enterprise data warehouse</em></strong> (Full or partial)</p>
<ul>
<li><em>Kinds of data likely to be included:</em> All, but especially operational</li>
<li><em>Likely use styles:</em> All</li>
<li><em>Canonical example:</em> Central EDW for a big enterprise</li>
<li><em>Stresses:</em> Concurrency, reliability, workload management</li>
</ul>
<p>The enterprise data warehouse (EDW) ideal says that you copy all your data into one place, and drive all decision-making from there. <a href="../../../../../2011/06/21/its-official-the-grand-central-edw-will-never-happen/">Full EDWs are pipedreams</a>. Still, a partial EDW makes sense for most large enterprises, and many indeed already have one. The first product lines to consider for classical EDWs are Teradata, DB2, Exadata, and maybe Microsoft SQL Server, especially if you&#8217;re going to stress concurrency and/or operational use cases.</p>
<p><strong><em>Traditional data mart</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All</li>
<li><em>Likely use styles:</em> Business intelligence, budgeting/consolidation, investigative</li>
<li><em>Examples:</em> Reporting servers, planning/consolidation servers, anything MOLAP, etc.</li>
<li><em>Stresses:</em> Performance, concurrency, TCO</li>
</ul>
<p>Whether or not you have something like an enterprise data warehouse, it&#8217;s common to have lighter-weight data marts as well. A traditional data mart might drive reports and dashboards. Or it might be specialized for budgeting, planning, and/or consolidation.  Some <a href="../../../../../2011/03/03/investigative-analytics/">investigative analytics</a> may be in the mix as well.</p>
<p>Any DBMS that can support an EDW can also support a data mart, but it may not be the most cost-effective way to do so. Columnar DBMS might have more attractive performance and TCO (Total Cost of Ownership); the same goes for Netezza. Some of them &#8212; e.g. Sybase IQ and <a href="../../../../../2011/06/20/vertica-release-5/">Vertica</a> &#8212; have excellent track records in concurrent usage as well. <a href="../../../../../2011/05/29/when-to-use-relational-database-management-system/">Ted Codd</a> pushed what amounts to MOLAP (Multidimensional OnLine Analytic Processing) systems for these use cases. But relational DBMS commonly do a better job, which is one reason most major MOLAP products have wound up at RDBMS companies.</p>
<p><strong><em>Investigative data mart &#8212; agile</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All, especially customer-centric</li>
<li><em>Likely use styles</em>: Investigative</li>
<li><em>Canonical example:</em> A few analysts getting a few TB to examine</li>
<li><em>Stresses:</em> Ease of setup/load, ease of admin, price/performance</li>
</ul>
<p>Besides the traditional data mart, there are at least two other kinds. Both are focused on investigative analytics, but they&#8217;re differentiated by database size.</p>
<p>If you have just a few analysts,* looking at no more than a few terabytes of data (perhaps even just some gigabytes) &#8212; and if that data is &#8220;single-subject&#8221; and fairly homogeneous &#8212; your watchwords should be &#8220;cheap&#8221;, &#8220;easy&#8221;, and &#8220;fast&#8221;. You don&#8217;t need to invest in much hardware, expensive software, or much administrative effort (the analysts can be their own DBAs), nor should you endure much set-up time. Just grab a product, grab some data, and start running queries (or extracts into the statistical tool of your choice).</p>
<p><em>*If you have dozens or even hundreds of analysts hitting the same database, you&#8217;re probably back to the more concurrency-oriented scenarios outlined above.</em></p>
<p>Infobright is often cost-effective among columnar analytic DBMS. Other vendors might cut you a price break as well. If you have multiple terabytes of data, don&#8217;t rule out Netezza&#8217;s lowest-end products (even if they&#8217;d really rather sell you something bigger). Or, if you&#8217;re in the sub-terabyte range, maybe you can get by with an in-memory BI tool such as QlikView, and not do anything special on the DBMS side at all.</p>
<p><strong><em>Investigative data mart &#8212; big</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All, especially customer-centric, logs, financial trade, scientific</li>
<li><em>Likely use styles</em>: Investigative</li>
<li><em>Canonical example:</em> Single-subject 20 TB &#8211; 20 PB relational database</li>
<li><em>Stresses:</em> Performance, scale-out, analytic functionality</li>
</ul>
<p>But if you&#8217;re looking at tens of terabytes of relational data, or even more, you really do have a &#8220;big data&#8221; problem. Performance and scalability are major challenges, usually best addressed by MPP (Massively Parallel Processing) systems, such as Netezza, Vertica, Aster Data, ParAccel, Teradata, or Greenplum. Performance POCs (Proofs Of Concept) are a big part of the buying process. Vendor price negotiations are crucial too.</p>
<p><em>Actually, in the low tens of terabytes you might be able to get away with a shared-disk system that has excellent compression &#8212; e.g., columnar products like Sybase IQ, Infobright, or SAND, rather than just Vertica and ParAccel.</em></p>
<p>Assuming you have affordable, scalable query performance, the competitive differentiator can switch to additional analytic functionality. Aster, Netezza, ParAccel, Vertica, and Greenplum either offer full <a href="../../../../../2011/02/24/analytic-platforms/">analytic platforms</a>, or seem to be on the path to doing so. Teradata, which now owns Aster Data, offers substantial built-in analytic capability in its traditional products as well, and the same goes for Sybase IQ.</p>
<p><em>Continued in <a href="http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/">Part 2</a>, where we cover some of the more difficult use cases.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>

