<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; Buying processes</title>
	<atom:link href="http://www.dbms2.com/category/buying-processes/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Thu, 09 Feb 2012 01:51:16 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>Exasol update</title>
		<link>http://www.dbms2.com/2011/11/12/exasol-update/</link>
		<comments>http://www.dbms2.com/2011/11/12/exasol-update/#comments</comments>
		<pubDate>Sun, 13 Nov 2011 02:37:13 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Benchmarks and POCs]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Exasol]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Pricing]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Sybase]]></category>
		<category><![CDATA[Workload management]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5661</guid>
		<description><![CDATA[I last wrote about Exasol in 2008. After talking with the team Friday, I&#8217;m fixing that now. The general theme was as you&#8217;d expect: Since last we talked, Exasol has added some new management, put some effort into sales and marketing, got some customers, kept enhancing the product and so on. Top-level points included: Exasol&#8217;s [...]]]></description>
			<content:encoded><![CDATA[<p><a href="../../../../../2008/08/16/exasol-technical-briefing/">I last wrote about Exasol in 2008</a>. After talking with the team Friday, I&#8217;m fixing that now. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  The general theme was as you&#8217;d expect: Since last we talked, Exasol has added some new management, put some effort into sales and marketing, got some customers, kept enhancing the product and so on.</p>
<p>Top-level points included:</p>
<ul>
<li>Exasol&#8217;s technical philosophy is substantially the same as before, albeit not with as extreme a focus on fitting everything in RAM.</li>
<li>Exasol believes its flagship DBMS EXASolution has great performance on a load-and-go basis.</li>
<li>Exasol has 25 EXASolution customers, all in Germany.*</li>
<li>5 of those are &#8220;cloud&#8221; customers, at hosting providers engaged by Exasol.</li>
<li>EXASolution database sizes now range from the low 100s of gigabytes up to 30 terabytes.</li>
<li>Pretty much the whole company is in Nuremberg.</li>
</ul>
<p><span id="more-5661"></span><em>*That excludes some money from Hitachi. Exasol&#8217;s Hitachi partnership is still in limbo, an apparent casualty of the world economic crisis.</em></p>
<p>On the technical side:</p>
<ul>
<li>As noted in my 2008 post, EXASolution is a columnar, no-head-node MPP (Massively Parallel Processing) DBMS.</li>
<li>The main way EXASolution compresses data is via dictionary/tokenization. 5:1 is a typical compression ratio before mirroring and so on, out of a 2-10:1 range.</li>
<li>EXASolution writes data to blocks in memory that are smaller than what is otherwise its preferred size (1/2 to 5 megabytes). These are sent to disk, where merge eventually happens. Exasol insists that write performance has always been fully satisfactory to customers to date.</li>
<li>EXASolution doesn&#8217;t have much in the way of performance tuning knobs. Exasol says they aren&#8217;t needed, and says that one really can start an EXASolution POC (Proof of Concept) in a day or so.</li>
<li>EXASolution doesn&#8217;t have much in the way of workload management capabilities, except what&#8217;s automagic (e.g., short query bias). However, it does collect statistics you can query via your favorite BI tool.</li>
<li>EXASolution doesn&#8217;t have much in the way of <a href="../../../../../2011/02/24/analytic-platforms/">analytic platform</a> capabilities, although there is some Lua-based scripting. However, there&#8217;s something NDA in the analytic platform area Coming Soon.*</li>
</ul>
<p>In general, the whole thing sounds somewhat like ParAccel, at least at a high level.</p>
<p><em>*Exasol is not and never has been our client, but we can keep secrets for them even so.</em></p>
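<p>As a rough illustration of the dictionary/tokenization compression mentioned above, here is a minimal Python sketch. It is not Exasol&#8217;s implementation; the function names and the sample column are hypothetical, and it only shows the general idea of storing each distinct value once and replacing the column with small integer tokens.</p>

```python
def dictionary_encode(values):
    """Replace each distinct value with a small integer token.

    Hypothetical helper for illustration only.
    """
    dictionary = {}  # maps value -> token, in order of first appearance
    tokens = []
    for value in values:
        # setdefault returns the existing token, or assigns the next free one
        token = dictionary.setdefault(value, len(dictionary))
        tokens.append(token)
    return dictionary, tokens


def dictionary_decode(dictionary, tokens):
    """Rebuild the original column from the dictionary and the token list."""
    reverse = {token: value for value, token in dictionary.items()}
    return [reverse[token] for token in tokens]


# Made-up sample column with repeated values
column = ["Nuremberg", "Berlin", "Nuremberg", "Munich", "Berlin", "Nuremberg"]
dictionary, tokens = dictionary_encode(column)
restored = dictionary_decode(dictionary, tokens)
```

<p>How much this saves depends on how repetitive a column is, which fits with compression ratios being quoted as a range rather than a single figure.</p>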
<p>Naturally, Exasol believes EXASolution has fine concurrency, with at least one customer routinely running 2000 concurrent users, 200 concurrent sessions (via connection pooling), and 5-10 concurrent queries. Another customer has 3500 Cognos users. 100-200 concurrent queries appears to be the record peak load. Anyhow, Exasol says that plans to offer real workload management could be accelerated if a need were discovered.</p>
<p>Exasol says it almost never loses POCs, but admits that it competes fairly rarely against Vertica and ParAccel, no doubt for reasons of geography. Exasol boasts one visible Sybase IQ replacement (Sony Music).</p>
<p>While Exasol&#8217;s sales to date have been in Germany, there are plans to change that soon. At least one sales cycle is well underway in Eastern Europe. Offices in other Germanic countries are planned. Existing customers are planning to deploy additional copies outside Germany. Discussions are underway regarding other geographies, e.g. English-speaking ones.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/12/exasol-update/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Eight kinds of analytic database (Part 2)</title>
		<link>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/</link>
		<comments>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/#comments</comments>
		<pubDate>Tue, 05 Jul 2011 08:18:18 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Archiving and information preservation]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Buying processes]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Complex event processing (CEP)]]></category>
		<category><![CDATA[Data mart outsourcing]]></category>
		<category><![CDATA[Data types]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MOLAP]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Rainstor]]></category>
		<category><![CDATA[SAND Technology]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[SenSage]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4867</guid>
		<description><![CDATA[In Part 1 of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I&#8217;ll cover four more kinds of analytic database &#8212; even newer, for the most part, with a use case/product short list [...]]]></description>
			<content:encoded><![CDATA[<p>In <a href="http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/">Part 1</a> of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I&#8217;ll cover four more kinds of analytic database &#8212; even newer, for the most part, with a use case/product short list match that is even less clear.  <span id="more-4867"></span></p>
<p><strong><em>Bit bucket</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included: </em>Logs, other technical/external</li>
<li><em>Likely use styles:</em> Staging/ETL, investigative</li>
<li><em>Canonical example: </em>Log files in a Hadoop cluster</li>
<li><em>Stresses:</em> TCO, scale-out, transform/big-query performance, ETL functionality</li>
</ul>
<p>With the explosion of <a href="../../../../../2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a> has come the need for a place to put it all, sometimes called the <a href="../../../../../2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a>. This is like the investigative data mart for big databases, but more <a href="../../../../../2011/05/17/poly-structured-database/">poly-structured</a>. In some cases it is focused on data staging and transformation; but it can also be used for analysis in place.</p>
<p>The list of candidate technologies to run your bit bucket starts with Hadoop and Splunk.</p>
<p><strong><em>Archival data store</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included: </em>Operational, CDR (call detail record), security log</li>
<li><em>Likely use styles:</em> Archival, reporting (for compliance), possibly also investigative</li>
<li><em>Examples:</em> Any long-term detailed historical store</li>
<li><em>Stresses: </em>TCO, compression, scale-out, performance (if multi-use)</li>
</ul>
<p>Analytic DBMS vendors have been insulting each other with the claim &#8220;that&#8217;s just an archival data store,&#8221; dating back at least to the first time Greenplum was deployed on an underpowered Sun Thumper system. Perhaps only <a href="../../../../../2010/06/11/rainstor-update/">Rainstor</a> truly embraces the archival positioning, and I&#8217;ve become pretty dubious about their technical claims and their company alike.</p>
<p>Still, there&#8217;s a legitimate need for data stores, especially relational analytic DBMS, that:</p>
<ul>
<li>Store data cheaply, with high rates of compression.</li>
<li>Have decent performance if you do want to query the data.</li>
<li>May have archiving/compliance-specific features as well.</li>
</ul>
<p>Along with Rainstor, SAND and SenSage have at least partially targeted that use case. In addition, appliance vendors such as Teradata and Netezza try to have an archive-oriented product version in their lineups.</p>
<p><strong><em>Outsourced data mart</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All</li>
<li><em>Likely use styles:</em> Traditional BI, investigative analytics, staging/ETL</li>
<li><em>Examples:</em> Advertising tracking, SaaS CRM</li>
<li><em>Stresses:</em> Performance, TCO, reliability, concurrency</li>
</ul>
<p>Much of what happens in analytic database management can also be outsourced. Some applications that run via SaaS (Software as a Service) are analytic. I&#8217;ve had three different clients whose main business is picking marketing targets in various vertical segments; others who wanted to add analytics to what were historically OLTP applications; and others yet who just offered online business intelligence. Also, if your fundamental business is gathering data and reselling it to a variety of user organizations, that&#8217;s an analytic data management challenge. The possibilities expand from there.</p>
<p>Data outsourcers are in the IT business, and so their IT development is &#8212; hopefully! &#8212; more serious and less politically encumbered than at many conventional enterprises. Thus, legacy systems and master data management issues are commonly less prevalent, or at least more aggressively disposed of. The same, up to a point, goes for vendor politics.* <a href="../../../../../2011/06/26/what-to-think-about-before-you-make-a-technology-decision/">Multitenancy</a> is commonly an issue, as is running in the cloud.</p>
<p><em>*Even so, there&#8217;s often That Guy who doesn&#8217;t want to migrate away from Oracle, no matter what.</em></p>
<p>Vertica gets the nod in a number of these cases; it&#8217;s cloud-friendly, and often the problem is naturally columnar. Other columnar products can be good choices too, with added brownie points for Infobright if the shop is MySQL-oriented anyway. Running Netezza or other appliances makes sense mainly if you&#8217;re pretty sure you want to keep operating your own data centers, but some data outsourcers are just fine with that assumption.</p>
<p><strong><em>Operational analytic(s) server</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> Customer-centric, log, financial trade</li>
<li><em>Likely use styles:</em> Advanced operational analytics</li>
<li><em>Examples:</em>
<ul>
<li>Lower latency: Web or call-center personalization, anti-fraud</li>
<li>Higher latency: Customer profiling, Basel 3 risk analysis</li>
</ul>
</li>
<li><em>Stresses:</em> Performance, reliability, analytic functionality, perhaps concurrency</li>
</ul>
<p>Even with eight different choices, I need a &#8220;catch-all&#8221; category; this is it.</p>
<p>Suppose you want to do reasonably sophisticated analytics, then use the results in operations. This is the classical challenge in <a href="../../../../../2011/03/30/short-request-and-analytic-processing/">integrating short-request and analytic processing</a>. There are multiple ways to tackle it, embodying different trade-offs in cost, convenience, or analytic accuracy. If the platform on which you want to run your investigative analytics also has the reliability and concurrency appropriate for mission-critical operations, you&#8217;re set. Otherwise, you may want to pipe <a href="../../../../../2010/11/29/data-that-is-derived-augmented-enhanced-adjusted-or-cooked/">derived data</a> into a more &#8220;industrial-strength&#8221; DBMS, ideally the one that runs your operational apps anyway.</p>
<p>Another option is to integrate a limited amount of analytics immediately into your short-request processing system. For example, as bad as they are at the kinds of queries that require joins, NoSQL systems are often fast at simple aggregations. As MapReduce/NoSQL integrations mature, that option may not require pumping the data anywhere else for deeper analytics; even if it does, at least you&#8217;re starting out with the data in a convenient bit bucket.</p>
<p>Streaming/CEP-centric architectures could come into play as well. And it goes on from there. The possibilities in this last category are just too varied to generalize about.</p>
<p><em>So did I get them all? Or are there yet other analytic data management use cases that I don&#8217;t fit into my eight categories?</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Eight kinds of analytic database (Part 1)</title>
		<link>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/</link>
		<comments>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/#comments</comments>
		<pubDate>Tue, 05 Jul 2011 08:17:44 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Benchmarks and POCs]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Buying processes]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[Infobright]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MOLAP]]></category>
		<category><![CDATA[Microsoft and SQL*Server]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[ParAccel]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Pricing]]></category>
		<category><![CDATA[QlikTech and QlikView]]></category>
		<category><![CDATA[SAND Technology]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Sybase]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[Workload management]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4868</guid>
		<description><![CDATA[Analytic data management technology has blossomed, leading to many questions along the lines of &#8220;So which products should I use for which category of problem?&#8221; The old EDW/data mart dichotomy is hopelessly outdated for that purpose, and adding a third category for &#8220;big data&#8221; is little help. Let&#8217;s try eight categories instead. While no categorization [...]]]></description>
			<content:encoded><![CDATA[<p>Analytic data management technology has blossomed, leading to many questions along the lines of &#8220;So which products should I use for which category of problem?&#8221; The old EDW/data mart dichotomy is hopelessly outdated for that purpose, and adding a third category for &#8220;big data&#8221; is little help.</p>
<p>Let&#8217;s try eight categories instead. While <a href="http://www.strategicmessaging.com/no-market-categorization-is-ever-precise/2011/03/01/">no categorization is ever perfect</a>, these each have at least some degree of technical homogeneity. Figuring out which types of analytic database you have or need &#8212; and in most cases you&#8217;ll need several &#8212; is a great early step in your analytic technology planning.  <span id="more-4868"></span></p>
<p><strong><em>Enterprise data warehouse</em></strong> (Full or partial)</p>
<ul>
<li><em>Kinds of data likely to be included:</em> All, but especially operational</li>
<li><em>Likely use styles:</em> All</li>
<li><em>Canonical example:</em> Central EDW for a big enterprise</li>
<li><em>Stresses:</em> Concurrency, reliability, workload management</li>
</ul>
<p>The enterprise data warehouse (EDW) ideal says that you copy all your data into one place, and drive all decision-making from there. <a href="../../../../../2011/06/21/its-official-the-grand-central-edw-will-never-happen/">Full EDWs are pipedreams</a>. Still, a partial EDW makes sense for most large enterprises, and many indeed already have one. The first product lines to consider for classical EDWs are Teradata, DB2, Exadata, and maybe Microsoft SQL Server, especially if you&#8217;re going to stress concurrency and/or operational use cases.</p>
<p><strong><em>Traditional data mart</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All</li>
<li><em>Likely use styles:</em> Business intelligence, budgeting/consolidation, investigative</li>
<li><em>Examples:</em> Reporting servers, planning/consolidation servers, anything MOLAP, etc.</li>
<li><em>Stresses:</em> Performance, concurrency, TCO</li>
</ul>
<p>Whether or not you have something like an enterprise data warehouse, it&#8217;s common to have lighter-weight data marts as well. A traditional data mart might drive reports and dashboards. Or it might be specialized for budgeting, planning, and/or consolidation.  Some <a href="../../../../../2011/03/03/investigative-analytics/">investigative analytics</a> may be in the mix as well.</p>
<p>Any DBMS that can support an EDW can also support a data mart, but it may not be the most cost-effective way to do so. Columnar DBMS might have more attractive performance and TCO (Total Cost of Ownership); the same goes for Netezza. Some of them &#8212; e.g. Sybase IQ and <a href="../../../../../2011/06/20/vertica-release-5/">Vertica</a> &#8212; have excellent track records in concurrent usage as well. <a href="../../../../../2011/05/29/when-to-use-relational-database-management-system/">Ted Codd</a> pushed what amounts to MOLAP (Multidimensional OnLine Analytic Processing) systems for these use cases. But relational DBMS commonly do a better job, which is one reason most major MOLAP products have wound up at RDBMS companies.</p>
<p><strong><em>Investigative data mart &#8212; agile</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All, especially customer-centric</li>
<li><em>Likely use styles</em>: Investigative</li>
<li><em>Canonical example:</em> A few analysts getting a few TB to examine</li>
<li><em>Stresses:</em> Ease of setup/load, ease of admin, price/performance</li>
</ul>
<p>Besides the traditional data mart, there are at least two other kinds. Both are focused on investigative analytics, but they&#8217;re differentiated by database size.</p>
<p>If you have just a few analysts,* looking at no more than a few terabytes of data (perhaps even just some gigabytes) &#8212; and if that data is &#8220;single-subject&#8221; and fairly homogeneous &#8212; your watchwords should be &#8220;cheap&#8221;, &#8220;easy&#8221;, and &#8220;fast&#8221;. You don&#8217;t need to invest in much hardware, in expensive software, in much administrative effort (the analysts can be their own DBAs), nor should you endure much set-up time. Just grab a product, grab some data, and start running queries (or extracts into the statistical tool of your choice).</p>
<p><em>*If you have dozens or even hundreds of analysts hitting the same database, you&#8217;re probably back to the more concurrency-oriented scenarios outlined above.</em></p>
<p>Infobright is often cost-effective among columnar analytic DBMS. Other vendors might cut you a price break as well. If you have multiple terabytes of data, don&#8217;t rule out Netezza&#8217;s lowest-end products (even if they&#8217;d really rather sell you something bigger). Or, if you&#8217;re in the sub-terabyte range, maybe you can get by with an in-memory BI tool such as QlikView, and not do anything special on the DBMS side at all.</p>
<p><strong><em>Investigative data mart &#8212; big</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All, especially customer-centric, logs, financial trade, scientific</li>
<li><em>Likely use styles</em>: Investigative</li>
<li><em>Canonical example:</em> Single-subject 20 TB &#8211; 20 PB relational database</li>
<li><em>Stresses:</em> Performance, scale-out, analytic functionality</li>
</ul>
<p>But if you&#8217;re looking at tens of terabytes of relational data, or even more, you really do have a &#8220;big data&#8221; problem. Performance and scalability are major challenges, usually best addressed by MPP (Massively Parallel Processing) systems, such as Netezza, Vertica, Aster Data, ParAccel, Teradata, or Greenplum. Performance POCs (Proofs Of Concept) are a big part of the buying process. Vendor price negotiations are crucial too.</p>
<p><em>Actually, in the low tens of terabytes you might be able to get away with a shared-disk system that has excellent compression &#8212; e.g., columnar products like Sybase IQ, Infobright, or SAND, rather than just Vertica and ParAccel.</em></p>
<p>Assuming you have affordable, scalable query performance, the competitive differentiator can switch to additional analytic functionality. Aster, Netezza, ParAccel, Vertica, and Greenplum either offer full <a href="../../../../../2011/02/24/analytic-platforms/">analytic platforms</a>, or seem to be on the path to doing so. Teradata, which now owns Aster Data, offers substantial built-in analytic capability in its traditional products as well, and the same goes for Sybase IQ.</p>
<p><em>Continued in <a href="http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/">Part 2</a>, where we cover some of the more difficult use cases.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>What to think about BEFORE you make a technology decision</title>
		<link>http://www.dbms2.com/2011/06/26/what-to-think-about-before-you-make-a-technology-decision/</link>
		<comments>http://www.dbms2.com/2011/06/26/what-to-think-about-before-you-make-a-technology-decision/#comments</comments>
		<pubDate>Sun, 26 Jun 2011 18:51:06 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Buying processes]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4835</guid>
		<description><![CDATA[When you are considering technology selection or strategy, there are a lot of factors that can each have bearing on the final decision &#8212; a whole lot. Below is a very partial list. In almost any IT decision, there are a number of environmental constraints that need to be acknowledged. Organizations may have standard vendors, [...]]]></description>
			<content:encoded><![CDATA[<p>When you are considering technology selection or strategy, there are a lot of factors that can each have bearing on the final decision &#8212; a whole lot. Below is a very partial list.</p>
<p>In almost any IT decision, there are a number of <strong>environmental constraints</strong> that need to be acknowledged. Organizations may have <strong>standard vendors</strong>, favored vendors, or simply vendors who give them <a href="../../../../../2011/06/24/observations-on-oracle-pricing/">particularly deep discounts</a>. <strong>Legacy systems</strong> are in place, application and system alike, and may or may not be open to replacement. Enterprises may have on-premise or off-premise preferences; SaaS (Software as a Service) vendors probably have <strong>multitenancy</strong> concerns. Your organization can determine which aspects of your system you&#8217;d ideally like to see be tightly <strong>integrated </strong>with each other, and which you&#8217;d prefer to keep only loosely coupled. You may have biases for or against <strong>open-source software.</strong> You may be pro- or anti-<strong>appliance.</strong> Some applications have a substantial need for elastic scaling. And some kinds of issues cut across multiple areas, such as <strong>budget</strong>, <strong>timeframe, security, </strong>or<strong> trained personnel.</strong></p>
<p>Multitenancy is particularly interesting, because it has numerous implications. <span id="more-4835"></span>If you&#8217;re a SaaS vendor supporting multiple customers, you must keep each customer&#8217;s data inaccessible to other users* &#8212; even if you offer high levels of flexibility or customization. You probably also want to keep data logically partitioned by user, in a way that the DBMS recognizes; you may also want that partition to hunt as a pack for caching purposes, especially if no one customer occupies a large part of your database. Administratively, you need a way to measure customer-specific metrics of the sort that might go into SLAs (Service-Level Agreements).</p>
<p><em>*Of course, there are exceptions. One of my clients is a SaaS vendor facilitating commerce; the whole point of their app is to let two different customers see and update the same records.</em></p>
<p>Getting more specific now, I&#8217;m usually called upon to <a href="http://www.monash.com/adviseusers.html">advise users</a> in two categories &#8212; those that already know they want to upgrade analytic functionality, and those that quickly realize they do once I remind them of it. Even so, many organizations struggle with the question &#8220;What do you want to do analytically?&#8221; It&#8217;s tough to blame them, for the question is distressingly circular; <strong>a big part of analytics is figuring out which kinds of analytics are worth doing.</strong> Also, SaaS vendors often struggle with the same question for a different reason, responding &#8220;Well, we know we&#8217;ve only been giving them basic stuff to date. What else do you think they would like?&#8221;</p>
<p>There&#8217;s no perfect solution to those difficulties, but a good way to start the evaluation is by assessing:</p>
<ul>
<li>The<strong> nature and value of your decisions that analytics could reasonably affect.</strong></li>
<li>Your <strong>realistic scope for automation of analytic decisions.</strong></li>
<li>The <strong>number and training of your &#8220;full-time analysts&#8221;</strong> &#8212; statisticians, SQL jocks who can program, SQL jocks who can&#8217;t really program, full-time users of BI tools, whatever.</li>
<li>The <strong>number and training of your &#8220;part-time analysts&#8221;</strong> &#8212; normal business users who can get something out of a dashboard, and perhaps even drill down into it.</li>
</ul>
<p>That should at least tell you which broad categories of analytics you want to engage in, and roughly how advanced in those areas you should try to be.</p>
<p><em>Basic business intelligence/dashboarding? Surely. Visualization-centric BI? If nothing else, it demos well. Basic predictive modeling? Hmm, are you sure nobody will want that? Advanced predictive modeling? Um, are you sure your users can handle that, or that the results will be worth the investment?</em></p>
<p>When I talk with users, there&#8217;s usually a data management problem in the mix too. In such cases, I quickly ask about <strong>data-related metrics</strong>, starting with database size, ingest volumes (batch, if relevant, but especially continuous), and simultaneous query load/concurrent user count. Similarly important are requirements for various kinds of <a href="http://www.dbms2.com/2009/09/10/analytic-speed-latency/">latency</a>, the big two being <strong>query response time</strong> and <strong>how long it takes for data to first be available for query. </strong>Less numeric questions in a similar vein boil down to &#8220;What kinds of requests will you make against the database, in what volume?&#8221;</p>
<p><em>And this loops back to the analytic-user inventory. Suppose you had a near-real-time dashboard &#8212; would anybody actually look at it minute to minute?</em></p>
<p>Specialized metrics I request when considering analytic DBMS include &#8220;How many columns are there in your widest table?&#8221; and &#8220;How many joins &#8212; or lines of SQL &#8212; are there in your most complex query?&#8221;, both of which are tools for assessing &#8220;Is your use case naturally columnar?&#8221;. Another, more general <strong>&#8220;natural structure of data&#8221;</strong> kind of consideration is what structure the data is in before it gets to the database being discussed; candidates include relational batch, XML stream, log file, and many more.</p>
<p>Also crucial are requirements for <strong><a href="http://www.dbms2.com/2010/05/01/ryw-read-your-writes-consistency/">consistency</a>, availability, </strong>and<strong> data integrity.</strong> Those tell you your needs in <strong>high availability </strong>and<strong> disaster recovery,</strong> and perhaps even how picky you have to be about your brands of hardware, software, or cloud/hosting provider. They also indicate how much you should care about relational or ACID properties, and where you should come down on <a href="http://www.dbms2.com/2010/03/12/some-nosql-links/">CAP Theorem</a> trade-offs.</p>
<p><em>I could go on even longer, but those seem like a pretty good set of initial questions with which to start discussions of data management, data integration, and analytic tools and architectures. What do you think I left out? And what do you think I could make substantially clearer by just adding a few more words? Any comments will be much appreciated.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/06/26/what-to-think-about-before-you-make-a-technology-decision/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>The Vertica story (with soundbites!)</title>
		<link>http://www.dbms2.com/2011/06/20/vertica-release-5/</link>
		<comments>http://www.dbms2.com/2011/06/20/vertica-release-5/#comments</comments>
		<pubDate>Mon, 20 Jun 2011 06:14:56 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Benchmarks and POCs]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[ParAccel]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Vertica Systems]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4777</guid>
		<description><![CDATA[I&#8217;ve blogged separately that: Vertica has a bunch of customers, including seven with 1 or more petabytes of data each. Vertica has progressed down the analytic platform path, with Monday&#8217;s release of Vertica 5.0. And of course you know: Vertica (the product) is columnar, MPP, and fast.* Vertica (the company) was recently acquired by HP.** [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve blogged separately that:</p>
<ul>
<li><a href="../../../../../2011/06/20/columnar-dbms-vendor-customer-metrics/">Vertica has a bunch of customers</a>, including <strong>seven with 1 or more petabytes of data each.</strong></li>
<li><a href="http://www.dbms2.com/2011/06/20/vertica-as-an-analytic-platform/">Vertica has progressed down the analytic platform path</a>, with Monday&#8217;s release of Vertica 5.0.</li>
</ul>
<p>And of course you know:</p>
<ul>
<li>Vertica (the product) is columnar, MPP, and fast.*</li>
<li>Vertica (the company) was recently acquired by HP.**</li>
</ul>
<p><span id="more-4777"></span><em>*Similar things seem true of ParAccel, but most of the other serious columnar analytic DBMS aren&#8217;t actually MPP (Massively Parallel Processing) yet. More precisely, they have  shared-everything architectures, especially on the storage level.</em></p>
<p><em>** Vertica says it has a &#8220;staggering&#8221; pipeline now that it&#8217;s been with HP for a few months.  I also gather that the post-merger HP/Vertica appliance product line formally rolled out last week.</em></p>
<p>As for product maturity:</p>
<ul>
<li><a href="../../../../../2010/02/22/vertica-4/">Vertica 4.0</a> cleaned up a lot of stuff.</li>
<li>Vertica 5.0 goes further in a variety of areas, notably clustering administration and database tuning/design.</li>
</ul>
<p>But here&#8217;s something I hadn&#8217;t fully realized &#8212; <strong>Vertica claims concurrent usage as a competitive strength</strong>. By this I mean:</p>
<ul>
<li>Vertica says that it has some customers with 1000s of users, in BI/dashboarding kinds of applications.</li>
<li>Vertica asserts it can support 1000 users on a single appliance rack.</li>
<li>Vertica tries to drive POCs (Proofs Of Concept) towards testing concurrency.</li>
</ul>
<p>This is all consistent with <a href="../../../../../2010/04/16/story-of-an-analytic-dbms-evaluation/">a user example I blogged about last year</a>.</p>
<p>That said, while Vertica introduced respectable workload management features in Vertica 4.0, its main claim to concurrency is simply speed &#8212; if each query ends quickly, you never have to execute all that many of them at once.</p>
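<p>The speed-equals-concurrency argument can be made concrete with Little&#8217;s Law: the average number of queries in flight equals the query arrival rate times the average query duration. What follows is only a rough sketch; the user count, per-user query rate, and query durations are made-up illustrative figures, not Vertica numbers.</p>

```python
def average_concurrent_queries(users, queries_per_user_per_minute, seconds_per_query):
    """Little's Law: mean queries in flight = arrival rate * mean query duration."""
    arrivals_per_second = users * queries_per_user_per_minute / 60.0
    return arrivals_per_second * seconds_per_query

# Hypothetical workload: 1000 dashboard users, each issuing 2 queries per minute.
slow = average_concurrent_queries(1000, 2, seconds_per_query=3.0)
fast = average_concurrent_queries(1000, 2, seconds_per_query=0.3)

# Same users, same request rate; only the per-query speed differs.
print(round(slow, 1), round(fast, 1))
```

<p>Under those assumed figures, making each query ten times faster cuts the average number of simultaneously executing queries by the same factor of ten, with no workload management involved at all.</p>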
<p>Anyhow, there will be (or at least should be) articles written about Vertica 5.0, and I may not be that easy to find for comment, what with <a href="../../../../../2011/06/19/investigative-analytics-derived-data/">Enzee Universe</a> and all. So here are a few <strong>Vertica soundbites:</strong></p>
<ul>
<li>Having seven petabyte-level commercial users is an impressive testament to Vertica&#8217;s scalability. I think only Teradata could best that number among analytic DBMS, unless you want to count Hadoop/Hive.</li>
<li>Vertica&#8217;s analytic platform capabilities are new, and initially not as rich as <a href="../../../../../2010/02/22/aster-data-ncluster-4-5/">Aster Data&#8217;s</a> or <a href="../../../../../2011/04/17/netezza-twinfin-i-class-overview/">Netezza&#8217;s</a>, especially in the area of language support. But they&#8217;re a good first step.</li>
<li>Judging by the examples of EMC/Greenplum and IBM/Netezza, Vertica&#8217;s honeymoon period at HP is likely to last for a while. <em>(Edit: That said, not all is peachy at <a href="http://www.dbms2.com/2011/04/16/unpacking-the-emc-greenplum-q1-sales-disaster-rumors/">EMC/Greenplum</a>.)</em></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/06/20/vertica-release-5/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Comments on the Gartner 2010/2011 Data Warehouse Database Management Systems Magic Quadrant</title>
		<link>http://www.dbms2.com/2011/02/05/gartner-magic-quadrant-data-warehouse-database-management-2010/</link>
		<comments>http://www.dbms2.com/2011/02/05/gartner-magic-quadrant-data-warehouse-database-management-2010/#comments</comments>
		<pubDate>Sat, 05 Feb 2011 15:49:39 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[1010data]]></category>
		<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Benchmarks and POCs]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[EMC]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Infobright]]></category>
		<category><![CDATA[Ingres]]></category>
		<category><![CDATA[Microsoft and SQL*Server]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[ParAccel]]></category>
		<category><![CDATA[Pricing]]></category>
		<category><![CDATA[SAND Technology]]></category>
		<category><![CDATA[Storage]]></category>
		<category><![CDATA[Sybase]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Workload management]]></category>
		<category><![CDATA[illuminate Solutions]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=3744</guid>
		<description><![CDATA[Edit: Comments on the February, 2012 Gartner Magic Quadrant for Data Warehouse Database Management Systems &#8212; and on the companies reviewed in it &#8212; are now up. The Gartner 2010 Data Warehouse Database Management Systems Magic Quadrant is out. I shall now comment, just as I did to varying degrees on the 2009, 2008, 2007, [...]]]></description>
			<content:encoded><![CDATA[<p><em>Edit: Comments on the February, 2012 <a href="http://www.dbms2.com/2012/02/08/gartner-magic-quadrant-data-warehouse-2011-2012/">Gartner Magic Quadrant for Data Warehouse Database Management Systems</a> &#8212; and on the companies reviewed in it &#8212; are now up.</em></p>
<p>The <a href="http://www.gartner.com/technology/media-products/reprints/teradata/vol3/article1/article1.html">Gartner 2010 Data Warehouse Database Management Systems Magic Quadrant</a> is out. I shall now comment, just as I did to varying degrees on the <a href="../../../../../2010/02/10/gartner-magic-quadrant-data-warehouse-2009-2010/">2009</a>, <a href="../../../../../2009/01/12/gartners-2008-data-warehouse-database-management-system-magic-quadrant-is-out/">2008</a>, <a href="../../../../../2007/10/19/gartner-2007-magic-quadrant-for-data-warehouse-database-management-systems/">2007</a>, and <a href="../../../../../2006/10/03/vendor-segmentation-for-data-warehouse-dbms/">2006</a> Gartner Data Warehouse Database Management System Magic Quadrants.</p>
<p><em>Note: Links to Gartner Magic Quadrants tend to be unstable. Please alert me if any problems arise; I&#8217;ll edit accordingly.</em></p>
<p>In <a href="../../../../../2009/01/12/gartners-2008-data-warehouse-database-management-system-magic-quadrant-is-out/">my comments on the 2008 Gartner Data Warehouse Database Management Systems Magic Quadrant</a>, I observed that <strong>Gartner&#8217;s &#8220;completeness of vision&#8221; scores were generally pretty reasonable,</strong> but their<strong> &#8220;ability to execute&#8221; rankings were somewhat bizarre;</strong> the same remains true this year. For example, Gartner ranks Ingres higher by that metric than Vertica, Aster Data, ParAccel, or Infobright. Yet each of those companies is growing nicely and delivering products that meet serious cutting-edge analytic DBMS needs, neither of which has been true of Ingres since about 1987.  <span id="more-3744"></span></p>
<p>The general list of &#8220;market forces, end-user expectations and vendors&#8217; resulting solution approaches&#8221; at the top of the 2010 Gartner Data Warehouse Database Management System Magic Quadrant article is a mixed bag. Following Gartner&#8217;s order, I&#8217;ll address those first, and particular companies cited afterwards. Specific items and comments include:</p>
<ul>
<li><strong>&#8220;Increased demand for optimization techniques and performance enhancement.</strong><strong>&#8220;</strong> Gartner seems to be saying that data warehouse DBMS buyers want lists of specific, esoteric performance features. Well, buyers always want their DBMS to run fast, and they&#8217;d like the products to be mature enough to have been through a few rounds of <a href="../../../../../2009/08/21/bottleneck-whack-a-mole/">Bottleneck Whack-A-Mole</a>, but otherwise I&#8217;m not sure I&#8217;d put that at the top of my list.</li>
<li><strong>&#8220;</strong><strong>The argument made by purchasing departments that buying power increases when dealing with a single, incumbent vendor.</strong><strong>&#8220;</strong><strong> </strong>I agree that <a href="../../../../../2011/02/02/exadata-notes/">vendor consolidation and account control</a> are a huge part of the Oracle, Microsoft, IBM and even Teradata stories. (Vertica can prove it&#8217;s 10X more price-performant than Oracle and still not get the business.) But it&#8217;s not just about price negotiations; once annual maintenance is included, one has to squint pretty hard to see Oracle as a low-cost alternative. Also important is reducing the number of total product-specific skill-sets needed on the IT staff.</li>
<li><strong>&#8220;</strong><strong>Prepackaged, prebalanced warehouse environments delivered using data warehouse appliances.</strong><strong>&#8220;</strong> Yep. To varying extents, Oracle, Microsoft, Teradata, and IBM are all committed to designed-hardware strategies.</li>
<li><strong>&#8220;</strong><strong>Expectations for the delivery of on-site POCs.</strong><strong>&#8220;</strong> Honestly, not as many buyers insist on on-site Proofs of Concept as should. Still, Oracle is shameful in its reluctance to do them. (Teradata tries to avoid them too, for obvious reasons of expense, but is much more gracious about capitulating when the buyer insists.)</li>
<li><strong>&#8220;</strong><strong>Cost controls and data warehouse performance management.</strong><strong>&#8220;</strong><strong> </strong>See next comment.</li>
<li><strong>&#8220;</strong><strong>Demands for delivering a fully mixed workload.</strong><strong>&#8220;</strong><strong> </strong>I&#8217;d have phrased the workload management and administrative tools points rather differently than this, but so be it.<strong> </strong></li>
<li><strong>&#8220;</strong><strong>Demands for departmental analytics delivered quickly via data marts.</strong><strong>&#8220;</strong><strong> </strong>Agreed. Data-mart-only installations are a huge part of the analytic DBMS market. <a href="../../../../../2009/06/08/the-future-of-data-marts/">Data mart spin-out</a> is also important.</li>
<li><strong>&#8220;</strong><strong>Wider indexing and fast performance within clusters of data, delivered via column-based solutions.</strong><strong>&#8220;</strong> This bizarrely seems to conflate column stores and parallel processing (both of which are of course highly important).</li>
<li><strong>&#8220;</strong><strong>A wave of new data warehouse implementers seeking fast-track, low-risk delivery.</strong><strong>&#8220;</strong> Well, yes. Netezza noticed that quite some years ago. And by now the <a href="../../../../../2010/04/12/enterprise-data-warehouse-edw-myt/">long-gestation EDW (Enterprise Data Warehouse)</a> is widely disliked.</li>
<li><strong>&#8220;</strong><strong>Global organizations seeking distributed solutions as potential architecture.</strong><strong>&#8220;</strong> If this is the MPP point, it&#8217;s oddly phrased. If this is a suggestion that data warehouses should be partitioned across wide-area networks, it&#8217;s just plain odd. If it&#8217;s a reiteration that departments like to control their own data marts, I agree. And if it&#8217;s a comment on keep-data-in-the-country privacy laws, it could be the most prescient thing Donald Feinberg has said in many years.</li>
</ul>
<p>Long though it is, that list of general items and issues for the 2010 Gartner Data Warehouse Database Management System Magic Quadrant has some gaps. Most glaringly, I don&#8217;t see any references to <a href="../../../../../2011/01/24/analytic-computing-system/">advanced analytics</a> in general, or even to the specific case of <a href="../../../../../2010/05/15/further-clarifying-in-database-mpp-sas/">integrated predictive analytics</a>. There&#8217;s also nothing about solid-state memory or other storage-technology considerations, although in fairness it&#8217;s still early days for much of what vendors conceive of as competitive differentiation in those respects.</p>
<p>Here are some vendor-specific comments on the 2010 Gartner Data Warehouse Database Management System Magic Quadrant:</p>
<ul>
<li>It&#8217;s pretty bizarre to compare <strong>1010data</strong> to database.com or Microsoft Azure. Kognitio would be a better choice. So would cloud-hosted instances of Vertica, Aster Data nCluster, or others.</li>
<li>Gartner&#8217;s comments on <strong>Aster Data</strong> and nCluster are actually pretty reasonable.</li>
<li>Gartner&#8217;s comments on <strong>EMC/Greenplum</strong> are a bit Kool-Aid-drinky, and don&#8217;t account for the inevitable flailing that occurs right after an acquisition. But otherwise they&#8217;re pretty reasonable.</li>
<li>I don&#8217;t take <strong>IBM&#8217;s</strong> super-comprehensive-all-inclusive architectural stories as seriously as Gartner does.</li>
<li>I don&#8217;t take <strong>Netezza&#8217;s</strong> small stable of OEM partners as seriously as Gartner does. I also don&#8217;t share Gartner&#8217;s optimism for the continuation of Netezza&#8217;s NEC partnership in the face of IBM&#8217;s Netezza ownership.</li>
<li>I&#8217;m even more skeptical about <a href="../../../../../2008/03/27/the-illuminate-guys-have-a-cto-blog/">illuminate</a> than Gartner is.</li>
<li>I&#8217;m delighted that Gartner has adopted my phrase <a href="../../../../../2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a> <strong>(Infobright</strong> is one of several firms pushing that one).</li>
<li>&#8220;Only open-source column-store DBMS&#8221; is a bit exaggerated, but Infobright is indeed the only one with serious traction, or offered by a serious analytic DBMS vendor.</li>
<li>What Gartner said in connection with <strong>Ingres</strong> is too inaccurate to deserve detailed attention.</li>
<li>While Gartner&#8217;s write-up of <strong>Kognitio</strong> is a bit confused, that&#8217;s excusable. Kognitio&#8217;s strategy changes often.</li>
<li>I&#8217;m not persuaded by the claim of low <strong>Microsoft</strong> TCO. The days when Microsoft&#8217;s tools were vastly better than the competition&#8217;s are long gone. And using an OLTP DBMS for data warehousing generally takes more people effort than using something more purpose-built.</li>
<li>Gartner is right to ding <strong>Oracle</strong> for high prices, high people costs, and unwillingness to do onsite POCs.</li>
<li>Gartner is right that <strong>Exadata</strong> is a huge improvement over non-Exadata Oracle data warehousing.</li>
<li>Gartner is right to suggest that Exadata can easily handle data warehouses over 20 terabytes in size, but wrong to suggest that software-only Oracle also can. Just because the pain is less than it was with earlier releases of Oracle doesn&#8217;t mean it isn&#8217;t still bad.</li>
<li>Gartner&#8217;s comments on <strong>ParAccel</strong> are pretty reasonable.</li>
<li>Gartner&#8217;s comments on compression in connection with <strong>SAND</strong> make no technical sense (tokenization is a key form of columnar compression, not an alternative to it). Also, SAP&#8217;s acquisition of Sybase is a business challenge for SAND, not a technical one.</li>
<li>Unless I&#8217;m forgetting something, <strong>Sybase IQ</strong> has no more in-database data mining than any other Fuzzy Logix partner does.</li>
<li>Gartner failed to note that, like other DBMS dating back to the 1990s and before, Sybase IQ is more complex to administer than some newer products are.</li>
<li>Gartner&#8217;s take on <strong>Teradata </strong>is pretty reasonable.</li>
<li>Gartner&#8217;s take on <strong>Vertica, </strong>while sloppy, is basically sensible. However, Gartner failed to note that Vertica is a laggard in non-query analytics. (I am sure those deficiencies are being addressed, but Vertica&#8217;s competitors are moving ahead as well.)</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/02/05/gartner-magic-quadrant-data-warehouse-database-management-2010/feed/</wfw:commentRss>
		<slash:comments>23</slash:comments>
		</item>
		<item>
		<title>Exadata notes</title>
		<link>http://www.dbms2.com/2011/02/02/exadata-notes/</link>
		<comments>http://www.dbms2.com/2011/02/02/exadata-notes/#comments</comments>
		<pubDate>Wed, 02 Feb 2011 07:05:53 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Benchmarks and POCs]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Teradata]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=3715</guid>
		<description><![CDATA[It&#8217;s been a while since I penetrated Oracle&#8217;s tight message control and actually talked with them about Exadata. But Doug Henschen wrote a good article about Exadata based on an Andy Mendelsohn webcast. I agree with almost all of it. At first I was a little surprised that Exadata&#8217;s emphasis shift from data warehousing to [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s been a while since I penetrated Oracle&#8217;s tight message control and actually talked with them about Exadata. But Doug Henschen wrote <a href="http://www.informationweek.com/news/software/bi/showArticle.jhtml?articleID=229100353">a good article about Exadata based on an Andy Mendelsohn webcast</a>. I agree with almost all of it. At first I was a little surprised that <a href="http://www.dbms2.com/2010/01/22/oracle-database-hardware-strategy/">Exadata&#8217;s emphasis shift from data warehousing to OLTP/generic consolidation</a> hasn&#8217;t gone more quickly, but on the other hand:</p>
<ul>
<li>On the data warehouse side Exadata can alleviate screaming pain points.</li>
<li>In OLTP consolidation, Exadata mainly can save money. (Yes, I just said a product from Oracle can save customers money, and I meant it. You may stop laughing at any time.)</li>
</ul>
<p>Doug did overstate when he said that columnar architectures give 100X or more compression. That doesn&#8217;t happen. Yes, columnar compression can be &gt;10X in a variety of use cases, while pre-Exadata Oracle index bloat can approach 10X at times; but even if you&#8217;re counting that way I doubt there are many instances in which it actually multiplies out to &gt;100.</p>
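<p>To spell out the &#8220;counting that way&#8221; arithmetic: if a bloated row-store footprint is compared against a compressed columnar one, the two factors multiply. The sketch below uses hypothetical figures chosen only to show how the product behaves; it assumes the same raw data size on both sides.</p>

```python
def footprint_ratio(raw_tb, index_bloat, columnar_compression):
    """Compare an index-bloated footprint with a compressed columnar footprint."""
    bloated_tb = raw_tb * index_bloat              # raw data plus index overhead
    columnar_tb = raw_tb / columnar_compression    # same raw data, compressed
    return bloated_tb / columnar_tb                # equals bloat * compression

# Hypothetical cases for 1 TB of raw data.
both_extremes = footprint_ratio(1.0, index_bloat=10, columnar_compression=10)
heavy_bloat_only = footprint_ratio(1.0, index_bloat=10, columnar_compression=4)
strong_compression_only = footprint_ratio(1.0, index_bloat=3, columnar_compression=10)

print(both_extremes, heavy_bloat_only, strong_compression_only)
```

<p>Only the case where both factors hit their extremes on the same data reaches 100; if either one is merely moderate, the product falls well short.</p>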
<p>In other Exadata news, the long-standing observation that <a href="http://www.dbms2.com/2009/02/01/oracle-says-they-do-onsite-exadata-pocs-after-all/">Oracle doesn&#8217;t like to do on-site Exadata POCs</a> still holds true. A couple of existing Oracle users &#8212; one rather well-known &#8212; recently told me that Oracle won&#8217;t let them test Exadata except on Oracle premises. In one case, this is a deal-breaker keeping Exadata from being considered for a purchase, and Oracle still won&#8217;t budge.</p>
<p>Finally, I&#8217;m pretty sure that this &#8220;new&#8221; Softbank Teradata replacement Oracle has been touting since September as competitive evidence &#8212; which Doug&#8217;s article also references &#8212; isn&#8217;t quite what it sounds like. I believe Teradata&#8217;s version of the story, which, somewhat edited, goes like this:  <span id="more-3715"></span></p>
<blockquote>
<ul>
<li>The  Oracle Exadata decision at Softbank Mobile was  driven by business management in spite of <strong> Teradata being recommended by the technical team. </strong></li>
<li>To reiterate, the  data  warehouse project team recommended Teradata over Oracle.  The Teradata  proposal was well received in terms of TCO, performance, ease of use and  safety  of transition, etc. against Oracle Exadata.  However,<strong> the technical  team&#8217;s recommendation was overruled due to the business mandate to  standardize on Oracle throughout the company. </strong></li>
<li>SoftBank  Mobile has over 800  Oracle specialists in IT departments and Software subsidiaries.</li>
<li>The  Exadata performance is being compared to the existing production  system.   Teradata was NOT invited to benchmark a current generation  system.</li>
<li>Also, <strong>Softbank Mobile is a  reseller of Oracle.</strong></li>
</ul>
</blockquote>
<p>Teradata went on to clarify:</p>
<blockquote><p>Here are some  key points  regarding the Teradata systems at SoftBank:</p>
<ul>
<li>Two  Teradata systems:  Production #1 - 32 nodes.  Production #2  - 12 nodes.</li>
<li>Production #1 had nodes ranging from <strong>~3-7  years old.</strong></li>
<li>Production #2 had nodes that were <strong>~8 years  old.</strong></li>
<li>Teradata  V2R5 was <strong>end of life</strong> at the time of replacement.</li>
<li>We <strong>did  not get a chance to compete for this  business.</strong></li>
</ul>
</blockquote>
<p><strong>Bottom line: Oracle&#8217;s big competitive replacement of Teradata systems was against 3-8 year old boxes that the customer&#8217;s technical staff recommended be replaced by more Teradata gear.</strong></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/02/02/exadata-notes/feed/</wfw:commentRss>
		<slash:comments>25</slash:comments>
		</item>
		<item>
		<title>Architectural options for analytic database management systems</title>
		<link>http://www.dbms2.com/2011/01/18/architectural-options-for-analytic-database-management-systems/</link>
		<comments>http://www.dbms2.com/2011/01/18/architectural-options-for-analytic-database-management-systems/#comments</comments>
		<pubDate>Tue, 18 Jan 2011 14:22:09 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Benchmarks and POCs]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data pipelining]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[Michael Stonebraker]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=3588</guid>
		<description><![CDATA[Mike Stonebraker recently kicked off some discussion about desirable architectural features of a columnar analytic DBMS. Let&#8217;s expand the conversation to cover desirable architectural characteristics of analytic DBMS in general.  But first, a few housekeeping notes: This is a very long post. Even so, to keep it somewhat manageable, I&#8217;ve cut corners on completeness. Most [...]]]></description>
			<content:encoded><![CDATA[<p>Mike Stonebraker recently kicked off some discussion about <a href="../../../../../2011/01/12/mike-stonebraker-on-real-column-stores/">desirable architectural features of a columnar analytic</a> DBMS. Let&#8217;s expand the conversation to cover desirable architectural characteristics of analytic DBMS in general.  <span id="more-3588"></span>But first, a few housekeeping notes:</p>
<ul>
<li>This is a very long post.</li>
<li>Even so, to keep it somewhat manageable, I&#8217;ve cut corners on completeness. Most notably, two important areas are entirely deferred to future posts &#8212; advanced-analytics-specific architecture, and in-memory processing (including CEP).</li>
<li>The subjects here are not strictly parallel. The distinction between major add-on modules and &#8220;turtles all the way down&#8221; core architectural choices is rarely crystal-clear &#8212; Mike Stonebraker&#8217;s recent post notwithstanding <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' />  &#8212; and I&#8217;ve mixed subjects of varying degrees of &#8220;fundamentalness&#8221; pretty freely.</li>
<li>There&#8217;s a long list of links at the end, pointing at posts that help explain or give examples of specific features named in the body of the text, somewhat like unnumbered footnotes.</li>
</ul>
<p>OK. In my opinion, the four drop-dead requirements for an analytic DBMS are:</p>
<ul>
<li><strong>Relational/SQL support.</strong> That&#8217;s how you get great flexibility in more or less easily constructing queries, as well as connectivity to a vast number of tools. In a few cases, I guess <strong>MDX</strong> might suffice as an alternative.</li>
<li>Sufficiently <strong>great query performance,</strong> on the queries you&#8217;re actually going to run, for however many concurrent users you actually will have.</li>
<li>Sufficiently high <strong>data loading throughput</strong> and sufficiently low <strong>data loading latency.</strong></li>
<li>Sufficiently favorable <strong>TCO </strong>(Total Cost of Ownership), all things considered, where &#8220;all things&#8221; at a minimum includes software license, software maintenance, hardware, power, people costs for administration, and people costs for development.</li>
</ul>
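<p>As a rough illustration of the TCO point, the cost components listed above can simply be added up over a planning horizon. This is a minimal sketch, not a real costing model: the split into one-time and annual costs, the field names, and every dollar figure are assumptions made up for the example.</p>

```python
from dataclasses import dataclass

@dataclass
class CostInputs:
    software_license: float      # assumed one-time
    hardware: float              # assumed one-time
    annual_maintenance: float    # software maintenance, per year
    annual_power: float          # per year
    annual_admin_people: float   # administration staff, per year
    annual_dev_people: float     # development staff, per year

def total_cost_of_ownership(costs, years):
    """Sum one-time costs plus recurring costs over the given number of years."""
    one_time = costs.software_license + costs.hardware
    recurring = (costs.annual_maintenance + costs.annual_power
                 + costs.annual_admin_people + costs.annual_dev_people)
    return one_time + years * recurring

# Hypothetical system, three-year horizon.
example = CostInputs(500_000, 300_000, 100_000, 20_000, 150_000, 200_000)
print(total_cost_of_ownership(example, years=3))
```

<p>Even with invented numbers, the shape is typical: over a few years the recurring people and maintenance lines can outweigh the up-front license and hardware.</p>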
<p>Depending on your use case, you might have additional make-or-break requirements. Possible areas include:</p>
<ul>
<li>Additional <strong>query functionality,</strong> of course with good performance. Specific examples include:
<ul>
<li>ANSI-standard SQL features that are not universally supported (e.g. windowing).</li>
<li>Geospatial datatype support.</li>
</ul>
</li>
<li>Further high-performing <strong>integrated analytics</strong>, such as:
<ul>
<li>Data mining/machine learning modeling and scoring.</li>
<li>Other mathematical functions, such as linear algebra, optimization, or Monte Carlo simulation.</li>
<li>Extensibility via MapReduce and/or sufficiently robust user-defined function (UDF) capabilities.</li>
</ul>
</li>
<li>Platform support that matches your needs.</li>
<li>Security, auditability, and/or high-performance encryption.</li>
</ul>
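<p>For readers unfamiliar with the windowing example above, here is a small sketch of what a SQL window function does, run against SQLite purely because it ships with Python. It assumes a SQLite build of version 3.25 or later (when window functions were added); the <code>sales</code> table and its rows are invented for illustration.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, month INTEGER, revenue INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("east", 1, 100), ("east", 2, 150), ("east", 3, 120),
    ("west", 1, 200), ("west", 2, 180),
])

# Running total of revenue per region, ordered by month, without a self-join.
rows = conn.execute("""
    SELECT region, month, revenue,
           SUM(revenue) OVER (PARTITION BY region ORDER BY month) AS running_total
    FROM sales
    ORDER BY region, month
""").fetchall()

for row in rows:
    print(row)
```

<p>Without windowing support, the same running total typically needs a self-join or a correlated subquery, which is exactly the kind of query where performance differences between products show up.</p>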
<p>Other possibly important features &#8212; but ones that would usually go on &#8220;nice to have&#8221; rather than &#8220;must have&#8221; lists &#8212; include:</p>
<ul>
<li>Yet more <strong>query functionality,</strong> in areas such as:
<ul>
<li>Non-standard SQL extensions (e.g. temporal ones)</li>
<li>Specific prepackaged UDFs.</li>
<li>Cross-column text search.</li>
</ul>
</li>
<li>Nice <strong>administrative tools,</strong> in areas such as:
<ul>
<li>Single-query performance/optimization.</li>
<li>Authorization/permission.</li>
<li>Workload management.</li>
<li>Data mart spin-out.</li>
</ul>
</li>
</ul>
<p>So what kinds of architectural choices (or major features) should one look for to support such features? On the performance side there are many candidates, including:</p>
<ul>
<li><strong>Specialized indexes</strong>, more commonly found in older DBMS. Leading examples include star and especially bitmap indices, both of which I was already writing about back in the 1990s. Ditto <strong>materialized views</strong>, which aren&#8217;t exactly indices, but are closely related.</li>
<li><strong>Partition elimination.</strong> Single- or multi-level range partitioning can cause whole regions of the database never to be checked in a particular query&#8217;s evaluation. (That&#8217;s a good thing.) The functionality popularized by Netezza as <strong>zone maps </strong>does something similar, without requiring the partitions to be chosen in advance.</li>
<li><strong>Scan-friendliness.</strong> If a query runs a long time, it may include a lot of (full or partial) table scanning. Assuming you rely on spinning disk &#8212; as opposed to solid-state memory &#8212; one way to improve your sequential-scan throughput far above your random-read throughput is to support <a href="../../../../../2006/09/20/teradata-netezza-datallegro-appliance/">large block sizes. </a></li>
<li><strong>Parallelism</strong>. It&#8217;s possible to screw up even multi-core parallelism, but the big issue is multi-server. In particular:
<ul>
<li>An analytic DBMS must <strong>avoid a &#8220;fat head&#8221; bottleneck,</strong> either because there is no head node at all directing things, or else because data redistribution algorithms are sufficiently mature as to not overload it. (In naive parallel DBMS implementations, intermediate query results get sent back to the head node to be, for example, JOINed together. This is not a good thing.)</li>
<li>Multiple analytic DBMS vendors have chosen to develop <strong>custom data transfer protocols,</strong> for more reliable performance than they can get from TCP/IP. Examples include Teradata, Netezza, and ParAccel.</li>
</ul>
</li>
<li><strong>Predicate pushdown.</strong> Predicate pushdown takes several forms, in all cases having the goal of executing certain simpler database operations &#8212; predicate evaluations &#8212; close to the data, thus minimizing I/O or upstream processing.
<ul>
<li>Netezza famously offloads the first part of predicate evaluation to FPGA (Field-Programmable Gate Array) chips.</li>
<li>At least in theory, I like <a href="../../../../../2008/09/28/exadata-oracle-database-machine-parallelization/">the Exadata form of node specialization</a>, in which a tier of server nodes does the first part of the processing, with the results being sent to a second upstream database tier. But it&#8217;s not obvious that any RDBMS vendor has done a great job with it. Oracle is famously secretive about Exadata&#8217;s track record, and as of this writing apparently still resists on-site benchmarks. <a href="../../../../../2008/09/05/mpp-data-warehouse-nodes/">Calpont</a> hasn&#8217;t accomplished much. And <a href="../../../../../2010/11/29/marklogic-and-its-document-dbms/">MarkLogic</a> of course doesn&#8217;t sell an RDBMS.</li>
<li>There&#8217;s reason to think predicate pushdown would help exploit flash memory, although I&#8217;m not sure vendors are moving in a direction that will let us find out.</li>
</ul>
</li>
<li><strong>Columnar</strong> data storage. Columnar storage is pretty much the ultimate in predicate pushdown, and advantageous in many analytic query scenarios. (Main exception: When you&#8217;re bringing back the majority of a row anyway, you might as well fetch the thing pre-assembled.) As Mike Stonebraker points out, <a href="../../../../../2011/01/12/mike-stonebraker-on-real-column-stores/">columnar storage should not incur serious row-ID overhead</a>, and ideally should be available for multiple sort orders on each column.</li>
<li><strong>Compression.</strong> This, rightly, is another of Mike Stonebraker&#8217;s favorite features. Database compression is hugely important, for I/O and in silicon alike. (And it can also save money on storage.) There is a broad variety of compression techniques, suited for different kinds of data, different kinds of queries, or different points on the storage saving/decompression performance tradeoff spectrum.</li>
<li><strong>Flexible storage.</strong> Not all data is best stored the same way, even if it&#8217;s in the same database. Some is destined for columnar-friendly use cases, some for whole-row ones. Some is compressed ideally by one technique, some by another. And so on. Some database managers do a good job of letting different parts of the database (even within the same table) be stored in different ways.</li>
<li><strong>Query pipelining. </strong>There are a lot of steps to query execution, in both the fine-grained sense (a whole lot of rows) and the coarse-grained (all but the simplest execution plans feature a number of operations each). FPGA-based vendors XtremeData and Kickfire used the innate parallelism of an FPGA to pipeline query execution. Kickfire failed, and XtremeData hasn&#8217;t sold many systems, but that doesn&#8217;t mean it isn&#8217;t a good idea. <a href="../../../../../2010/08/12/teradata-future-product-strategy/">Kickfire&#8217;s assets were sold to Teradata</a>. Meanwhile, VectorWise&#8217;s very name speaks to its (Intel-based) vector processing architecture.</li>
<li><strong>Result set reuse.</strong> Instead of mixing together different steps of the same query, how about mixing together the same step in different queries, so that you don&#8217;t have to repeat it? As a simple example, suppose two queries need to do the same table scan. Well then, it would be nice to only do the scan once. In most cases, query workloads are too diverse for result set reuse of that kind to be very important; still, it&#8217;s a cool feature, which Teradata calls <a href="../../../../../2006/09/20/teradata-netezza-datallegro-appliance/">synchronized scan</a>.</li>
<li><strong>Suitably optimized execution engine</strong> &#8211; column, row, whatever. (This is Mike Stonebraker&#8217;s &#8220;inner loop&#8221; point generalized.)</li>
<li>Well-factored<strong> query optimizer. </strong>No matter what, it&#8217;s good for a query optimizer to have been through a few rounds of <a href="../../../../../2009/08/21/bottleneck-whack-a-mole/">Bottleneck Whack-A-Mole</a>. Beyond that, an optimizer with sufficiently convenient hooks can have cool and occasionally valuable features such as:
<ul>
<li><strong>On-the-fly query re-planning.</strong> Do part of the query, rerun column statistics, and re-plan the query if appropriate.</li>
<li><strong>Not-so-black-box optimization.</strong> Work interactively with the DBA to find the best query plan.</li>
<li><strong>Query rewriting.</strong> Any decent optimizer will take a complex query and produce an execution plan that in some cases looks quite unlike the original query. Some optimizers go further in rewriting the query first, essentially to psych themselves into coming up with a better plan.</li>
</ul>
</li>
</ul>
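<p>To make the zone-map idea above concrete, here is a minimal Python sketch of block skipping. It is an illustration only, not Netezza&#8217;s implementation: the <code>Block</code> class, the function names, and the block size are all assumptions made for the example. Each storage block carries the minimum and maximum of one column, and a range query skips any block whose min/max interval cannot overlap the predicate.</p>

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """One storage block of a single column, plus its min/max summary."""
    values: List[int]
    lo: int  # smallest value stored in the block
    hi: int  # largest value stored in the block


def build_blocks(column: List[int], block_size: int) -> List[Block]:
    """Cut the column into fixed-size blocks and record min/max per block."""
    blocks = []
    for start in range(0, len(column), block_size):
        chunk = column[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def range_scan(blocks: List[Block], low: int, high: int) -> Tuple[List[int], int]:
    """Return the values in [low, high] and the number of blocks actually read."""
    hits: List[int] = []
    blocks_read = 0
    for block in blocks:
        if block.hi < low or block.lo > high:
            continue  # the summary proves no value in this block can match
        blocks_read += 1
        hits.extend(v for v in block.values if low <= v <= high)
    return hits, blocks_read


# Hypothetical data: 1,000 values stored in arrival order, 100 per block.
column = list(range(1000))
blocks = build_blocks(column, block_size=100)
hits, blocks_read = range_scan(blocks, 250, 260)
print(len(hits), "matching values;", blocks_read, "of", len(blocks), "blocks read")
```

<p>The skipping only pays off when values that are close together tend to land in the same block, as they do here because the sample column arrives in sorted order. On randomly ordered data, most blocks would span the whole value range and few could be skipped.</p>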
<p>You can&#8217;t do much with an analytic database unless you get data into it in the first place. Thus, performance in writing and loading data are important, and there are a number of architectural decisions that can be helpful in those regards.</p>
<ul>
<li><strong>Row-based architecture.</strong> Column stores have obvious advantages for query, but in a naive column store implementation you have tremendous overhead, pulling the rows apart and storing them in many different columns. This is particularly the case for small, frequent updates.</li>
<li><strong>Batched writes. </strong>The classic way to deal with column stores&#8217; data writing challenges is to batch data in memory, then bang it to disk only occasionally. Hopefully the data is available seamlessly for query in RAM before the disk-banging occurs. This technique is by no means restricted to analytic and/or columnar use cases, but the single best-known example may be Vertica&#8217;s Read-Optimized Store (disk)/Write-Optimized Store (RAM) pairing.</li>
<li><strong>Lack of indices and materialized views</strong>. Indexes and materialized views can help query speed, albeit at the cost of disk space and administrative effort. But maintaining them multiplies the difficulty of loading data in the first place.</li>
<li><strong>Lockless or optimistic-locking concurrency model.</strong> Locking models suitable for OLTP can be ridiculous for analytic databases, blocking queries for no good reason. Fortunately, there are alternatives.</li>
<li><strong>Append-only updating. </strong>When I/O volumes are high, append-only updating can give an important performance improvement over update-in-place, assuming you have sufficiently good algorithms for garbage-collection/clean-up. If I/O volumes are so low that you don&#8217;t care about the performance benefits, maybe it would be nice to have the &#8220;time-travel&#8221; feature that&#8217;s a potential byproduct of MVCC (Multi-Version Concurrency Control). Neither part of this observation applies solely to analytic DBMS.</li>
<li><strong>Parallel load (no fat head).</strong> It&#8217;s not just query execution that can get bottlenecked at a &#8220;head node;&#8221; the same can happen with loads, batch or otherwise. That&#8217;s not a good thing. Thus, various parallel analytic DBMS vendors have set up ways to load data directly to the nodes where it&#8217;s going to be stored.</li>
<li><strong>Specialized load nodes</strong>. <a href="../../../../../2008/10/22/aster-data-systems-ncluster/">Aster Data nCluster features specialized data loading nodes</a>, although Aster has introduced a more conventional kind of parallel load as well.</li>
</ul>
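<p>The batched-write pattern described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not Vertica&#8217;s actual design: the class name, the flush threshold, and the use of a sorted in-memory list as a stand-in for the on-disk store are all invented for illustration. New values land in an unsorted buffer, get merged into the sorted store in one batch, and queries consult both so that recent data is visible before it is flushed.</p>

```python
import bisect


class BatchedColumnStore:
    """Toy single-column store: unsorted write buffer plus sorted main store."""

    def __init__(self, flush_threshold=4):
        self.flush_threshold = flush_threshold
        self.write_buffer = []   # recent values, cheap to append to ("RAM")
        self.sorted_store = []   # sorted values, stands in for the disk store
        self.flushes = 0

    def insert(self, value):
        """Append to the buffer; flush only when the buffer is full."""
        self.write_buffer.append(value)
        if len(self.write_buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Merge the whole buffer into the sorted store in one batch."""
        self.sorted_store = sorted(self.sorted_store + self.write_buffer)
        self.write_buffer = []
        self.flushes += 1

    def count_in_range(self, low, high):
        """Count values in [low, high] across both stores."""
        left = bisect.bisect_left(self.sorted_store, low)
        right = bisect.bisect_right(self.sorted_store, high)
        buffered = sum(1 for v in self.write_buffer if low <= v <= high)
        return (right - left) + buffered


store = BatchedColumnStore(flush_threshold=4)
for value in [5, 1, 9, 3, 7, 2]:
    store.insert(value)

# Four values have been flushed; 7 and 2 are still only in the buffer,
# yet the query below sees them.
print(store.sorted_store, store.write_buffer, store.count_in_range(2, 7))
```

<p>The design choice being illustrated is that the expensive reorganization happens once per batch rather than once per row, while the buffer scan keeps unflushed rows queryable.</p>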
<p>And of course, all of the above need to be implemented in the context of well-configured combinations of hardware, networking, and software.</p>
<p>Topics I know I&#8217;ve left out include advanced-analytics functionality, and in-memory processing (CEP or otherwise). Also missing are specifics of compression algorithms &#8212; or indeed of anything else. I&#8217;m sure there&#8217;s much else missing besides, so please point out the most glaring omissions in the comment thread below. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p><strong><em>Related links:</em></strong></p>
<ul>
<li><a href="../../../../../2010/05/15/further-clarifying-in-database-mpp-sas/">Why even in-database scoring can be important</a> (May, 2010).</li>
<li><a href="../../../../../2009/10/18/three-big-myths-about-mapreduce/">Three big myths about MapReduce</a> (October, 2009).</li>
<li><a href="../../../../../2008/08/26/why-mapreduce-matters-to-sql-data-warehousing/">Why you might ever want to integrate MapReduce into your DBMS</a> (August, 2008).</li>
<li><a href="../../../../../2009/06/08/the-future-of-data-marts/">The future of data marts</a>, specifically data mart spin-out. (June, 2009).</li>
<li>Netezza offers both zone maps and <a href="../../../../../2010/06/21/netezza-database-software-technology-overview/">clustered base tables</a> (June, 2010).</li>
<li>Oracle Exadata <a href="../../../../../2010/01/22/oracle-database-hardware-strategy/">Storage Indexes</a> are like Netezza zone maps (January, 2010).</li>
<li><a href="../../../../../2009/08/08/netezza-fpga/">How Netezza uses the FPGA</a> (August, 2009).</li>
<li><a href="../../../../../2009/02/01/oracle-says-they-do-onsite-exadata-pocs-after-all/">Oracle is reluctant to do on-site Exadata POCs</a> (February, 2009). As of the end of 2010, that doesn&#8217;t seem to have changed.</li>
<li><a href="../../../../../2010/06/21/netezza-ibm-db2-compression/">The Netezza and IBM DB2 approaches to compression</a> (June, 2010, which is before IBM acquired Netezza).</li>
<li><a href="../../../../../2009/05/14/the-secret-sauce-to-clearpaces-compression/">The secret sauce to Rainstor&#8217;s extreme compression</a> (May, 2009, when Rainstor was still called Clearpace).</li>
<li><a href="../../../../../2009/08/04/pax-analytica-row-and-column-stores-begin-to-come-together/">The row-based/columnar distinction gets blurred</a>, e.g. by Vertica FlexStore (August, 2009).</li>
<li>And by <a href="../../../../../2009/10/14/greenplum-hybrid-columnar/">Greenplum</a> (October, 2009). Also contains the observation that even row-style compression works better when data is stored columnarly.</li>
<li>And by <a href="../../../../../2010/09/15/aster-data-ncluster-version-4-6/">Aster Data</a> (September, 2010).</li>
<li>Teradata is particularly aggressive about <a href="../../../../../2009/08/02/teradata-13-focuses-on-advanced-analytic-performance/">query rewrite</a> (August, 2009).</li>
<li><a href="../../../../../2006/09/27/logless-lockless-netezza-more-carefully-explained/">Netezza&#8217;s logless, lockless architecture</a> (September, 2006).</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/01/18/architectural-options-for-analytic-database-management-systems/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Where ParAccel is at</title>
		<link>http://www.dbms2.com/2010/10/17/paraccel/</link>
		<comments>http://www.dbms2.com/2010/10/17/paraccel/#comments</comments>
		<pubDate>Sun, 17 Oct 2010 08:21:04 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Benchmarks and POCs]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Memory-centric data management]]></category>
		<category><![CDATA[ParAccel]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Storage]]></category>
		<category><![CDATA[Vertica Systems]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=3296</guid>
		<description><![CDATA[Until recently, I was extremely critical of ParAccel&#8217;s marketing. But there was an almost-clean sweep of the relevant ParAccel executives, and the specific worst practices I was calling out have for the most part been eliminated. So I was open to talking and working with ParAccel again, and that&#8217;s now happening. On my recent California [...]]]></description>
			<content:encoded><![CDATA[<p>Until recently, <a href="http://www.dbms2.com/2010/01/15/there-sure-seem-to-be-a-lot-of-inaccuracies-on-paraccels-website/">I was extremely critical of ParAccel&#8217;s marketing</a>. But there was an almost-clean sweep of the relevant ParAccel executives, and the specific worst practices I was calling out have for the most part been eliminated. So I was open to talking and working with ParAccel again, and that&#8217;s now happening. On my recent California trip, I chatted with three ParAccel folks for a few hours. Based on that and other conversations, here&#8217;s the current ParAccel story as I understand it.<br />
<span id="more-3296"></span><br />
I&#8217;ve already noted that <a href="http://www.dbms2.com/2010/08/09/links-and-observations/">PADB 3.0 is coming soon</a> (ParAccel Analytic DataBase), but pending its arrival, ParAccel&#8217;s technical story is primarily about <strong>query performance.</strong> More specifically:</p>
<ul>
<li>ParAccel asserts that PADB is much faster than other analytic DBMS &#8212; even close competitors such as Vertica &#8212; on <strong>especially complex queries. </strong>&#8220;60-way joins&#8221; were mentioned. So was the flattening of correlated subqueries.</li>
<li>ParAccel also claims industry-leading performance on simpler queries, but not by the same (or perhaps even particularly large) margins.</li>
<li>Mercifully, ParAccel no longer <a href="http://www.dbms2.com/2009/07/08/progress-in-figuring-out-what-paraccel-is-doing/">claims to have never, ever lost on performance in a customer evaluation</a>. But it still says that is very close to being true.</li>
<li>Major reasons ParAccel gives for PADB&#8217;s high performance include:
<ul>
<li>Like Vertica, Sybase IQ, and others, PADB uses a <strong>columnar</strong> architecture.</li>
<li>ParAccel thinks PADB&#8217;s newest <strong>query optimizer</strong> &#8212; fondly named <a href="http://paraccel.com/technology/omne-optimizer/">Omne</a> &#8212; is outstanding.</li>
<li>ParAccel&#8217;s PADB <strong>compiles its queries.</strong></li>
<li>In general, ParAccel is just performance-obsessed.</li>
</ul>
</li>
<li>One could also mention:
<ul>
<li>ParAccel&#8217;s PADB runs smoothly in-memory, if that&#8217;s what you want.</li>
<li>ParAccel also offers a Flash option for PADB.</li>
<li>Like many other analytic DBMS vendors, ParAccel has created a custom networking protocol. (ParAccel has talked about that <a href="http://www.dbms2.com/2010/04/16/story-of-an-analytic-dbms-evaluation/">altogether too much</a> in the past.)</li>
<li>Like Vertica, ParAccel&#8217;s PADB generally decompresses data as late as the particular compression scheme used allows. (Well, actually, that&#8217;s not one ParAccel mentions unless asked.)</li>
<li>ParAccel has long encouraged one to put part of one&#8217;s database on direct-attached storage as a kind of persistent cache, plus all of it on a storage-area network, because PADB can optimize its scans to go against both physical stores.</li>
<li>ParAccel&#8217;s PADB does encryption a block at a time, rather than a row at a time, so there&#8217;s very little overhead to using the encryption feature.</li>
</ul>
</li>
<li>ParAccel says that PADB has no indexes, materialized views, etc., notwithstanding that <a href="http://www.dbms2.com/2008/02/18/paraccel-technical-overview/">I heard something different from Barry Zane a few years ago</a>. This is the basis for ParAccel&#8217;s claim that <strong>no tuning</strong> (or at least very little) is required, or indeed even possible &#8230;</li>
<li>&#8230; and similarly, it is the reason ParAccel encourages prospects to do ad-hoc queries in their POCs (Proofs Of Concept), at least when Vertica is the competitor.</li>
<li>However, ParAccel&#8217;s PADB has rather <strong>complex initial set-up.</strong> This has been the basis for widespread skepticism about ParAccel&#8217;s &#8220;no tuning&#8221; claim. ParAccel is working to automate that away, but admits to being only part-way through the process.</li>
<li>Highlights of ParAccel&#8217;s data writing strategy include:
<ul>
<li>PADB sends data transactionally to disk.</li>
<li>PADB usually sends data to disk a block at a time, because it is coming in fast enough for that to work out (either due to bulk load or streaming).</li>
<li>PADB is <strong>append-only</strong> &#8230;</li>
<li>&#8230; so PADB has a garbage-collection mechanism called Vacuum. Right now Vacuum has to be started manually, but doesn&#8217;t block reads and writes; full background garbage collection is of course a roadmap feature.</li>
<li>As is natural for append-only systems, ParAccel&#8217;s PADB has MVCC (MultiVersion Concurrency Control) and snapshot isolation.</li>
</ul>
</li>
<li>Name a <strong>compression</strong> method, and PADB probably has it &#8212; 13 in all by ParAccel&#8217;s count, including dictionary/token, run-length encoding, Delta, LZ, and so on.</li>
</ul>
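<p>For readers who want to see how append-only writing, snapshot reads, and a Vacuum-style clean-up fit together, here is a minimal Python sketch. It is not PADB&#8217;s actual mechanism: the class, its method names, and the integer transaction counter are assumptions made purely for illustration. Every write appends a new version, a read picks the newest version visible as of a given snapshot, and vacuum discards versions that no snapshot still in use could see.</p>

```python
class AppendOnlyTable:
    """Toy key/value table in which updates append versions instead of overwriting."""

    def __init__(self):
        self.log = []       # (txn, key, value) records, appended in txn order
        self.last_txn = 0

    def put(self, key, value):
        """Append a new version and return the transaction number that wrote it."""
        self.last_txn += 1
        self.log.append((self.last_txn, key, value))
        return self.last_txn

    def read(self, key, as_of=None):
        """Return the newest value for key written at or before the snapshot."""
        snapshot = self.last_txn if as_of is None else as_of
        result = None
        for txn, k, value in self.log:
            if txn > snapshot:
                break  # everything after this point is invisible to the snapshot
            if k == key:
                result = value
        return result

    def vacuum(self, oldest_snapshot):
        """Drop versions that no snapshot at or after oldest_snapshot can see."""
        newest_visible = {}  # key -> txn of the version visible at oldest_snapshot
        for txn, k, _ in self.log:
            if txn <= oldest_snapshot:
                newest_visible[k] = txn
        kept = [rec for rec in self.log
                if rec[0] > oldest_snapshot or newest_visible.get(rec[1]) == rec[0]]
        removed = len(self.log) - len(kept)
        self.log = kept
        return removed


table = AppendOnlyTable()
t1 = table.put("acct", 100)
t2 = table.put("acct", 150)
t3 = table.put("acct", 175)

current = table.read("acct")            # latest version
historic = table.read("acct", as_of=t1) # "time travel" to the first version
removed = table.vacuum(oldest_snapshot=t2)
print(current, historic, removed)
```

<p>Readers never block writers in this scheme because nothing is overwritten in place. The cost is the clean-up pass, which is why garbage collection shows up as a feature in its own right.</p>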
<p>Tracking ParAccel&#8217;s customer success has long been difficult. The <a href="http://www.dbms2.com/2010/02/10/gartner-magic-quadrant-data-warehouse-2009-2010/">2009 Gartner Magic Quadrant</a> claim of ~20 ParAccel customers seems odd to everybody, including ParAccel. ParAccel&#8217;s own reporting of customer wins around then was <a href="http://www.dbms2.com/2010/01/15/there-sure-seem-to-be-a-lot-of-inaccuracies-on-paraccels-website/">quite confusing</a>. And ParAccel&#8217;s customer count a year before that was <a href="http://www.dbms2.com/2009/01/03/paraccels-market-momentum/">extremely low</a>. But ParAccel&#8217;s Michael Weir just rounded up some figures for me, namely:</p>
<ul>
<li>ParAccel has 30+ revenue-recognized customers, not counting OEMs, OEMs&#8217; customers, or paid POCs.</li>
<li>2 ParAccel customers have &gt; 100 TB of user data.</li>
<li>7 ParAccel customers have &gt; 10 TB of user data.</li>
<li>The largest ParAccel cluster is 28 nodes and growing.</li>
</ul>
<p>Naturally, Michael went on to note that even relatively small databases can have high value.</p>
<p>One last note: ParAccel has approximately 78 employees.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/10/17/paraccel/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Best practices for analytic DBMS POCs</title>
		<link>http://www.dbms2.com/2010/06/14/best-practices-analytic-database-poc/</link>
		<comments>http://www.dbms2.com/2010/06/14/best-practices-analytic-database-poc/#comments</comments>
		<pubDate>Mon, 14 Jun 2010 12:53:33 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Benchmarks and POCs]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[ParAccel]]></category>
		<category><![CDATA[Teradata]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2297</guid>
		<description><![CDATA[When you are selecting an analytic DBMS or appliance, most of the evaluation boils down to two questions: How quickly and cost-effectively does it execute SQL? What analytic functionality, SQL or otherwise, does it do a good job of executing? And so, in undertaking such a selection, you need to start by addressing three issues: [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">When you are selecting an analytic DBMS or appliance, most of the evaluation boils down to two questions:</p>
<ul>
<li><span style="font-style: normal;">How quickly and cost-effectively does it execute SQL?</span></li>
<li><span style="font-style: normal;">What analytic functionality, SQL or otherwise, does it do a good job of executing?</span></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">And so, in undertaking such a selection, you need to start by addressing three issues:</span></p>
<ul>
<li><a href="../2009/09/10/analytic-speed-latency/">What does “speed” mean to you</a>?</li>
<li>What does “cost” mean to you?</li>
<li>What analytic functionality do you need anyway?</li>
</ul>
<p style="margin-bottom: 0in;"><span id="more-2297"></span>Key elements of cost* include:</p>
<ul>
<li>Software license and maintenance</li>
<li>Hardware purchase cost, maintenance, electric power, and computer room burden</li>
<li>Database and system administration</li>
<li>(For some use cases) Programming</li>
</ul>
<p style="margin-bottom: 0in;"><em>*Assuming a classical in-house IT shop, where products are typically bought rather than leased/rented. With outsourced and/or monthly-fee structures, the details change but the principles remain the same.</em></p>
<p style="margin-bottom: 0in;">Most of that can be evaluated pretty well via a spreadsheet, although things can get a bit tricky when you get to people costs, which are a large fraction of the whole. In particular, different analytic DBMS product suites have great, high-performance support for different (and often rapidly growing) sets of functionality – basic and advanced SQL, statistics, and more. Figuring out which ones will be best for your programmers, and how significant the differences are &#8212; well, that&#8217;s a lot like any other programming language evaluation, and those are rarely neat or clean-cut.</p>
<p style="margin-bottom: 0in; font-style: normal;">But when it comes to evaluating speed, <strong>there&#8217;s no substitute for a well-designed proof of concept (POC).</strong> Many analytic DBMS and appliance vendors are happy to let you do a POC, on your own premises (or remotely if you prefer), under your control, at no cost to you. And that&#8217;s great. <strong>It is crucial that a POC be run either by you, by a consultant* answerable to you,</strong><span style="font-weight: normal;"> or – if you decide the vendor must run it for you – at least </span><strong>with you watching every step of the way</strong><span style="font-weight: normal;"> and knowing exactly what is being done.</span> Appliance vendors do find it cheaper to run POCs on their own premises, so a certain reluctance to ship you a box is understandable. But <strong>make no compromises about the transparency of a POC, or about your control of exactly what it is that gets tested.</strong></p>
<p style="margin-bottom: 0in;"><em>*Since I sell <a href="http://www.monash.com/adviseusers.html">consulting services</a> for users evaluating analytic DBMS, I naturally am biased to think that consultants can be very useful in the process. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  But whether you should use them a little (sanity check), a medium amount (work with you through the process), or heavily (actually drive the process for you and/or execute the POCs) is very dependent upon your specific situation.</em></p>
<p style="margin-bottom: 0in; font-style: normal;">So far as I&#8217;ve been able to tell:</p>
<ul>
<li><span style="font-style: normal;">Netezza loves to ship boxes to prospects for POCs, and have them set up the boxes and do POCs themselves. That&#8217;s a big reason why <a href="../2009/02/18/the-netezza-guys-propose-a-poc-checklist/">Netezza wants to call attention to this subject</a>.</span></li>
<li><span style="font-style: normal;">Oracle has generally been pretty <a href="../2009/02/01/oracle-says-they-do-onsite-exadata-pocs-after-all/">reluctant to ship Exadata boxes out for POCs</a>. That&#8217;s the other reason Netezza wants to call attention to the issue. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </span></li>
<li><span style="font-style: normal;">Open source vendors make it easy for you to download and test at least their community editions.</span></li>
<li><span style="font-style: normal;">Vertica makes it pretty easy for you to test its software too (download or cloud).</span></li>
<li><span style="font-style: normal;">ParAccel has generally insisted on running POCs itself, although it will do so on your premises if you insist.</span></li>
<li><span style="font-style: normal;">Teradata naturally tries to do POCs on its own premises, but doesn&#8217;t insist too hard.<em> (Edit: Randy Lea of Teradata says that Teradata is now doing over half its POCs onsite.)</em></span></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">Most of the criticisms I&#8217;ve heard of vendors&#8217; POC practices have been directed at Oracle or ParAccel.</span></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">For most POCs, it&#8217;s a good conceptual template to </span><span style="font-style: normal;"><strong>form and then test a hypothesis</strong></span><span style="font-style: normal;"> to the effect of:</span></p>
<ul>
<li><span style="font-style: normal;">For a given technology product assemblage (brand of DBMS, number of nodes, etc.), and</span></li>
<li><span style="font-style: normal;">For a given level of human effort (e.g., administrative effort), you can</span></li>
<li><span style="font-style: normal;">Run a given workload, with</span></li>
<li><span style="font-style: normal;">Satisfactory and satisfactorily consistent response times</span></li>
</ul>
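<p style="margin-bottom: 0in;">One way to turn the last line of that hypothesis into something checkable is sketched below in Python. Everything specific in it is an assumption for illustration: the function name, the choice of the 95th percentile as the measure of &#8220;satisfactory,&#8221; the ratio of that percentile to the median as the measure of &#8220;consistent,&#8221; the thresholds, and the sample timings, which are made up. Substitute whatever your own SLAs actually say.</p>

```python
import statistics


def evaluate_poc(response_times_sec, target_sec, max_spread):
    """Check one workload run against a hypothetical response-time hypothesis.

    satisfactory: the 95th-percentile response time is within target_sec.
    consistent:   the 95th percentile is at most max_spread times the median.
    Expects at least one measurement.
    """
    ordered = sorted(response_times_sec)
    median = statistics.median(ordered)
    # Nearest-rank 95th percentile, computed with integer arithmetic.
    rank = (95 * len(ordered) + 99) // 100
    p95 = ordered[rank - 1]
    return {
        "median": median,
        "p95": p95,
        "satisfactory": p95 <= target_sec,
        "consistent": p95 <= max_spread * median,
    }


# Made-up timings in seconds; the second run is the first minus one slow query.
with_outlier = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 1.0, 1.1, 6.0]
steady = with_outlier[:-1]

result_outlier = evaluate_poc(with_outlier, target_sec=2.0, max_spread=3.0)
result_steady = evaluate_poc(steady, target_sec=2.0, max_spread=3.0)
print(result_outlier)
print(result_steady)
```

<p style="margin-bottom: 0in;">The point of checking a percentile and a spread, rather than an average, is that a single slow query can leave the average looking acceptable while still failing the hypothesis as stated.</p>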
<p style="margin-bottom: 0in;"><span style="font-style: normal;">Sometimes absolute throughput and price/performance are important </span><em>secondary</em><span style="font-style: normal;"> considerations; sometimes they&#8217;re less germane. But either way, it&#8217;s almost always right to focus </span><em>primarily</em><span style="font-style: normal;"> on the questions of </span><span style="font-style: normal;"><strong>“What do I want this system to do?”</strong></span><span style="font-style: normal;"> and </span><span style="font-style: normal;"><strong>“What do I think we&#8217;re going to have to invest in it?”</strong></span><span style="font-style: normal;"> By way of contrast, it&#8217;s often misleading to focus too much on questions like “<a href="../2008/11/19/data-warehouse-proof-of-concept-pocs/">What&#8217;s the one number that best describes the performance of this system?</a>” &#8212; even if you customize that calculation for your environment – or, even worse, “How much speed-up can I get on my single worst <a href="../2008/11/15/query-from-hell/">Query from Hell</a>?” </span></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">The fundamental rule of POC construction is: </span><span style="font-style: normal;"><strong>Model your entire use case as best you can.</strong></span><span style="font-style: normal;"> That means you need to consider, at a minimum:</span></p>
<ul>
<li><span style="font-style: normal;">Your whole concurrent query, other analytic, and low-latency update workload (peak).</span></li>
<li><span style="font-style: normal;">Your whole query, analytic, load, backup, and maintenance workload (ongoing).</span></li>
<li><span style="font-style: normal;"><a href="../2008/12/14/the-%E2%80%9Cbaseball-bat%E2%80%9D-test-for-analytic-dbms-and-data-warehouse-appliances/">Partial-failure scenarios</a>.</span></li>
<li><span style="font-style: normal;">Your core SLAs (Service-Level Agreements).</span></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">Of course, that&#8217;s not as easy as it sounds. Presumably, the main reason you&#8217;re getting a new analytic DBMS is that you want to do new kinds of analysis. By the very nature of analytics, you won&#8217;t know what analytic operations are most useful until you try them out and see what their results are. On the other hand – if you haven&#8217;t done considerable thinking about how you&#8217;re going to use your new analytic database, how did you ever get funding for the project in the first place? <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </span></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">Seriously, I could write multiple posts, each as long as this one (but more application-oriented), about how to upgrade your analytic capabilities (and which fool&#8217;s gold to avoid). But this has gotten pretty long already, so for now I&#8217;ll just stop here.</span></p>
<p style="margin-bottom: 0in;"><em>Note: My clients at Netezza asked me to write something short about POCs they could use as a kind of foreword to some collateral, where by &#8220;short&#8221; they meant single-paragraph or something like that. They&#8217;re great clients, so I said yes, under the condition I could also use it as a blog post. Except … this post didn&#8217;t turn out to be nearly as short as they envisioned. Oops. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </em></p>
<p style="margin-bottom: 0in;"><em><strong>Related links</strong></em></p>
<ul>
<li>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">My February, 2009 <a href="../2009/02/25/even-more-final-version-of-my-tdwi-slide-deck/">slide deck on how to select an analytic DBMS</a> is in many parts still pretty current.</span></p>
</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/06/14/best-practices-analytic-database-poc/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>

