<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; Netezza</title>
	<atom:link href="http://www.dbms2.com/category/products-and-vendors/netezza/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Wed, 08 Feb 2012 12:22:57 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>Hope for a new PostgreSQL era?</title>
		<link>http://www.dbms2.com/2011/11/23/hope-for-a-new-postgresql-era/</link>
		<comments>http://www.dbms2.com/2011/11/23/hope-for-a-new-postgresql-era/#comments</comments>
		<pubDate>Wed, 23 Nov 2011 14:18:00 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[EnterpriseDB and Postgres Plus]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[salesforce.com]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5728</guid>
		<description><![CDATA[In a comedy of briefing errors, I&#8217;m not too clear on the details of my client salesforce.com&#8217;s new PostgreSQL-as-a-service offering, nor exactly on what my clients at VMware are bringing to the PostgreSQL virtualization/cloud party. That said: PostgreSQL is good technology. MySQL is narrowing the gap, but PostgreSQL is still ahead of MySQL in some [...]]]></description>
			<content:encoded><![CDATA[<p>In a comedy of briefing errors, I&#8217;m not too clear on the details of my client <a href="http://gigaom.com/cloud/heroku-launches-sql-database-as-a-service/">salesforce.com&#8217;s new PostgreSQL-as-a-service offering</a>, nor exactly on what my clients at VMware are bringing to the PostgreSQL virtualization/cloud party. That said:</p>
<ul>
<li>PostgreSQL is good technology.</li>
<li>MySQL is narrowing the gap, but PostgreSQL is still ahead of MySQL in some ways.  (Database extensibility if nothing else.)</li>
<li>PostgreSQL has a lot of users. (Many of them in academia and/or Russia.)</li>
<li>Neither EnterpriseDB (which now calls itself &#8220;The enterprise PostgreSQL company&#8221;) nor the PostgreSQL community leadership have covered themselves with stewardship glory.</li>
<li>A significant number of interesting DBMS products can be regarded as PostgreSQL forks (e.g. Greenplum, Aster Data nCluster, Netezza if you squint, and Vertica if you stand on your head*).</li>
<li>PostgreSQL advancement is not dead. For example, <a href="../../../../../2011/11/08/hadapt-is-moving-forward/">Hadapt beta users are running actual PostgreSQL on many nodes each</a>.</li>
<li><a href="../../../../../2009/12/14/oracle-mysql-storage-engine/">There&#8217;s no assurance that Oracle will be a benevolent MySQL steward forever</a>. (Specifically, Oracle&#8217;s &#8220;Play nicely with others&#8221; antitrust commitments expire in 2014.)</li>
</ul>
<p>So I think it would be cool if one or the other big company put significant wood behind the PostgreSQL arrow.</p>
<p><em>*While Vertica was originally released using little or no PostgreSQL code &#8212; reports varied &#8212; it featured high degrees of PostgreSQL compatibility.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/23/hope-for-a-new-postgresql-era/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Some big-vendor execution questions, and why they matter</title>
		<link>http://www.dbms2.com/2011/11/21/big-vendor-execution-analytics/</link>
		<comments>http://www.dbms2.com/2011/11/21/big-vendor-execution-analytics/#comments</comments>
		<pubDate>Mon, 21 Nov 2011 11:01:20 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Cognos]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[HP and Neoview]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[In-memory DBMS]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Memory-centric data management]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[SAP AG]]></category>
		<category><![CDATA[Vertica Systems]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5704</guid>
		<description><![CDATA[When I drafted a list of key analytics-sector issues in honor of look-ahead season, the first item was &#8220;execution of various big vendors&#8217; ambitious initiatives&#8221;.  By &#8220;execute&#8221; I mean mainly: &#8220;Deliver products that really meet customers&#8217; desires and needs.&#8221; &#8220;Successfully convince them that you&#8217;re doing so &#8230;&#8221; &#8220;&#8230; at an attractive overall cost.&#8221; Vendors mentioned [...]]]></description>
			<content:encoded><![CDATA[<p>When I drafted a list of key analytics-sector issues in honor of <a href="http://www.dbms2.com/2011/11/21/analytic-trends-in-2012-qa/">look-ahead season</a>, the first item was &#8220;execution of various big vendors&#8217; ambitious initiatives&#8221;.  By &#8220;execute&#8221; I mean mainly:</p>
<ul>
<li>&#8220;Deliver products that really meet customers&#8217; desires and needs.&#8221;</li>
<li>&#8220;Successfully convince them that you&#8217;re doing so &#8230;&#8221;</li>
<li>&#8220;&#8230; at an attractive overall cost.&#8221;</li>
</ul>
<p>Vendors mentioned here are Oracle, SAP, HP, and IBM. Anybody smaller got left out due to the length of this post. Among the bigger omissions were:</p>
<ul>
<li>salesforce.com (multiple subjects).</li>
<li><a href="../../../../../2011/04/21/sas-hpa-does-make-sense-after-all/">SAS HPA</a>.</li>
<li><a href="../../../../../2011/08/21/hadoop-evolution/">The evolution of Hadoop</a>.</li>
</ul>
<p><span id="more-5704"></span><strong>A (lingering) issue for SAP and Oracle alike</strong></p>
<p>As I noted in January of this year, <a href="../../../../../2011/01/03/the-six-useful-things-you-can-do-with-analytic-technology/">integration of business intelligence into operational apps is making very slow progress</a>. Even so, it&#8217;s a huge part of the apparent strategy at SAP and Oracle alike, as well it should be. Much of the benefit from automating routine desk work has already happened. The areas ripest for exploitation are the ones where analytics are part of the equation.</p>
<p>Given the lack of tangible progress, why do I think this is a genuine area of Oracle and SAP emphasis? Three reasons of many are:</p>
<ul>
<li>Why else did SAP buy Business Objects?</li>
<li>If they&#8217;re not trying to <a href="../../../../../2011/03/30/short-request-and-analytic-processing/">integrate operational apps and analytics</a>, why else does SAP&#8217;s emphasis on HANA make sense?</li>
<li>Without business intelligence in the picture, how does Oracle&#8217;s integrated-stack story promise any direct user benefits?*</li>
</ul>
<p><em>*As opposed to IT concerns &#8212; integration, administration, TCO (Total Cost of Ownership), etc.</em></p>
<p>After so many years of disappointment, I&#8217;m not going to forecast 2012 as a pivotal year for <strong>the integration of business intelligence into operational applications.</strong> But if one of SAP or Oracle ever does get a significant BI/operational app integration advantage over the other, it could be a major competitive advantage in those application market segments that are still up for grabs. It also is an opportunity for both vendors to gain BI market share in their respective application customer bases.</p>
<p><strong>A more urgent issue for SAP</strong></p>
<p>SAP has put huge amounts of credibility on the line for HANA, the integration of two different and not particularly mature in-memory database technologies. So far, it is difficult to find evidence that HANA is robust enough for widespread adoption. Whether or not SAP can fix that is a huge open question, which could have significant impact on the course of several technology areas: applications, business intelligence, in-memory DBMS, and maybe even hardware.</p>
<p>Based on current information, which is admittedly partial, I&#8217;m a short-term pessimist on HANA. Longer-term, I&#8217;m on record as saying that <a href="../../../../../2011/05/23/databases-ram/">traditional databases will eventually wind up in RAM</a>. SAP will surely get that technology right some day, whether or not the way it does so has anything to do with present-day HANA code.</p>
<p><strong>Four more issues for Oracle</strong></p>
<p>Oracle&#8217;s ambitions are near-endless, and so also therefore is its list of execution challenges. Four in the analytics area that I find particularly interesting are:</p>
<ul>
<li><strong>True hybrid columnar DBMS.</strong> <a href="../../../../../2011/09/22/teradata-columnar-compression/">I was guessing that Oracle, like Teradata, would announce true hybrid columnar the week of Oracle OpenWorld</a>. I was wrong. But if Oracle can&#8217;t bring out true hybrid columnar DBMS functionality relatively soon, Exadata will lose credibility as a competitor to more specialized analytic DBMS.</li>
<li><strong>Oracle Exalytics.</strong> With Exalytics in the mix, Oracle&#8217;s technology stack has HANA-like potential. But will Exalytics even ship in 2012? (I think so.) Will it be good for much in the first release? (I&#8217;m skeptical.)</li>
<li><strong>Oracle&#8217;s Big Data Appliance</strong>. I&#8217;m skeptical both about <a href="../../../../../2011/10/20/more-notes-on-oracle-nosql/">Oracle&#8217;s NoSQL product</a> &#8212; <a href="http://www.infoworld.com/d/data-explosion/first-look-oracle-nosql-database-179107">a favorable InfoWorld review</a> notwithstanding &#8212; and <a href="../../../../../2011/09/23/hadoop-appliances/">Hadoop appliances</a>. But if I&#8217;m wrong, and Oracle can successfully embrace/extend the new non-relational paradigms, then it really might regain control over the evolution of data management.</li>
<li><strong><a href="../../../../../2011/10/18/oracle-is-buying-endeca/">Oracle&#8217;s Endeca acquisition</a></strong> &#8212; will Oracle prove me wrong and integrate Endeca effectively into its overall analytic product line? If it does, we might finally see effective text (and eventually speech) navigation of enterprise software. (But as with all Oracle issues cited here, this is something that probably won&#8217;t amount to much in 2012 even if it does later go well.)</li>
</ul>
<p><strong>Three issues for IBM</strong></p>
<p>Like Oracle, IBM is a huge company with many ambitions and hence many execution challenges. The biggest of those is surely: <strong>How effective can IBM be at selling outside its existing customer base?</strong> I don&#8217;t hear as much competitively about IBM DataStage, IBM SPSS or now IBM Netezza as I did when their vendors were independent companies. Even Cognos may not be much of an exception to the rule, although it has its own large customer base outside of IBM&#8217;s traditional one. (To lesser extents, the same is of course true of Netezza and numerous other IBM acquisitions.)</p>
<p>Another general issue for IBM is <strong>substantively integrating its various product lines,</strong> at least to the extent that makes sense. DB2/Netezza integration sounds good, but even that is more a matter of product marketing (the admirable part of that discipline) than of actual technology. Other integrations (e.g. Cognos/DB2 in various bundles) have tended toward the dubious side.*</p>
<p><em>*I&#8217;m still waiting for IBM to get back to me with examples of how Cognos/DB2 joint tuning amounts to anything. It&#8217;s been more than a year, so I&#8217;m glad I didn&#8217;t hold my breath.</em></p>
<p>In a somewhat narrower vein, I wonder: <strong><a href="../../../../../2011/11/10/cep-streaming-catchup/">Will IBM be able to gain traction for InfoSphere Streams</a>? </strong>And if so, when and where will the traction be?</p>
<p><strong>Will HP screw up Vertica?</strong></p>
<p>Vertica has a very attractive product offering. It&#8217;s perhaps <a href="../../../../../2011/06/20/columnar-dbms-vendor-customer-metrics/">the most scalable analytic DBMS outside of Teradata</a>, running on the hardware of your reasonable choice.  It&#8217;s also the one I recommend most often to clients in the 1-50 terabyte range.</p>
<p>So far HP doesn&#8217;t seem to have done much to leadfoot Vertica. (About all I&#8217;ve heard from competitors is that Vertica seems to have faded somewhat in the financial services market, and there could be multiple explanations if that is indeed true.) But if HP Vertica does somehow manage to botch things, opportunities will open up for a range of columnar analytic DBMS competitors.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/21/big-vendor-execution-analytics/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Remote machine-generated data</title>
		<link>http://www.dbms2.com/2011/07/26/remote-machine-generated-data/</link>
		<comments>http://www.dbms2.com/2011/07/26/remote-machine-generated-data/#comments</comments>
		<pubDate>Tue, 26 Jul 2011 08:45:52 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Truviso]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5012</guid>
		<description><![CDATA[I refer often to machine-generated data, which is commonly generated inexpensively and in log-like formats, and is often best aggregated in a big bit bucket before you try to do much analysis on it. The term has caught on, to the point that perhaps it&#8217;s time to distinguish more carefully among different kinds of machine-generated [...]]]></description>
			<content:encoded><![CDATA[<p>I refer often to <a href="../../../../../2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a>, which is commonly generated inexpensively and in log-like formats, and is often best aggregated in a <a href="../../../../../2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a> before you try to do much analysis on it. The term has caught on, to the point that perhaps it&#8217;s time to distinguish more carefully among different <em>kinds</em> of machine-generated data. In particular, I think it may be useful to distinguish between:</p>
<ul>
<li><strong>Log-stream machine-generated data,</strong> when what you&#8217;re looking at &#8212; at least initially &#8212; is the entire output of verbose logging systems.</li>
<li><strong>Remote machine-generated data.</strong></li>
</ul>
<p>Here&#8217;s what I&#8217;m thinking of for the second category. I rather frequently hear of cases in which data is generated by large numbers of remote machines, which occasionally send messages home. For example:  <span id="more-5012"></span></p>
<ul>
<li>I heard yesterday about a case with 10s of millions of machines, phoning home every 5 minutes, and another case with 10s of 1000s calling in every 5 seconds, both of them sending data initially to MySQL. (Application details weren&#8217;t given.)</li>
<li>I heard not long ago about a set-top box case that the vendor hoped would also grow to 10s of millions of machines, which I guessed might send a small number of messages per hour each.</li>
<li>I also heard recently about a remote security monitoring case whose data was destined for (probably) Netezza, although in that case I&#8217;m not sure about the &#8220;occasionally&#8221; aspect of the communication.</li>
<li>The last time I visited Splunk, I got the sense that energy-sensor use cases (especially in the electric grid) had finally emerged. I believe these sensors are periodic message senders &#8212; they wake up, take their temperature (figuratively or literally as the case may be), send a message, snooze, repeat.</li>
<li>I would guess that the <a href="../../../../../2009/10/14/infobright-notes/">energy use cases</a> Infobright talked about in 2009 were of a similar kind.</li>
<li>An April, 2010 comment on the post linked above talks about <a href="../../../../../2010/04/08/machine-generated-data-example/#comment-165006">many kinds of sensor data</a>.</li>
<li>Back in 2007, <a href="../../../../../2007/08/12/applications-for-not-so-low-latency-cep/">Coral8</a> talked of a truck phone-home use case (on-board GPS data and also, e.g., refrigeration level, sending messages once a minute or so). Truviso seemed to have one similar deal before one of its frequent changes in strategic direction, and not coincidentally cites UPS as an investor.</li>
<li>In principle, there are a lot of RFID use cases out there, even if I rarely seem to hear of any. (That would be a shorter &#8220;phone call&#8221; home than most of the other examples, of course, but might be otherwise technically similar.)</li>
</ul>
<p>Many technologies can be used to collect and manage remote machine-generated data, but a few common points are worth noting.</p>
<ul>
<li>If a device takes the trouble to send a message across a wide-area network, that message may be somewhat more valuable than the average piece of log-vomit. Perhaps such information doesn&#8217;t need to be stored in the cheapest possible way.</li>
<li>Similarly, a message that is sent occasionally over time, or upon a specified event, may be more structured than a random log entry. Perhaps such data is suitable for sending straight to a <strong>relational database</strong>.</li>
<li>If there&#8217;s no central place the data originates, there may also be no favored place for the data to end up. It may make great sense to collect and analyze remote machine-generated data in the <strong>cloud</strong>. (Exceptions may of course arise if you want to use the data in connection with other information, and you hence want to bring it to that other information&#8217;s location.)</li>
<li>In a number of use cases, the whole point is to identify anomalies, and respond to them rapidly. I.e., remote machine-generated data use cases commonly raise challenges in low-latency <a href="../../../../../2011/03/30/short-request-and-analytic-processing/">integration of short-request and analytic processing</a>.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/26/remote-machine-generated-data/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Eight kinds of analytic database (Part 2)</title>
		<link>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/</link>
		<comments>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/#comments</comments>
		<pubDate>Tue, 05 Jul 2011 08:18:18 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Archiving and information preservation]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Buying processes]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Complex event processing (CEP)]]></category>
		<category><![CDATA[Data mart outsourcing]]></category>
		<category><![CDATA[Data types]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MOLAP]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Rainstor]]></category>
		<category><![CDATA[SAND Technology]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[SenSage]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4867</guid>
		<description><![CDATA[In Part 1 of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I&#8217;ll cover four more kinds of analytic database &#8212; even newer, for the most part, with a use case/product short list [...]]]></description>
			<content:encoded><![CDATA[<p>In <a href="http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/">Part 1</a> of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I&#8217;ll cover four more kinds of analytic database &#8212; even newer, for the most part, with a use case/product short list match that is even less clear.  <span id="more-4867"></span></p>
<p><strong><em>Bit bucket</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included: </em>Logs, other technical/external</li>
<li><em>Likely use styles:</em> Staging/ETL, investigative</li>
<li><em>Canonical example: </em>Log files in a Hadoop cluster</li>
<li><em>Stresses:</em> TCO, scale-out, transform/big-query performance, ETL functionality</li>
</ul>
<p>With the explosion of <a href="../../../../../2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a> has come the need for a place to put it all, sometimes called the <a href="../../../../../2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a>. This is like the investigative data mart for big databases, but more <a href="../../../../../2011/05/17/poly-structured-database/">poly-structured</a>. In some cases it is focused on data staging and transformation; but it can also be used for analysis in place.</p>
<p>The list of candidate technologies to run your bit bucket starts with Hadoop and Splunk.</p>
<p><strong><em>Archival data store</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included: </em>Operational, CDR (call detail record), security log</li>
<li><em>Likely use styles:</em> Archival, reporting (for compliance), possibly also investigative</li>
<li><em>Examples:</em> Any long-term detailed historical store</li>
<li><em>Stresses: </em>TCO, compression, scale-out, performance (if multi-use)</li>
</ul>
<p>Analytic DBMS vendors have been insulting each other with the claim &#8220;that&#8217;s just an archival data store,&#8221; dating back at least to the first time Greenplum was deployed on an underpowered Sun Thumper system. Perhaps only <a href="../../../../../2010/06/11/rainstor-update/">Rainstor</a> truly embraces the archival positioning, and I&#8217;ve become pretty dubious about their technical claims and their company alike.</p>
<p>Still, there&#8217;s a legitimate need for data stores, especially relational analytic DBMS, that:</p>
<ul>
<li>Store data cheaply, with high rates of compression.</li>
<li>Have decent performance if you do want to query the data.</li>
<li>May have archiving/compliance-specific features as well.</li>
</ul>
<p>Along with Rainstor, SAND and SenSage have at least partially targeted that use case. In addition, appliance vendors such as Teradata and Netezza try to have an archive-oriented product version in their lineups.</p>
<p><strong><em>Outsourced data mart</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All</li>
<li><em>Likely use styles:</em> Traditional BI, investigative analytics, staging/ETL</li>
<li><em>Examples:</em> Advertising tracking, SaaS CRM</li>
<li><em>Stresses:</em> Performance, TCO, reliability, concurrency</li>
</ul>
<p>Much of what happens in analytic database management can also be outsourced. Some applications that run via SaaS (Software as a Service) are analytic. I&#8217;ve had three different clients whose main business is picking marketing targets in various vertical segments; others who wanted to add analytics to what were historically OLTP applications; and others yet who just offered online business intelligence. Also, if your fundamental business is gathering data and reselling it to a variety of user organizations, that&#8217;s an analytic data management challenge. The possibilities expand from there.</p>
<p>Data outsourcers are in the IT business, and so their IT development is &#8212; hopefully! &#8212; more serious and less politically encumbered than at many conventional enterprises. Thus, legacy systems and master data management issues are commonly less prevalent, or at least more aggressively disposed of. The same, up to a point, goes for vendor politics.*  <a href="../../../../../2011/06/26/what-to-think-about-before-you-make-a-technology-decision/">Multitenancy</a> is commonly an issue, as is running in the cloud.</p>
<p><em>*Even so, there&#8217;s often That Guy who doesn&#8217;t want to migrate away from Oracle, no matter what.</em></p>
<p>Vertica gets the nod in a number of these cases; it&#8217;s cloud-friendly, and often the problem is naturally columnar. Other columnar products can be good choices too, with added brownie points for Infobright if the shop is MySQL-oriented anyway. Running Netezza or other appliances makes sense mainly if you&#8217;re pretty sure you want to keep operating your own data centers, but some data outsourcers are just fine with that assumption.</p>
<p><strong><em>Operational analytic(s) server</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> Customer-centric, log, financial trade</li>
<li><em>Likely use styles:</em> Advanced operational analytics</li>
<li><em>Examples:</em>
<ul>
<li>Lower latency: Web or call-center personalization, anti-fraud</li>
<li>Higher latency: Customer profiling, Basel 3 risk analysis</li>
</ul>
</li>
<li><em>Stresses:</em> Performance, reliability, analytic functionality, perhaps concurrency</li>
</ul>
<p>Even with eight different choices, I need a &#8220;catch-all&#8221; category; this is it.</p>
<p>Suppose you want to do reasonably sophisticated analytics, then use the results in operations. This is the classical challenge in <a href="../../../../../2011/03/30/short-request-and-analytic-processing/">integrating short-request and analytic processing</a>. There are multiple ways to tackle it, embodying different trade-offs in cost, convenience, or analytic accuracy. If the platform on which you want to run your investigative analytics also has the reliability and concurrency appropriate for mission-critical operations, you&#8217;re set. Otherwise, you may want to pipe <a href="../../../../../2010/11/29/data-that-is-derived-augmented-enhanced-adjusted-or-cooked/">derived data</a> into a more &#8220;industrial-strength&#8221; DBMS, ideally the one that runs your operational apps anyway.</p>
<p>Another option is to integrate a limited amount of analytics immediately into your short-request processing system. For example, as bad as they are at the kinds of queries that require joins, NoSQL systems are often fast at simple aggregations. As MapReduce/NoSQL integrations mature, that option may not require pumping the data anywhere else for deeper analytics; even if it does, at least you&#8217;re starting out with the data in a convenient bit bucket.</p>
<p>Streaming/CEP-centric architectures could come into play as well. And it goes on from there. The possibilities in this last category are just too varied to generalize about.</p>
<p><em>So did I get them all? Or are there yet other analytic data management use cases that don&#8217;t fit into my eight categories?</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Eight kinds of analytic database (Part 1)</title>
		<link>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/</link>
		<comments>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/#comments</comments>
		<pubDate>Tue, 05 Jul 2011 08:17:44 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Benchmarks and POCs]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Buying processes]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[Infobright]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MOLAP]]></category>
		<category><![CDATA[Microsoft and SQL*Server]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[ParAccel]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Pricing]]></category>
		<category><![CDATA[QlikTech and QlikView]]></category>
		<category><![CDATA[SAND Technology]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Sybase]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[Workload management]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4868</guid>
		<description><![CDATA[Analytic data management technology has blossomed, leading to many questions along the lines of &#8220;So which products should I use for which category of problem?&#8221; The old EDW/data mart dichotomy is hopelessly outdated for that purpose, and adding a third category for &#8220;big data&#8221; is little help. Let&#8217;s try eight categories instead. While no categorization [...]]]></description>
			<content:encoded><![CDATA[<p>Analytic data management technology has blossomed, leading to many questions along the lines of &#8220;So which products should I use for which category of problem?&#8221; The old EDW/data mart dichotomy is hopelessly outdated for that purpose, and adding a third category for &#8220;big data&#8221; is little help.</p>
<p>Let&#8217;s try eight categories instead. While <a href="http://www.strategicmessaging.com/no-market-categorization-is-ever-precise/2011/03/01/">no categorization is ever perfect</a>, these each have at least some degree of technical homogeneity. Figuring out which types of analytic database you have or need &#8212; and in most cases you&#8217;ll need several &#8212; is a great early step in your analytic technology planning.  <span id="more-4868"></span></p>
<p><strong><em>Enterprise data warehouse</em></strong> (Full or partial)</p>
<ul>
<li><em>Kinds of data likely to be included:</em> All, but especially operational</li>
<li><em>Likely use styles:</em> All</li>
<li><em>Canonical example:</em> Central EDW for a big enterprise</li>
<li><em>Stresses:</em> Concurrency, reliability, workload management</li>
</ul>
<p>The enterprise data warehouse (EDW) ideal says that you copy all your data into one place, and drive all decision-making from there. <a href="../../../../../2011/06/21/its-official-the-grand-central-edw-will-never-happen/">Full EDWs are pipe dreams</a>. Still, a partial EDW makes sense for most large enterprises, and many indeed already have one. The first product lines to consider for classical EDWs are Teradata, DB2, Exadata, and maybe Microsoft SQL Server, especially if you&#8217;re going to stress concurrency and/or operational use cases.</p>
<p><strong><em>Traditional data mart</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All</li>
<li><em>Likely use styles:</em> Business intelligence, budgeting/consolidation, investigative</li>
<li><em>Examples:</em> Reporting servers, planning/consolidation servers, anything MOLAP, etc.</li>
<li><em>Stresses:</em> Performance, concurrency, TCO</li>
</ul>
<p>Whether or not you have something like an enterprise data warehouse, it&#8217;s common to have lighter-weight data marts as well. A traditional data mart might drive reports and dashboards. Or it might be specialized for budgeting, planning, and/or consolidation.  Some <a href="../../../../../2011/03/03/investigative-analytics/">investigative analytics</a> may be in the mix as well.</p>
<p>Any DBMS that can support an EDW can also support a data mart, but it may not be the most cost-effective way to do so. Columnar DBMS might have more attractive performance and TCO (Total Cost of Ownership); the same goes for Netezza. Some of them &#8212; e.g. Sybase IQ and <a href="../../../../../2011/06/20/vertica-release-5/">Vertica</a> &#8212; have excellent track records in concurrent usage as well. <a href="../../../../../2011/05/29/when-to-use-relational-database-management-system/">Ted Codd</a> pushed what amounts to MOLAP (Multidimensional OnLine Analytic Processing) systems for these use cases. But relational DBMS commonly do a better job, which is one reason most major MOLAP products have wound up at RDBMS companies.</p>
<p><strong><em>Investigative data mart &#8212; agile</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All, especially customer-centric</li>
<li><em>Likely use styles</em>: Investigative</li>
<li><em>Canonical example:</em> A few analysts getting a few TB to examine</li>
<li><em>Stresses:</em> Ease of setup/load, ease of admin, price/performance</li>
</ul>
<p>Besides the traditional data mart, there are at least two other kinds. Both are focused on investigative analytics, but they&#8217;re differentiated by database size.</p>
<p>If you have just a few analysts,* looking at no more than a few terabytes of data (perhaps even just some gigabytes) &#8212; and if that data is &#8220;single-subject&#8221; and fairly homogeneous &#8212; your watchwords should be &#8220;cheap&#8221;, &#8220;easy&#8221;, and &#8220;fast&#8221;. You don&#8217;t need to invest in much hardware, expensive software, or much administrative effort (the analysts can be their own DBAs), nor should you endure much set-up time. Just grab a product, grab some data, and start running queries (or extracts into the statistical tool of your choice).</p>
<p><em>*If you have dozens or even hundreds of analysts hitting the same database, you&#8217;re probably back to the more concurrency-oriented scenarios outlined above.</em></p>
<p>Infobright is often cost-effective among columnar analytic DBMS. Other vendors might cut you a price break as well. If you have multiple terabytes of data, don&#8217;t rule out Netezza&#8217;s lowest-end products (even if they&#8217;d really rather sell you something bigger). Or, if you&#8217;re in the sub-terabyte range, maybe you can get by with an in-memory BI tool such as QlikView, and not do anything special on the DBMS side at all.</p>
<p><strong><em>Investigative data mart &#8212; big</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All, especially customer-centric, logs, financial trade, scientific</li>
<li><em>Likely use styles</em>: Investigative</li>
<li><em>Canonical example:</em> Single-subject 20 TB &#8211; 20 PB relational database</li>
<li><em>Stresses:</em> Performance, scale-out, analytic functionality</li>
</ul>
<p>But if you&#8217;re looking at tens of terabytes of relational data, or even more, you really do have a &#8220;big data&#8221; problem. Performance and scalability are major challenges, usually best addressed by MPP (Massively Parallel Processing) systems, such as Netezza, Vertica, Aster Data, ParAccel, Teradata, or Greenplum. Performance POCs (Proofs Of Concept) are a big part of the buying process. Vendor price negotiations are crucial too.</p>
<p><em>Actually, in the low tens of terabytes you might be able to get away with a shared-disk system that has excellent compression &#8212; e.g., columnar products like Sybase IQ, Infobright, or SAND, rather than just Vertica and ParAccel.</em></p>
<p>Assuming you have affordable, scalable query performance, the competitive differentiator can switch to additional analytic functionality. Aster, Netezza, ParAccel, Vertica, and Greenplum either offer full <a href="../../../../../2011/02/24/analytic-platforms/">analytic platforms</a>, or seem to be on the path to doing so. Teradata, which now owns Aster Data, offers substantial built-in analytic capability in its traditional products as well, and the same goes for Sybase IQ.</p>
<p><em>Continued in <a href="http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/">Part 2</a>, where we cover some of the more difficult use cases.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>It&#8217;s official &#8212; the grand central EDW will never happen</title>
		<link>http://www.dbms2.com/2011/06/21/its-official-the-grand-central-edw-will-never-happen/</link>
		<comments>http://www.dbms2.com/2011/06/21/its-official-the-grand-central-edw-will-never-happen/#comments</comments>
		<pubDate>Wed, 22 Jun 2011 01:54:46 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[Netezza]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4804</guid>
		<description><![CDATA[I pointed out last year that the grand central enterprise data warehouse couldn&#8217;t happen; the post started: An enterprise data warehouse should: Manage data to high standards of accuracy, consistency, cleanliness, clarity, and security. Manage all the data in your organization. Pick ONE. IBM&#8217;s main theme at the Enzee Universe conference has been to say [...]]]></description>
			<content:encoded><![CDATA[<p>I pointed out last year that <a href="http://www.dbms2.com/2010/04/12/enterprise-data-warehouse-edw-myt/">the grand central enterprise data warehouse couldn&#8217;t happen</a>; the post started:</p>
<blockquote><p>An <strong>enterprise data warehouse</strong> should:</p>
<ul>
<li>Manage data to high standards of <strong>accuracy, consistency, cleanliness, clarity, and security.</strong></li>
<li>Manage <strong>all the data in your organization.</strong></li>
</ul>
<p><strong>Pick ONE.</strong></p></blockquote>
<p>IBM&#8217;s main theme at the Enzee Universe conference has been to say the same thing.</p>
<p>Merv Adrian&#8217;s talk at the same conference made it clear that Gartner feels the same way, as does he personally. Indeed, like me, he&#8217;s racked up multiple decades of industry experience without ever finding a single theoretically ideal grand central EDW.</p>
<p>Forrester Research has been a little less clear on the point, but generally seems to be on the correct side of the issue as well.</p>
<p>If somebody is still saying that one central enterprise data warehouse can hold all the data you need to base your business decisions on, they&#8217;re probably not somebody you should be listening to very hard.</p>
<p>Is that clear, or should I hammer home the point even harder? <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_biggrin.gif' alt=':D' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/06/21/its-official-the-grand-central-edw-will-never-happen/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Investigative analytics and derived data: Enzee Universe 2011 talk</title>
		<link>http://www.dbms2.com/2011/06/19/investigative-analytics-derived-data/</link>
		<comments>http://www.dbms2.com/2011/06/19/investigative-analytics-derived-data/#comments</comments>
		<pubDate>Sun, 19 Jun 2011 12:13:04 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[GIS and geospatial]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4747</guid>
		<description><![CDATA[I&#8217;ll be speaking Monday, June 20 at IBM Netezza&#8217;s Enzee Universe conference. Thus, as is my custom: I&#8217;m posting draft slides. I&#8217;m encouraging comment (especially in the short time window before I have to actually give the talk). I&#8217;m offering links below to more detail on various subjects covered in the talk. The talk concept [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ll be speaking Monday, June 20 at IBM Netezza&#8217;s <a href="http://www.netezza.com/userconference/abstract.html#620_1200">Enzee Universe</a> conference. Thus, as is my custom:</p>
<ul>
<li>I&#8217;m posting draft <a href="http://www.monash.com/uploads/Enzee-Universe-2011-Monash.ppt">slides</a>.</li>
<li>I&#8217;m encouraging comment (especially in the short time window before I have to actually give the talk).</li>
<li>I&#8217;m offering links below to more detail on various subjects covered in the talk.</li>
</ul>
<p>The talk concept started out as &#8220;advanced analytics&#8221; (as opposed to fast query, a subject amply covered in the rest of any Netezza event), as a lunch break in what is otherwise a detailed &#8220;best practices&#8221; session. So I suggested we constrain the subject by focusing on a specific application area &#8212; customer acquisition and retention, something of importance to almost any enterprise, and which exploits most areas of analytic technology. Then I actually prepared the slides &#8212; and guess what? The mix of subjects will be skewed somewhat more toward generalities than I first intended, specifically in the areas of <strong>investigative analytics </strong>and<strong> derived data. </strong>And, as always when I speak, I&#8217;ll try to raise consciousness about the issues of <a href="../../../../../2011/01/10/privacy-dangers-an-overview/">liberty and privacy</a>, our <a href="../../../../../2010/07/04/fair-data-use/">options as a society for addressing them</a>, and the crucial role we play as an industry in <a href="../../../../../2010/04/04/privacy-liberty-continued/">helping policymakers deal with these technologically-intense subjects</a>.</p>
<p>Slide 3 refers back to a post I made last December, saying there are <a href="../../../../../2011/01/03/the-six-useful-things-you-can-do-with-analytic-technology/">six useful things you can do with analytic technology</a>:</p>
<ul>
<li><strong>Operational BI/Analytically-infused operational apps:</strong> You can make an immediate decision.</li>
<li><strong>Planning and budgeting:</strong> You can plan in support of future decisions.</li>
<li><strong>Investigative analytics (multiple disciplines):</strong> You can research, investigate, and analyze in support of future decisions.</li>
<li><strong>Business intelligence:</strong> You can monitor what’s going on, to see when it is necessary to decide, plan, or investigate.</li>
<li><strong>More BI:</strong> You can communicate, to help other people and organizations do these same things.</li>
<li><strong>DBMS, ETL, and other &#8220;platform&#8221; technologies:</strong> You can provide support, in technology or data gathering, for one of the other functions.</li>
</ul>
<p>Slide 4 observes that <a href="http://www.dbms2.com/2011/03/03/investigative-analytics/">investigative analytics</a>:</p>
<ul>
<li>Is the most rapidly advancing of the six areas &#8230;</li>
<li>&#8230; because it most directly exploits performance &amp; scalability.</li>
</ul>
<p>Slide 5 gives my simplest overview of investigative analytics technology to date:  <span id="more-4747"></span></p>
<ul>
<li>Fast query
<ul>
<li>Persistent storage (any data volume)</li>
<li>RAM (10s-100s of gigabytes, or more)</li>
</ul>
</li>
<li>Fast analytics
<ul>
<li>Predictive modeling</li>
<li>Transformation/tagging</li>
<li>Graph</li>
</ul>
</li>
</ul>
<p>Slide 6 points out that this is all supported by cheap data creation and acquisition, specifically in the area of <a href="http://www.dbms2.com/2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a>, which gets the full benefit of Moore&#8217;s Law.</p>
<p>Slides 7-13 point out how the example problem domain involves lots of analytic tasks performed on lots of kinds of data. Specific examples cited include <a href="http://www.dbms2.com/2011/04/14/attensity-update/">text analytics</a> and <a href="http://www.dbms2.com/2009/08/21/social-network-analysis-aka-relationship-analytics/">graph/relationship analytics</a>.</p>
<p>Slide 14 contains the punch line, so I&#8217;ll quote it in full:</p>
<blockquote><p>Derived data</p>
<ul>
<li>You can’t keep re-analyzing all that in raw form …</li>
<li>&#8230; so don’t.</li>
</ul>
<p><em>If you have one takeaway from this session, let it be the utter importance of derived data. </em></p></blockquote>
<p>Slide 16 lists kinds of <a href="http://www.dbms2.com/2011/05/30/another-category-of-derived-data/">derived data</a> that are important in the single application of reducing telco churn:</p>
<ul>
<li>Normalized data
<ul>
<li>Parsed/sessionized logs</li>
<li>Text/sentiment highlights</li>
<li>Social network graph(s)</li>
<li>Web de-anonymization</li>
<li>Household matching</li>
</ul>
</li>
<li>Scores and buckets
<ul>
<li>Demographic</li>
<li>Psychographic</li>
<li>Offer hot buttons</li>
<li>(Dis)satisfaction</li>
<li>Credit/fraud risk</li>
<li>Lifetime customer value</li>
<li>Influence on others!</li>
</ul>
</li>
</ul>
<p>And finally, Slide 17 is my first pass at best practices for dealing with derived data:</p>
<ul>
<li>Evolving data warehouse schema</li>
<li>Data marts
<ul>
<li>Physical or virtual</li>
<li>Inputs/outputs to “EDW”</li>
</ul>
</li>
<li>“Data science”
<ul>
<li>Research != production</li>
</ul>
</li>
<li>Multiple processing pipelines
<ul>
<li>Log parsing</li>
<li>Text</li>
<li>Predictive analytics</li>
<li>Generic ETL</li>
<li>Streaming “ETL”</li>
</ul>
</li>
</ul>
<p>That last list looks like a starting point for a whole set of interesting future posts.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/06/19/investigative-analytics-derived-data/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Alternatives for Hadoop/MapReduce data storage and management</title>
		<link>http://www.dbms2.com/2011/05/14/hadoop-mapreduce-data-storage-management/</link>
		<comments>http://www.dbms2.com/2011/05/14/hadoop-mapreduce-data-storage-management/#comments</comments>
		<pubDate>Sat, 14 May 2011 05:00:52 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[DataStax]]></category>
		<category><![CDATA[EMC]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Hadapt]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[MongoDB and 10gen]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[Parallelization]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4438</guid>
		<description><![CDATA[There&#8217;s been a flurry of announcements recently in the Hadoop world. Much of it has been concentrated on Hadoop data storage and management. This is understandable, since HDFS (Hadoop Distributed File System) is quite a young (i.e. immature) system, with much strengthening and Bottleneck Whack-A-Mole remaining in its future. Known HDFS and Hadoop data storage [...]]]></description>
			<content:encoded><![CDATA[<p>There&#8217;s been a flurry of announcements recently in the Hadoop world. Much of it has been concentrated on Hadoop data storage and management. This is understandable, since HDFS (Hadoop Distributed File System) is quite a young (i.e. immature) system, with much strengthening and <a href="../../../../../2009/08/21/bottleneck-whack-a-mole/">Bottleneck Whack-A-Mole</a> remaining in its future.</p>
<p>Known HDFS and Hadoop data storage and management issues include but are not limited to:</p>
<ul>
<li>Hadoop is run by a master node, and specifically a namenode, that&#8217;s a single point of failure.</li>
<li>HDFS compression could be better.</li>
<li>HDFS likes to store three copies of everything, whereas many DBMS and file systems are satisfied with two.</li>
<li>Hive (the canonical way to do SQL joins and so on in Hadoop) is slow.</li>
</ul>
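<p>To put the three-copies point in rough numbers, here is a minimal Python sketch. The function name and the 100 TB data volume are assumptions made up for illustration, and the arithmetic ignores compression entirely.</p>
<pre>
```python
def raw_capacity_tb(logical_tb, copies):
    """Raw disk needed to hold logical_tb of data stored `copies` times (no compression)."""
    return logical_tb * copies

logical_tb = 100  # hypothetical data volume, in terabytes

three_copies = raw_capacity_tb(logical_tb, 3)  # HDFS-style triple replication
two_copies = raw_capacity_tb(logical_tb, 2)    # a system satisfied with two copies

# Fraction of extra raw disk that triple replication needs relative to two copies.
extra_fraction = (three_copies - two_copies) / two_copies

print(three_copies, two_copies, extra_fraction)
```
</pre>
<p>Under those assumptions, going from two copies to three means half again as much raw disk for the same data.</p>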
<p>Different entities have different ideas about how such deficiencies should be addressed.  <span id="more-4438"></span></p>
<p>For most practical purposes, <strong>Yahoo&#8217;s</strong> and <strong>IBM&#8217;s</strong> views about Hadoop have converged. Yahoo and IBM both believe that Hadoop data storage should be advanced solely through the <strong>Apache</strong> Hadoop open source process. In particular:</p>
<ul>
<li>IBM and Yahoo both talk of the great undesirability of Hadoop &#8220;forking&#8221; like Unix did.</li>
<li>Yahoo appeared on stage at IBM&#8217;s analyst event this week to reinforce the meeting-of-the-minds, even though there&#8217;s no IBM/Yahoo customer relationship involved.</li>
<li>IBM has disclaimed any intention of providing its own Hadoop distribution, but even so is committed to selling lots of <a href="http://www-01.ibm.com/software/data/bigdata/enterprise.html">IBM InfoSphere BigInsights</a>, which incorporates Apache Hadoop.*</li>
<li><a href="http://developer.yahoo.com/blogs/hadoop/posts/2011/01/announcement-yahoo-focusing-on-apache-hadoop-discontinuing-the-yahoo-distribution-of-hadoop/">Yahoo has stopped offering its own Hadoop distribution</a>, period.</li>
</ul>
<p><em>*IBM is emphatic about ruling out marketing terms whose connotation it doesn&#8217;t like. IBM&#8217;s Hadoop distribution isn&#8217;t a &#8220;distribution,&#8221; because that might make it sound too proprietary; IBM&#8217;s Oracle emulation offering <a href="../../../../../2009/04/24/ibms-oracle-emulation-strategy-reconsidered/#comment-118444">isn&#8217;t an &#8220;emulation&#8221; offering</a>, because that might make it sound too slow; and <a href="../../../../../2009/05/13/ibm-system-s-infosphere-streams-processing/">IBM&#8217;s CEP product InfoSphere Streams isn&#8217;t a &#8220;CEP&#8221; product</a>, because that might make it sound too non-functional.</em></p>
<p><strong>Cloudera</strong> can probably be regarded as part of the Yahoo/IBM camp, some stern looks from IBM in Cloudera&#8217;s direction notwithstanding. <a href="../../../../../2010/06/30/cloudera-enterprise-hadoop-evolution/">Cloudera Enterprise</a> &#8212; also an embrace-and-extend offering &#8212; remains the obvious choice for enterprise Hadoop users; meanwhile, nobody has convinced me of any bogosity in <a href="http://www.cloudera.com/hadoop/">the &#8220;no forking&#8221; claim Cloudera makes for its free/open source Hadoop distribution</a>. Indeed, when I visited Cloudera a couple of weeks ago, Mike Olson showed me a slide demonstrating that Cloudera might be supplanting Yahoo as the biggest ongoing contributor to Apache Hadoop.</p>
<p><strong>EMC&#8217;s Data Computing Division, </strong>née <strong>Greenplum,</strong> made a lot of Hadoop noise this week. Unlike Yahoo, IBM, and Cloudera, EMC really is forking Hadoop. <a href="../../../../../2011/04/05/comments-on-emc-greenplum/">I&#8217;m not talking with the EMC/Greenplum folks</a> these days, but the whole thing was covered from various angles by <a href="http://www.computerworld.com/s/article/9216541/EMC_unveils_Hadoop_appliance_BI_software">Lucas Mearian</a>, <a href="http://www.informationweek.com/news/software/info_management/229403178">Doug Henschen</a>, <a href="http://gigaom.com/cloud/emc-hadoop/">Derrick Harris</a>, and <a href="http://davidmenninger.ventanaresearch.com/2011/05/12/emc-enters-elephant-race-with-hadoop/">Dave Menninger</a>.</p>
<p>Another option is to entirely replace HDFS with a DBMS, whether distributed or just instanced at each node. <strong>DataStax</strong> is doing that with <a href="../../../../../2011/03/23/datastax-cassandrafs-hadoop-brisk/">Cassandra-based Brisk</a>; <strong><a href="../../../../../2011/03/23/hadapt-commercialized-hadoopdb/">Hadapt</a></strong> plans to do that with PostgreSQL and VectorWise <em>(edit: As per the comment below, Hadapt only plans a partial replacement of HDFS);</em> and <a href="../../../../../2011/04/17/netezza-twinfin-i-class-overview/">Netezza&#8217;s analytic platform</a> has a Hadoop-over-<strong>Netezza</strong> option as well. Mike Olson objects to such implementations being called &#8220;Hadoop&#8221;; but trademark issues aside, those vendors plan to support a broad variety of Hadoop-compatible tools. <strong>Aster Data</strong> has long taken that approach one step further, by offering an enhanced version of MapReduce &#8212; aka <a href="../../../../../2009/12/02/mapreduce-for-complex-analytics-webina/">SQL/MapReduce</a> &#8212; over its nCluster DBMS. And <a href="../../../../../2011/04/04/the-mongodb-story/"><strong>10gen</strong> offers a more primitive form of MapReduce with MongoDB</a>, but probably wouldn&#8217;t position it as addressing a &#8220;MapReduce market&#8221; at all.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/05/14/hadoop-mapreduce-data-storage-management/feed/</wfw:commentRss>
		<slash:comments>21</slash:comments>
		</item>
		<item>
		<title>In-memory, parallel, not-in-database SAS HPA does make sense after all</title>
		<link>http://www.dbms2.com/2011/04/21/sas-hpa-does-make-sense-after-all/</link>
		<comments>http://www.dbms2.com/2011/04/21/sas-hpa-does-make-sense-after-all/#comments</comments>
		<pubDate>Thu, 21 Apr 2011 08:23:41 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EMC]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Memory-centric data management]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[SAS Institute]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[Workload management]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4343</guid>
		<description><![CDATA[I talked with SAS about its new approach to parallel modeling. The two key points are: SAS no longer plans to go as far with in-database modeling as it previously intended. Rather, SAS plans to run in RAM on MPP DBMS appliances, exploiting MPI (Message Passing Interface). The whole thing is called SAS HPA (High-Performance [...]]]></description>
			<content:encoded><![CDATA[<p>I talked with SAS about its <a href="../../../../../2011/03/13/so-how-many-columns-can-a-single-table-have-anyway/">new approach to parallel modeling</a>. The two key points are:</p>
<ul>
<li><strong>SAS no longer plans to go as far with in-database modeling as it previously intended.</strong></li>
<li>Rather, <strong>SAS plans to run in RAM on MPP DBMS appliances,</strong> exploiting MPI (Message Passing Interface).</li>
</ul>
<p>The whole thing is called SAS HPA (High-Performance Analytics), in an obvious reference to HPC (High-Performance Computing). It will run initially on RAM-heavy appliances from Teradata and EMC Greenplum.</p>
<p>A lot of what&#8217;s going on here is that SAS found it annoyingly difficult to parallelize modeling within the framework of a massively parallel DBMS such as Teradata. Notes on that aspect include:</p>
<ul>
<li><strong>SAS wasn&#8217;t exploiting the capabilities of individual DBMS to their fullest;</strong> rather, it was looking for an approach that would work across multiple brands of DBMS. Thus, for example, the fact that Aster&#8217;s analytic platform architecture is more flexible or powerful than Teradata&#8217;s didn&#8217;t help much with making SAS run within the Aster nCluster database.</li>
<li>Notwithstanding everything else, <strong>SAS did make a certain set of modeling procedures run in-database.</strong></li>
<li><strong>SAS&#8217; previous plans to run in-database modeling in Aster and/or Netezza DBMS may never come to fruition.</strong></li>
</ul>
<p><span id="more-4343"></span>SAS&#8217; problems developing in-database modeling stem from, in essence, the limitations of UDFs (User Defined Functions). So why weren&#8217;t, for example, <a href="../../../../../2009/08/02/teradata-13-focuses-on-advanced-analytic-performance/">Teradata&#8217;s 2009 enhancements to its UDF capabilities</a> enough? The clearest example SAS gave me is that, while <a href="../../../../../2011/03/13/so-how-many-columns-can-a-single-table-have-anyway/">database tables are commonly limited to something on the order of 1000 columns</a> (their figure as well as mine), SAS might need 50,000-100,000 columns. One reason seems to be interactions between variables; SAS used the word &#8220;multiplied&#8221; a few times, but even so was coy about whether this could simply be regarded as quadratic terms in a regression. Another reason seems to be that in some cases, every value in a column spawns a new column in an intermediate table/array; indeed, this seems to be going on in the previously discussed case of <a href="../../../../../2011/04/06/so-can-logistic-regression-be-parallelized-or-not/">logistic regression</a>.</p>
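<p>To see how column counts can balloon that fast, here is a back-of-the-envelope Python sketch of the two effects just described. The function names and the input figures (400 variables; 900 categorical columns with 60 distinct values each) are assumptions made up for illustration, not anything SAS disclosed, and the interaction count covers only pairwise products, not squared terms.</p>
<pre>
```python
from math import comb

# Rough table-width limit cited above; everything else here is hypothetical.
TABLE_COLUMN_LIMIT = 1000

def columns_with_pairwise_interactions(base_columns):
    """Base columns plus one product column for every unordered pair of them."""
    return base_columns + comb(base_columns, 2)

def columns_after_value_expansion(distinct_values_per_column):
    """Every distinct value in a column spawns its own column in the intermediate array."""
    return sum(distinct_values_per_column)

# 400 variables fit comfortably in one table, but their pairwise interactions do not.
pairwise = columns_with_pairwise_interactions(400)

# 900 categorical columns, each with 60 distinct values (hypothetical).
expanded = columns_after_value_expansion([60] * 900)

print(pairwise, expanded)
```
</pre>
<p>Either effect alone takes a table that fits within the usual column limit to tens of thousands of columns, which is the order of magnitude SAS was talking about.</p>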
<p>SAS code will be launched by the DBMS/data warehouse appliances, so potentially it can run under their native workload management. Teradata presumably has enough workload management richness to exploit that; EMC Greenplum, as of my August 2010 notes, probably did not.</p>
<p>SAS was gracious enough to let me post its slide deck, in both <a href="http://www.monash.com/uploads/SAS_HPA_2011-Shorter.pdf">shorter</a> and <a href="http://www.monash.com/uploads/SAS_HPA_2011-Longer.pdf">longer</a> versions. Due to a technical glitch during the call, I neither looked at the slides nor took notes. I think the biggest loss from those difficulties is that I didn&#8217;t learn what the futures at the end of the longer deck were all about.</p>
<p><strong><em>Related links</em></strong></p>
<ul>
<li><a href="http://www.dbms2.com/2011/04/21/application-areas-for-sas-hpa/">Application areas for SAS HPA</a> (April, 2011)</li>
<li><a href="../../../../../2010/05/15/further-clarifying-in-database-mpp-sas/">SAS&#8217; MPP story as of May, 2010</a></li>
<li><a href="../../../../../2007/10/10/sas-goes-mpp-on-teradata-first/">SAS&#8217; plans to run in-database on Teradata</a> (October, 2007)</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/04/21/sas-hpa-does-make-sense-after-all/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Netezza TwinFin i-Class overview</title>
		<link>http://www.dbms2.com/2011/04/17/netezza-twinfin-i-class-overview/</link>
		<comments>http://www.dbms2.com/2011/04/17/netezza-twinfin-i-class-overview/#comments</comments>
		<pubDate>Sun, 17 Apr 2011 13:59:59 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[GIS and geospatial]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4315</guid>
		<description><![CDATA[I have long complained about difficulties in discussing Netezza&#8217;s TwinFin i-Class analytic platform. But I&#8217;m ready now, and in the grand sweep of the product&#8217;s history I&#8217;m not even all that late. The Netezza i-Class timing story goes something like this: Netezza i-Class was first foreshadowed in February, 2010. Netezza i-Class customer testing started in [...]]]></description>
			<content:encoded><![CDATA[<p>I have long complained about <a href="../../../../../2010/10/10/it-can-be-hard-to-analyze-analytics/">difficulties</a> in discussing Netezza&#8217;s TwinFin i-Class analytic platform. But I&#8217;m ready now, and in the grand sweep of the product&#8217;s history I&#8217;m not even all that late. The Netezza i-Class timing story goes something like this:</p>
<ul>
<li><a href="../../../../../2010/02/22/netezza-twinfin/">Netezza i-Class was first foreshadowed in February, 2010</a>.</li>
<li>Netezza i-Class customer testing started in October, 2010 or so. Netezza i-Class evidently has been shipped to 4-5 partners and a single-digit number of end-user organizations, spread across some usual-suspect industries (financial services, telecom, and so on).</li>
<li>Netezza i-Class 1.0 general availability is still in the (near) future.</li>
</ul>
<p>My advice to Netezza as to how it should describe TwinFin i-Class boils down to:  <span id="more-4315"></span></p>
<blockquote><p>1.  The Netezza platform has been enhanced in two major ways:</p>
<ul>
<li>There&#8217;s a good way to run all kinds of analytic processes. This is very flexible and powerful, but tightly integrated with the SQL engine even so.</li>
<li>You are supplying some specific high-performing, highly parallel, big-data analytic process building blocks. More precisely, you have greatly extended the set of such building blocks; you had some cool building blocks (notably Spatial) even before this.</li>
</ul>
<p>2.   There are three main ways to get at this:</p>
<ul>
<li>Extended SQL.</li>
<li>Programming, in a bunch of languages and paradigms, integrated into the SQL.</li>
<li>Partner code, with them doing the programming for you.</li>
</ul>
</blockquote>
<p>Some of the rah-rah words aside, that&#8217;s a pretty fair overview. Here&#8217;s more detail.</p>
<p>To refresh your memory: <strong>Netezza TwinFin i-Class functionality basics</strong> include, as best I can tell (and there&#8217;s some more detail at the links above):</p>
<ul>
<li>You can run processes in a usual-suspect set of languages on      Netezza i-Class (even Fortran).</li>
<li>One notable example is R; indeed, there&#8217;s an R client for talking      to Netezza TwinFin.</li>
<li>Netezza provides its own Hadoop implementation, which differs from      standard Hadoop implementations most notably in that it manages data      relationally via the usual Netezza DBMS, not in anything like HDFS.</li>
<li>Anything written in any language except C/C++ (or of course SQL)      &#8212; and in particular anything involving Hadoop &#8212; runs out-of-process      versus the Netezza DBMS. C/C++ can run in-process, for maximum      performance.*</li>
<li>There&#8217;s an assortment of parallelized mathematical analytic packages      built into Netezza i-Class. The matrix algebra ones are called nzMatrix. Most      of the rest are part of a collection called nzAnalytics. Often these are      implemented as stored procedures, as they may make multiple passes through      the data.</li>
<li>Netezza has thoughtfully ported thousands of analytic procedures      for you to the Netezza platform (in essence, the basic R/CRAN and GNU      libraries). These are not promised to be parallel on their own, but you&#8217;re      welcome to invoke an instance on each node and parallelize that way.</li>
</ul>
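<p>The last bullet, running one instance of a non-parallel routine per node, can be sketched in a few lines of Python. Everything here is an assumption made for illustration: <code>serial_procedure</code> and <code>run_on_each_slice</code> are hypothetical names, the data is invented, and threads merely stand in for nodes. It is not Netezza code and uses no Netezza API.</p>
<pre>
```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def serial_procedure(rows):
    """Stand-in for a ported, non-parallel analytic routine (hypothetical)."""
    return {"n": len(rows), "mean": mean(rows)}


def run_on_each_slice(slices):
    """Run one instance of the serial routine per data slice, concurrently."""
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        return list(pool.map(serial_procedure, slices))


# Invented data, already split the way a distributed table might be.
slices = [[1.0, 2.0, 3.0], [10.0, 20.0], [5.0]]
partials = run_on_each_slice(slices)

# The routine itself knows nothing about the other slices, so combining
# the partial results is the caller's job. A count-weighted mean is one option.
total_n = sum(p["n"] for p in partials)
overall_mean = sum(p["mean"] * p["n"] for p in partials) / total_n
print(partials, overall_mean)
```
</pre>
<p>The point of the sketch is the division of labor: the routine stays serial, and any cross-slice combination step has to be supplied separately.</p>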
<p>I forgot to check, but I&#8217;m guessing any extension of workload management to cover non-DBMS processes won&#8217;t be in the first release of Netezza i-Class.</p>
<p><em>*However, Netezza says that if you can batch requests to return even just 500-1000 records at a time, the out-of-process performance penalty &#8212; which is based on wait time for transferring data between processes &#8212; becomes insignificant.</em></p>
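<p>The arithmetic behind that batching claim can be illustrated with a toy cost model. The figures below (200 microseconds of wait per transfer, 5 microseconds of work per record) and the function names are invented for illustration; they are not Netezza measurements.</p>
<pre>
```python
# Toy cost model. Every number here is a made-up assumption, not a measurement.

def per_record_cost_us(batch_size, transfer_wait_us=200.0, work_per_record_us=5.0):
    """Average cost per record when one transfer carries batch_size records.

    batch_size is assumed to be a positive integer.
    """
    # The fixed wait is paid once per batch, so every record in it shares the cost.
    return transfer_wait_us / batch_size + work_per_record_us


def overhead_fraction(batch_size, transfer_wait_us=200.0, work_per_record_us=5.0):
    """Share of the per-record cost that is transfer wait rather than useful work."""
    total = per_record_cost_us(batch_size, transfer_wait_us, work_per_record_us)
    return (transfer_wait_us / batch_size) / total


for size in (1, 10, 100, 500, 1000):
    print(size, round(overhead_fraction(size), 3))
```
</pre>
<p>Under these assumed figures the wait dominates at one record per transfer and shrinks to a few percent of the total at 500-1000 records per transfer. The real crossover depends on the actual wait and work times.</p>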
<p>None of that is particularly new information. But after a visit to Netezza on Tuesday, I&#8217;ve finally gotten some kind of handle on how i-Class is architected. <strong>Highlights of the Netezza i-Class architecture story,</strong> as I understand them, include:</p>
<ul>
<li>It all starts with UDtFs &#8212; User-Defined (table) Functions, which      are subject to the usual limitations.</li>
<li>To <strong>overcome the standard      limitations of UDtFs,</strong> Netezza built:
<ul>
<li>A set of UDtFs that, taken together, execute command-line       programs.</li>
<li>For each language (Java, Python, R, etc., and I think also C/C++),       a library that talks to the command-line executor. This library can talk       to multiple instances of the executor, so it&#8217;s not limited to a single       data stream. Similarly, it can persist past the life of a query.</li>
</ul>
</li>
<li>Similarly, Netezza built a C/C++ library that talks to the      command-line executor and also talks MPI (Message Passing Interface).
<ul>
<li>This has not yet been exposed outside Netezza.</li>
<li>Rather, MPI is used by nzMatrix, so that nzMatrix can invert (for       example) really, really big matrices.</li>
</ul>
</li>
<li>There are two* main ways to invoke all this.
<ul>
<li><strong>SQL.</strong> Any analytic process can be invoked via a SQL       UDtF. Netezza tends to use the term <strong>UDAP       (User-Defined Analytic Process)</strong> interchangeably for the process       itself and for the SQL UDtF that encapsulates it.</li>
<li>Netezza&#8217;s (interfaces to an) <strong>R</strong> client. More on that below.</li>
</ul>
</li>
<li>Netezza&#8217;s version of <strong>Hadoop</strong> is an important special case. The mappers and reducers you write in Hadoop are UDAPs.</li>
</ul>
<p>I didn&#8217;t delve far enough into Netezza&#8217;s UDAP syntax to understand how it compares to, say, <a href="../../../../../2009/10/15/mapreduce-webinar-slides/">Aster&#8217;s SQL/MR</a>.</p>
<p><em>*From a marketing standpoint, Netezza might prefer to count partner code separately as a third way, but I&#8217;m focusing on the technology here, which is used by partners and end-user organizations alike.</em></p>
<p>Other Netezza/Hadoop notes include:</p>
<ul>
<li>Netezza has the usual kind of <a href="../../../../../2010/10/10/partnering-with-cloudera/">Cloudera      partnership</a>.</li>
<li>Since Netezza&#8217;s owner IBM has a Hadoop implementation, it seems obvious there will      be some partnership action with that too. But at this point it&#8217;s not so      far along.</li>
</ul>
<p>The Netezza TwinFin i-Class R story goes something like this:</p>
<ul>
<li>Assume you&#8217;re using R on a client. (I&#8217;m not sure whether Netezza      has an R client to give or recommend to you.)</li>
<li>There are three Netezza packages that change how R works, by      letting it use stuff on the Netezza box.
<ul>
<li>nzR translates between logical R memory structures and Netezza       tables. In particular, nzR allows R to run, not just in-memory, but       against the data on the Netezza box.</li>
<li>nzMatrix lets you do R matrix algebra against the data on the       Netezza box.</li>
<li>nzAnalytics lets you invoke various algorithms that run on the       Netezza box, against Netezza data.</li>
</ul>
</li>
</ul>
<p>A recently announced Netezza partnership with <a href="../../../../../2011/04/08/revolution-analytics-update/">Revolution Analytics</a> is meant to lead to Revolution replacing Netezza&#8217;s ports of R libraries with its own preferred distribution, and then supporting same.</p>
<p>Finally, there&#8217;s Netezza Spatial.</p>
<ul>
<li>Netezza claims multiple orders of magnitude of performance      advantage for Netezza Spatial vs. geospatial alternatives, which is always      a nice thing to be able to say.</li>
<li>Generally, <a href="../../../../../2008/09/23/peter-batty-on-netezza-spatial/">Netezza      Spatial</a> is now regarded as being part of i-Class.</li>
<li>However, the product timing and adoption comments above don&#8217;t      apply to Netezza Spatial.</li>
<li>Netezza Spatial has a couple of dedicated salespeople, and seems      to be well-liked by retailers.</li>
<li>Netezza surely wishes everybody would forget about some of the <a href="../../../../../2010/10/03/notes-and-links-october-3-2010/">rewrites and controversy</a> associated with Netezza Spatial.</li>
</ul>
<p>Perhaps there are yet more pieces of the Netezza TwinFin i-Class story I&#8217;m overlooking, but I hope I now have most of the major aspects at least partway right.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/04/17/netezza-twinfin-i-class-overview/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
	</channel>
</rss>

