<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; EAI, EII, ETL, ELT, ETLT</title>
	<atom:link href="http://www.dbms2.com/category/data-integration-middleware/data-etl-etl-etlt-eai-eii/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Wed, 08 Feb 2012 22:51:11 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>Departmental analytics &#8212; best practices</title>
		<link>http://www.dbms2.com/2012/01/25/departmental-analytics-best-practices/</link>
		<comments>http://www.dbms2.com/2012/01/25/departmental-analytics-best-practices/#comments</comments>
		<pubDate>Wed, 25 Jan 2012 16:47:59 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Data mart outsourcing]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5867</guid>
		<description><![CDATA[I believe IT departments should support and encourage departmental analytics efforts, where &#8220;support&#8221; and &#8220;encourage&#8221; are not synonyms for &#8220;control&#8221;, &#8220;dominate&#8221;, &#8220;overwhelm&#8221;, or even &#8220;tame&#8221;. A big part of that is: Let, and indeed help, departments have the data they want, when they want it, served with blazing performance. Three things that absolutely should NOT [...]]]></description>
			<content:encoded><![CDATA[<p><a href="../../../../../2012/01/23/departmental-analytics-general-observations/">I believe IT departments should support and encourage departmental analytics efforts</a>, where &#8220;support&#8221; and &#8220;encourage&#8221; are not synonyms for &#8220;control&#8221;, &#8220;dominate&#8221;, &#8220;overwhelm&#8221;, or even &#8220;tame&#8221;. A big part of that is:<br />
<strong>Let, and indeed help, departments have the data they want, when they want it, served with blazing performance.</strong></p>
<p>Three things that absolutely should NOT be obstacles to these ends are:</p>
<ul>
<li>Corporate DBMS standards.</li>
<li>Corporate data governance processes.</li>
<li>The difficulties of ETL.</li>
</ul>
<p><span id="more-5867"></span>Reasons they shouldn&#8217;t or don&#8217;t need to be obstacles include:</p>
<ul>
<li>Analytic DBMS are often vastly more cost-effective than general-purpose ones.</li>
<li>In particular, analytic DBMS are often much easier to install and manage than general-purpose ones.</li>
<li>Heavy data governance bureaucracy is often unnecessary because:
<ul>
<li>The department should know what the limitations on the data&#8217;s accuracy are.</li>
<li>The department should know how much data accuracy is required.</li>
<li>The side-effects on other departments of any data inaccuracy would be minimal.</li>
</ul>
</li>
<li>There are multiple good schemes for populating data marts, managed by cost-effective analytic DBMS, with data from integrated data warehouses.
<ul>
<li>ELT (Extract/Load/Transform) almost always works, because data cleaning/data quality was handled at or before the IDW level, and because the analytic DBMS has the processing power to pull it off.</li>
<li>ETL (Extract/Transform/Load) should be easy as well. (If it isn&#8217;t, something may be lacking in your ETL set-up.)</li>
<li>Analytic DBMS are increasingly adding capabilities for easy spin-out of real or virtual data marts. Other kinds of technology (e.g. virtualization) are having their database spin-out capabilities upgraded as well.</li>
</ul>
</li>
</ul>
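<p>To make the ELT point concrete, here is a minimal sketch rather than a recommended implementation. It uses Python&#8217;s built-in sqlite3 module as a stand-in for an analytic DBMS, and the table names (idw_sales_extract, mart_sales_by_region) and sample rows are hypothetical. The rows are loaded exactly as extracted, and the transform is then a single SQL statement run inside the mart&#8217;s own engine.</p>

```python
import sqlite3

# Hypothetical rows as they might arrive from an integrated data warehouse (IDW).
# Names and values are illustrative assumptions, not taken from any real system.
EXTRACTED_ROWS = [
    ("2012-01-03", "east", 120.0),
    ("2012-01-03", "west", 80.0),
    ("2012-01-04", "east", 50.0),
]


def run_elt(rows):
    """Load rows untransformed, then transform them inside the database."""
    conn = sqlite3.connect(":memory:")  # stand-in for an analytic DBMS
    # Load: copy the extract as-is into a staging table.
    conn.execute(
        "CREATE TABLE idw_sales_extract (sale_date TEXT, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO idw_sales_extract VALUES (?, ?, ?)", rows)
    # Transform: let the database engine do the aggregation work.
    conn.execute(
        """
        CREATE TABLE mart_sales_by_region AS
        SELECT region, SUM(amount) AS total_amount, COUNT(*) AS sale_count
        FROM idw_sales_extract
        GROUP BY region
        """
    )
    result = conn.execute(
        "SELECT region, total_amount, sale_count "
        "FROM mart_sales_by_region ORDER BY region"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in run_elt(EXTRACTED_ROWS):
        print(row)
```

<p>The sketch assumes the data was already cleaned at or before the IDW level, which is why no cleansing step appears between the load and the transform.</p>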
<p>One point to remember in support of departmental autonomy <strong>is that departments&#8217; views of what data to use may be more expansive than central IT&#8217;s.</strong> One reason is that important data may be external to the company, outside IT&#8217;s natural realm  of concern. Examples of this include but are hardly limited to:</p>
<ul>
<li>Anything like &#8220;market data&#8221;.</li>
<li>Anything like &#8220;sentiment analysis&#8221;.</li>
<li>Data owned by supply chain partners.</li>
</ul>
<p>Further, even the more innovative internal data sources are commonly departmental, for example various kinds of multi-structured data (text verbatims from customers, log file data, and so on).</p>
<p>Whatever is true of data management (and ETL) is true for metadata management, even if it&#8217;s done by some kind of business intelligence tool. What I mean by that is:</p>
<ul>
<li><strong>Whoever manages data is also responsible for ingesting and emitting it &#8230;</strong></li>
<li>&#8230; and specifically for emitting it in<strong> understandable, well-organized, well-named formats, &#8230;</strong></li>
<li><strong>&#8230; </strong>so that <strong>departments can take responsibility for</strong> what amounts to <strong>lightweight analytic application development.</strong></li>
</ul>
<p>As for the &#8220;application development&#8221; itself, I&#8217;m envisioning at least three things:</p>
<ul>
<li>Math.</li>
<li>Sophisticated relational query.</li>
<li>Data visualization.</li>
</ul>
<p>I.e., I&#8217;m talking about what &#8220;analysts&#8221; and &#8220;quants&#8221; do. So to put the point even more simply:</p>
<ul>
<li><strong>Analysts and quants should be able to consume data that&#8217;s organized in a friendly manner.</strong></li>
<li><strong>Central IT should be friendly in how it serves data.</strong></li>
</ul>
<p>One corollary of this approach is that departments should try to adhere to corporate BI standards, at least for routine dashboards and reporting. Indeed, if a department brings in a business intelligence tool different from the corporate standard, there are three main possibilities:</p>
<ul>
<li>The tool is integrated with something else it makes sense to bring in, such as a third-party data supply or application.</li>
<li>The tool has an important capability the corporate standard doesn&#8217;t have, such as more flexible visualization and drilldown.</li>
<li>Central IT screwed up, making things much more difficult than they needed to be.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/01/25/departmental-analytics-best-practices/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Agile predictive analytics &#8212; the &#8220;easy&#8221; parts</title>
		<link>http://www.dbms2.com/2011/11/28/agile-predictive-analytics-the-easy-parts/</link>
		<comments>http://www.dbms2.com/2011/11/28/agile-predictive-analytics-the-easy-parts/#comments</comments>
		<pubDate>Mon, 28 Nov 2011 19:38:58 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5747</guid>
		<description><![CDATA[I&#8217;m hearing a lot these days about agile predictive analytics, albeit rarely in those exact terms. The general idea is unassailable, in that it boils down to using data as quickly as reasonably possible. But discussing particulars is hard, for several reasons: Pundits tend to sketch castles in the air. Vendors tend to confuse part [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m hearing a lot these days about <strong>agile predictive analytics</strong>, albeit rarely in those exact terms. The general idea is unassailable, in that it boils down to <strong>using data as quickly as reasonably possible.</strong> But discussing particulars is hard, for several reasons:</p>
<ul>
<li><a href="http://www.column2.com/2011/11/agile-predictive-process-platforms-for-business-agility-with-jameskobielus/">Pundits tend to sketch castles in the air</a>.</li>
<li>Vendors tend to confuse part of the story &#8212; generally the part they happen to offer <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  &#8212; with the whole.</li>
<li>Different use cases give rise to different kinds of issues.</li>
</ul>
<p>At least three of the generic arguments for agility apply to predictive analytics:</p>
<ul>
<li>Doing the correct thing soon is usually better than doing the same correct thing later.</li>
<li>If it doesn&#8217;t take much time to do something, hopefully it doesn&#8217;t take that much expense (labor and so on) either.</li>
<li>It&#8217;s hard to get new stuff completely right on the first try. Often, the best strategy is to come close fast, then fix what&#8217;s still not ideal.</li>
</ul>
<p>But the reasons to want agile predictive analytics don&#8217;t stop there.</p>
<p><span id="more-5747"></span>Not only is it hard to get stuff right on the first try for a given information set, but the available information can also change quickly. For example:</p>
<ul>
<li>If you&#8217;re a consumer marketer, consumer tastes can change quickly, due to news (of many different kinds), seasonal trends, and so on. The most recent data you have contain information unavailable in your historical data sets. Also &#8230;</li>
<li>&#8230; if you change your offers, prices, ad placement, ad text, ad appearance, call center scripts, or anything else, you immediately gain new information that isn&#8217;t well-reflected in your previous models.</li>
<li>If you&#8217;re in capital markets, and you figure something out, probably so will rival investors. So whatever you knew three weeks ago may already be partially obsolete.</li>
</ul>
<p>What&#8217;s more, often you deliberately don&#8217;t want to test, model, or tune all your variables at once. First you determine whether the ad text should be &#8220;Would you be so kind as to allow us to supply you with our wares?&#8221; or &#8220;Buy it, dude!&#8221;; only afterwards do you decide whether the color scheme should rely on red or green.</p>
<p>With that as backdrop, how can you make your predictive analytics more agile? Let&#8217;s start by breaking predictive analytics into four pieces:</p>
<ul>
<li><a href="http://www.dbms2.com/2011/11/28/terminology-data-mustering/">Data mustering</a> for the analysts.</li>
<li>Actual analysis.</li>
<li>Data mustering for deployment.</li>
<li>Deployment.</li>
</ul>
<p><strong>Only the second of those has much excuse for being an agility bottleneck;</strong> the other three are well addressed by technology you can buy (or straightforwardly build) today.</p>
<p>The deployment part of the story can be pretty simple, at least technically &#8212; spit out some PMML (Predictive Modeling Markup Language), and if you&#8217;re deploying to a DBMS with good enough PMML support, you&#8217;re good to go. Any vendor who doesn&#8217;t offer that degree of simplicity had better be working toward it fast. That said, your applications that are infused with predictive analytics need to be modular enough to accommodate model changes; if not, some refactoring lies ahead. And the same can be said for the work processes that surround them.</p>
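<p>As a rough illustration of how small a PMML hand-off can be, here is a sketch that writes a linear regression as a PMML document using only Python&#8217;s standard library. The function name, field names, coefficients, and the choice of PMML 4.4 are all assumptions for illustration, and the output has not been validated against the PMML schema. In practice a modeling tool would export this for you, and the target DBMS determines which PMML versions and model types it accepts.</p>

```python
import xml.etree.ElementTree as ET

PMML_NS = "http://www.dmg.org/PMML-4_4"  # assumed PMML version


def regression_to_pmml(target, intercept, coefficients):
    """Serialize a linear regression (hypothetical inputs) as a PMML string."""
    root = ET.Element("PMML", {"version": "4.4", "xmlns": PMML_NS})
    ET.SubElement(root, "Header", {"description": "illustrative sketch"})

    # Declare every field the model reads or predicts.
    dictionary = ET.SubElement(root, "DataDictionary")
    for name in list(coefficients) + [target]:
        ET.SubElement(
            dictionary,
            "DataField",
            {"name": name, "optype": "continuous", "dataType": "double"},
        )

    model = ET.SubElement(root, "RegressionModel", {"functionName": "regression"})
    schema = ET.SubElement(model, "MiningSchema")
    for name in coefficients:
        ET.SubElement(schema, "MiningField", {"name": name})
    ET.SubElement(schema, "MiningField", {"name": target, "usageType": "target"})

    # One table holds the intercept plus one predictor per input field.
    table = ET.SubElement(model, "RegressionTable", {"intercept": str(intercept)})
    for name, coef in coefficients.items():
        ET.SubElement(
            table, "NumericPredictor", {"name": name, "coefficient": str(coef)}
        )
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(regression_to_pmml("spend", 1.5, {"visits": 0.25, "age": -0.1}))
```

<p>Because the model is just a document, swapping in a retrained model means replacing a file or a table row, not rewriting application code. That is the kind of modularity the surrounding applications need in order to take advantage of it.</p>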
<p>The data mustering parts should be pretty straightforward too. Setting up a relational data mart tuned for <a href="http://www.dbms2.com/2011/03/03/investigative-analytics/">investigative analytics</a> isn&#8217;t all that hard or costly (perhaps unless your database is enormously large), and the same actually goes for a Hadoop cluster. Beyond that, if you can model and deploy from the same database, that&#8217;s great; if not, you have an ETL (Extract/Transform/Load) need. I guess you could have data quality/MDM (Master Data Management) issues as well, but offhand I&#8217;m not seeing why you wouldn&#8217;t push their solutions back to analysis time. And any decent analytic technology stack can give sub-hour latency; <a href="http://www.dbms2.com/2009/09/10/analytic-speed-latency/">while that may not suffice from all standpoints</a>, it&#8217;s plenty fast enough for analysis-time agility.</p>
<p>With those preliminaries out of the way, now let&#8217;s turn to <a href="http://www.dbms2.com/2011/11/28/agile-predictive-analytics-the-heart-of-the-matter/">the heart of the agile predictive analytics challenge</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/28/agile-predictive-analytics-the-easy-parts/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>QlikView 11 and the rise of collaborative BI</title>
		<link>http://www.dbms2.com/2011/11/16/qlikview-collaborative-business-intelligence/</link>
		<comments>http://www.dbms2.com/2011/11/16/qlikview-collaborative-business-intelligence/#comments</comments>
		<pubDate>Wed, 16 Nov 2011 13:19:52 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[QlikTech and QlikView]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5681</guid>
		<description><![CDATA[QlikView 11 came out last month. Let me start by pointing out: As one might expect, QlikView 11 contains fairly leading-edge stuff, but also some &#8220;better late than never&#8221; features. The leading-edge stuff is concentrated in the general area of &#8220;collaboration&#8221;. Additionally, QlikTech is always pushing the QlikView user interface ahead in various ways. The [...]]]></description>
			<content:encoded><![CDATA[<p>QlikView 11 came out last month. Let me start by pointing out:</p>
<ul>
<li>As one might expect, QlikView 11 contains fairly leading-edge stuff, but also some &#8220;better late than never&#8221; features.</li>
<li>The leading-edge stuff is concentrated in the general area of &#8220;collaboration&#8221;.</li>
<li>Additionally, QlikTech is always pushing the QlikView user interface ahead in various ways.</li>
<li>The &#8220;Well, it&#8217;s about time!&#8221; feature list starts with the ability to load QlikView via third-party ETL tools (Informatica now, others coming).</li>
<li>QlikTech is generally good at putting up pretty pictures of its product. You can find some in the &#8220;What&#8217;s New in QlikView 11&#8221; document via a general <a href="http://www.qlikview.com/us/explore/resources/brochures-datasheets?language=english&amp;page=1">QlikView resource page</a>.*</li>
<li>Stephen Swoyer wrote <a href="http://tdwi.org/articles/2011/11/01/QlikView-Update-Enterprise-Makeover.aspx">a good article summarizing QlikView 11</a>.</li>
</ul>
<p><em>*One confusing aspect of that paper: non-standard uses of the terms &#8220;analytic app&#8221; and &#8220;document&#8221;.</em></p>
<p>As QlikTech tells it, QlikView 11 adds two kinds of collaboration features:</p>
<ul>
<li>Integration with social media, which QlikTech calls &#8220;asynchronous integration.&#8221;</li>
<li>Direct sharing of the QlikView UI, which QlikTech calls &#8220;synchronous integration.&#8221;</li>
</ul>
<p>I&#8217;d add a third kind, because QlikView 11 also takes some baby steps toward what I regard as a key aspect of BI collaboration &#8212; the ability to define and track your own metrics. It&#8217;s way, way short of what <a href="../../../../../2010/07/25/alerts-metrics-dashboards/">I called for in metric flexibility in a post last year</a>, but at least it&#8217;s a small start.</p>
<p><span id="more-5681"></span>That <strong>direct sharing of user interfaces is a cool feature, which every business intelligence vendor should offer. </strong>In an era of distributed workforces, when people can&#8217;t be assumed able to huddle around the same desk, it has value even for use among close coworkers. But it also should prove useful in a variety of more naturally remote use cases, multiple examples of which can be found in each of the areas of:</p>
<ul>
<li>Support (internal or external).</li>
<li>Faceoffs &#8212; I mean collaborations <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  &#8212; between two or more enterprise departments. Examples might include: manufacturing and purchasing, manufacturing and sales, or accounting and anybody else.</li>
</ul>
<p>As for <strong>social media being used for BI collaboration</strong> &#8212; that&#8217;s generally in the air. For example:</p>
<ul>
<li><a href="http://www.texttechnologies.com/2011/09/14/social-technology-in-the-enterprise/">salesforce.com is pushing enterprise social media use broadly</a>, and will surely increase its emphasis on the social media/BI intersection now that Dave Kellogg is there.</li>
<li>Spotfire has announced similar features in its latest release.</li>
<li>The more cumbersome side of the feature set (portal-based collaboration, emailing of individual reports) has been available from multiple vendors for years.</li>
<li>eBay open-sourced a more dataset-centric version of the idea, just as Oliver Ratzesberger left the firm.*</li>
</ul>
<p><em>*Umm &#8212; does anybody have a link to the project, or at least a name for it? <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </em></p>
<p>BI has been a communication tool since the first green paper report was dumped on the first desk. And there&#8217;s been collaboration in doing analysis at least since it&#8217;s been possible to email .XLS file attachments. Still<strong>, BI is too often used as bludgeon rather than binocular. Hopefully, the current generation of technology will finally serve to change that.</strong></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/16/qlikview-collaborative-business-intelligence/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>MarkLogic&#8217;s Hadoop connector</title>
		<link>http://www.dbms2.com/2011/11/03/marklogic-hadoop-connector/</link>
		<comments>http://www.dbms2.com/2011/11/03/marklogic-hadoop-connector/#comments</comments>
		<pubDate>Fri, 04 Nov 2011 00:58:06 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Clustering]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Workload management]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5585</guid>
		<description><![CDATA[It&#8217;s time to circle back to a subject I skipped when I otherwise wrote about MarkLogic 5: MarkLogic&#8217;s new Hadoop connector. Most of what&#8217;s confusing about the MarkLogic Hadoop Connector lies in two pairs of options it presents you: Hadoop can talk XQuery to MarkLogic. But alternatively, Hadoop can use a long-established simple(r) Java API [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s time to circle back to a subject I skipped when I otherwise wrote about <a href="http://www.dbms2.com/2011/11/01/marklogic-version-5/">MarkLogic 5</a>: MarkLogic&#8217;s new Hadoop connector.</p>
<p>Most of what&#8217;s confusing about the MarkLogic Hadoop Connector lies in two pairs of options it presents you:</p>
<ul>
<li>Hadoop can talk XQuery to MarkLogic. But alternatively, Hadoop can use a long-established simple(r) Java API for streaming documents into or out of a MarkLogic database.</li>
<li>Hadoop can make requests to MarkLogic in MarkLogic&#8217;s normal mode of operation, namely to address any node in the MarkLogic cluster, which then serves as a &#8220;head&#8221; node for the duration of that particular request. But alternatively, Hadoop can use a long-standing MarkLogic option to circumvent the whole DBMS cluster and only talk to one specific MarkLogic node.</li>
</ul>
<p>Otherwise, the whole thing is just what you would think:</p>
<ul>
<li>Hadoop can read from and write to MarkLogic, in parallel at both ends.</li>
<li>If Hadoop is just writing to MarkLogic, there&#8217;s a good chance the process is properly called &#8220;ETL.&#8221;</li>
<li>If Hadoop is reading a lot from MarkLogic, there&#8217;s a good chance the process is properly called &#8220;batch analytics.&#8221;</li>
</ul>
<p>MarkLogic said that it wrote this Hadoop connector itself.</p>
<p><span id="more-5585"></span>When I realized MarkLogic was claiming the ability to seamlessly integrate short-request and batch analytic processing, I asked about workload management. I gathered that:</p>
<ul>
<li>MarkLogic believes that MarkLogic 5 does a great job of granular workload monitoring.</li>
<li>However, MarkLogic doesn&#8217;t have a strong workload management administrative interface. Rather, you may have to do workload management programmatically.</li>
</ul>
<p>Overall, I think the MarkLogic Hadoop connector could prove pretty useful. The first question I ask somebody who wants to process relational data in Hadoop is &#8220;Why not just an analytic RDBMS?&#8221; But the natural use cases for MarkLogic are often ones in which you might as well do your analytics in Hadoop, including a 4 billion Word/PDF/image document insurance-industry example I recently encountered, and for which <a href="../../../../../2011/10/10/text-data-management-part-2-general-and-short-request/">I favor MarkLogic over MongoDB or straight Hadoop alike</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/03/marklogic-hadoop-connector/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Where Datameer is positioned</title>
		<link>http://www.dbms2.com/2011/10/25/where-datameer-is-positioned/</link>
		<comments>http://www.dbms2.com/2011/10/25/where-datameer-is-positioned/#comments</comments>
		<pubDate>Tue, 25 Oct 2011 21:53:31 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Datameer]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Hadoop]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5514</guid>
		<description><![CDATA[I&#8217;ve chatted with Datameer a couple of times recently, mainly with CEO Stefan Groschupf, most recently after XLDB last Tuesday. Nothing I learned greatly contradicts what I wrote about Datameer 1 1/2 years ago.  In a nutshell, Datameer is designed to let you do simple stuff on large amounts of data, where &#8220;large amounts of [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve chatted with Datameer a couple of times recently, mainly with CEO Stefan Groschupf, most recently after XLDB last Tuesday. Nothing I learned greatly contradicts <a href="http://www.dbms2.com/category/products-and-vendors/datameer/">what I wrote about Datameer 1 1/2 years ago</a>.  In a nutshell, Datameer is designed to let you do simple stuff on large amounts of data, where &#8220;large amounts of data&#8221; typically means data in Hadoop, and &#8220;simple stuff&#8221; includes basic versions of a spreadsheet, of BI, and of EtL (Extract/Transform/Load, without much in the way of T).</p>
<p>Stefan reports that these capabilities are appealing to a significant fraction of enterprise or other commercial Hadoop users, especially the EtL and the BI. I don&#8217;t doubt him.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/10/25/where-datameer-is-positioned/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Eight kinds of analytic database (Part 2)</title>
		<link>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/</link>
		<comments>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/#comments</comments>
		<pubDate>Tue, 05 Jul 2011 08:18:18 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Archiving and information preservation]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Buying processes]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Complex event processing (CEP)]]></category>
		<category><![CDATA[Data mart outsourcing]]></category>
		<category><![CDATA[Data types]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MOLAP]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Rainstor]]></category>
		<category><![CDATA[SAND Technology]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[SenSage]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4867</guid>
		<description><![CDATA[In Part 1 of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I&#8217;ll cover four more kinds of analytic database &#8212; even newer, for the most part, with a use case/product short list [...]]]></description>
			<content:encoded><![CDATA[<p>In <a href="http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/">Part 1</a> of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I&#8217;ll cover four more kinds of analytic database &#8212; even newer, for the most part, with a use case/product short list match that is even less clear.  <span id="more-4867"></span></p>
<p><strong><em>Bit bucket</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included: </em>Logs, other technical/external</li>
<li><em>Likely use styles:</em> Staging/ETL, investigative</li>
<li><em>Canonical example: </em>Log files in a Hadoop cluster</li>
<li><em>Stresses:</em> TCO, scale-out, transform/big-query performance, ETL functionality</li>
</ul>
<p>With the explosion of <a href="../../../../../2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a> has come the need for a place to put it all, sometimes called the <a href="../../../../../2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a>. This is like the investigative data mart for big databases, but more <a href="../../../../../2011/05/17/poly-structured-database/">poly-structured</a>. In some cases it is focused on data staging and transformation; but it can also be used for analysis in place.</p>
<p>The list of candidate technologies to run your bit bucket starts with Hadoop and Splunk.</p>
<p><strong><em>Archival data store</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included: </em>Operational, CDR (call detail record), security log</li>
<li><em>Likely use styles:</em> Archival, reporting (for compliance), possibly also investigative</li>
<li><em>Examples:</em> Any long-term detailed historical store</li>
<li><em>Stresses: </em>TCO, compression, scale-out, performance (if multi-use)</li>
</ul>
<p>Analytic DBMS vendors have been insulting each other with the claim &#8220;that&#8217;s just an archival data store,&#8221; dating back at least to the first time Greenplum was deployed on an underpowered Sun Thumper system. Perhaps only <a href="../../../../../2010/06/11/rainstor-update/">Rainstor</a> truly embraces the archival positioning, and I&#8217;ve become pretty dubious about their technical claims and their company alike.</p>
<p>Still, there&#8217;s a legitimate need for data stores, especially relational analytic DBMS, that:</p>
<ul>
<li>Store data cheaply, with high rates of compression.</li>
<li>Have decent performance if you do want to query the data.</li>
<li>May have archiving/compliance-specific features as well.</li>
</ul>
<p>Along with Rainstor, SAND and SenSage have at least partially targeted that use case. In addition, appliance vendors such as Teradata and Netezza try to have an archive-oriented product version in their lineups.</p>
<p><strong><em>Outsourced data mart</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All</li>
<li><em>Likely use styles:</em> Traditional BI, investigative analytics, staging/ETL</li>
<li><em>Examples:</em> Advertising tracking, SaaS CRM</li>
<li><em>Stresses:</em> Performance, TCO, reliability, concurrency</li>
</ul>
<p>Much of what happens in analytic database management can also be outsourced. Some applications that run via SaaS (Software as a Service) are analytic. I&#8217;ve had three different clients whose main business is picking marketing targets in various vertical segments; others who wanted to add analytics to what were historically OLTP applications; and others yet who just offered online business intelligence. Also, if your fundamental business is gathering data and reselling it to a variety of user organizations, that&#8217;s an analytic data management challenge. The possibilities expand from there.</p>
<p>Data outsourcers are in the IT business, and so their IT development is &#8212; hopefully! &#8212; more serious and less politically encumbered than at many conventional enterprises. Thus, legacy systems and master data management issues are commonly less prevalent, or at least more aggressively disposed of. The same, up to a point, goes for vendor politics.*  <a href="../../../../../2011/06/26/what-to-think-about-before-you-make-a-technology-decision/">Multitenancy</a> is commonly an issue, as is running in the cloud.<em> </em></p>
<p><em>*Even so, there&#8217;s often That Guy who doesn&#8217;t want to migrate away from Oracle, no matter what.</em></p>
<p>Vertica gets the nod in a number of these cases; it&#8217;s cloud-friendly, and often the problem is naturally columnar. Other columnar products can be good choices too, with added brownie points for Infobright if the shop is MySQL-oriented anyway. Running Netezza or other appliances makes sense mainly if you&#8217;re pretty sure you want to keep operating your own data centers, but some data outsourcers are just fine with that assumption.</p>
<p><strong><em>Operational analytic(s) server</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> Customer-centric, log, financial trade</li>
<li><em>Likely use styles:</em> Advanced operational analytics</li>
<li><em>Examples:</em>
<ul>
<li>Lower latency: Web or call-center personalization, anti-fraud</li>
<li>Higher latency: Customer profiling, Basel 3 risk analysis</li>
</ul>
</li>
<li><em>Stresses:</em> Performance, reliability, analytic functionality, perhaps concurrency</li>
</ul>
<p>Even with eight different choices, I need a &#8220;catch-all&#8221; category; this is it.</p>
<p>Suppose you want to do reasonably sophisticated analytics, then use the results in operations. This is the classical challenge in <a href="../../../../../2011/03/30/short-request-and-analytic-processing/">integrating short-request and analytic processing</a>. There are multiple ways to tackle it, embodying different trade-offs in cost, convenience, or analytic accuracy. If the platform on which you want to run your investigative analytics also has the reliability and concurrency appropriate for mission-critical operations, you&#8217;re set. Otherwise, you may want to pipe <a href="../../../../../2010/11/29/data-that-is-derived-augmented-enhanced-adjusted-or-cooked/">derived data</a> into a more &#8220;industrial-strength&#8221; DBMS, ideally the one that runs your operational apps anyway.</p>
<p>Another option is to integrate a limited amount of analytics immediately into your short-request processing system. For example, as bad as they are at the kinds of queries that require joins, NoSQL systems are often fast at simple aggregations. As MapReduce/NoSQL integrations mature, that option may not require pumping the data anywhere else for deeper analytics; even if it does, at least you&#8217;re starting out with the data in a convenient bit bucket.</p>
<p>Streaming/CEP-centric architectures could come into play as well. And it goes on from there. The possibilities in this last category are just too varied to generalize about.</p>
<p><em>So did I get them all? Or are there yet other analytic data management use cases that don&#8217;t fit into my eight categories?</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>What to think about BEFORE you make a technology decision</title>
		<link>http://www.dbms2.com/2011/06/26/what-to-think-about-before-you-make-a-technology-decision/</link>
		<comments>http://www.dbms2.com/2011/06/26/what-to-think-about-before-you-make-a-technology-decision/#comments</comments>
		<pubDate>Sun, 26 Jun 2011 18:51:06 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Buying processes]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4835</guid>
		<description><![CDATA[When you are considering technology selection or strategy, there are a lot of factors that can each have bearing on the final decision &#8212; a whole lot. Below is a very partial list. In almost any IT decision, there are a number of environmental constraints that need to be acknowledged. Organizations may have standard vendors, [...]]]></description>
			<content:encoded><![CDATA[<p>When you are considering technology selection or strategy, there are a lot of factors that can each have bearing on the final decision &#8212; a whole lot. Below is a very partial list.</p>
<p>In almost any IT decision, there are a number of <strong>environmental constraints</strong> that need to be acknowledged. Organizations may have <strong>standard vendors</strong>, favored vendors, or simply vendors who give them <a href="../../../../../2011/06/24/observations-on-oracle-pricing/">particularly deep discounts</a>. <strong>Legacy systems</strong> are in place, application and system alike, and may or may not be open to replacement. Enterprises may have on-premise or off-premise preferences; SaaS (Software as a Service) vendors probably have <strong>multitenancy</strong> concerns. Your organization can determine which aspects of your system you&#8217;d ideally like to see be tightly <strong>integrated </strong>with each other, and which you&#8217;d prefer to keep only loosely coupled. You may have biases for or against <strong>open-source software.</strong> You may be pro- or anti-<strong>appliance.</strong> Some applications have a substantial need for elastic scaling. And some kinds of issues cut across multiple areas, such as <strong>budget</strong>, <strong>timeframe, security, </strong>or<strong> trained personnel.</strong></p>
<p>Multitenancy is particularly interesting, because it has numerous implications. <span id="more-4835"></span>If you&#8217;re a SaaS vendor supporting multiple customers, you must keep each customer&#8217;s data inaccessible to other users* &#8212; even if you offer high levels of flexibility or customization. You probably also want to keep data logically partitioned by user, in a way that the DBMS recognizes; you may also want that partition to hunt as a pack for caching purposes, especially if no one customer occupies a large part of your database. Administratively, you need a way to measure customer-specific metrics of the sort that might go into SLAs (Service-Level Agreements).</p>
<p><em>*Of course, there are exceptions. One of my clients is a SaaS vendor facilitating commerce; the whole point of their app is to let two different customers see and update the same records.</em></p>
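<p>As a rough illustration of the multitenancy points above, here is a minimal Python sketch using the standard-library <code>sqlite3</code> module. The table, column, and function names are all hypothetical, and this is application-level scoping only; it is not the DBMS-recognized partitioning described above, for which a real system would more likely rely on the DBMS&#8217;s own partitioning or row-level security features.</p>

```python
import sqlite3


def make_demo_db():
    """Build an in-memory table shared by several tenants (all names hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (tenant_id TEXT NOT NULL, order_id INTEGER, amount REAL)"
    )
    # An index led by tenant_id lets the DBMS find one customer's rows
    # without scanning everybody else's.
    conn.execute("CREATE INDEX orders_by_tenant ON orders (tenant_id, order_id)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [("acme", 1, 10.0), ("acme", 2, 5.5), ("globex", 1, 99.0)],
    )
    return conn


def orders_for_tenant(conn, tenant_id):
    """Route every read through one helper so the tenant filter is never forgotten."""
    cur = conn.execute(
        "SELECT order_id, amount FROM orders WHERE tenant_id = ? ORDER BY order_id",
        (tenant_id,),
    )
    return cur.fetchall()


def usage_by_tenant(conn):
    """Per-customer row counts and totals: raw material for SLA-style reporting."""
    cur = conn.execute(
        "SELECT tenant_id, COUNT(*), SUM(amount) FROM orders "
        "GROUP BY tenant_id ORDER BY tenant_id"
    )
    return cur.fetchall()


conn = make_demo_db()
print(orders_for_tenant(conn, "acme"))
print(usage_by_tenant(conn))
```

<p>The design choice worth noting is that the tenant filter lives in one place rather than in every query; the per-tenant aggregate is the simplest possible stand-in for customer-specific metrics.</p>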
<p>Getting more specific now, I&#8217;m usually called upon to <a href="http://www.monash.com/adviseusers.html">advise users</a> in two categories &#8212; those that already know they want to upgrade analytic functionality, and those that quickly realize they do once I remind them of it. Even so, many organizations struggle with the question &#8220;What do you want to do analytically?&#8221; It&#8217;s tough to blame them, for the question is distressingly circular; <strong>a big part of analytics is figuring out which kinds of analytics are worth doing.</strong> Also, SaaS vendors often struggle with the same question for a different reason, responding &#8220;Well, we know we&#8217;ve only been giving them basic stuff to date. What else do you think they would like?&#8221;</p>
<p>There&#8217;s no perfect solution to those difficulties, but a good way to start the evaluation is by assessing:</p>
<ul>
<li>The<strong> nature and value of your decisions that analytics could reasonably affect.</strong></li>
<li>Your <strong>realistic scope for automation of analytic decisions.</strong></li>
<li>The <strong>number and training of your &#8220;full-time analysts&#8221;</strong> &#8212; statisticians, SQL jocks who can program, SQL jocks who can&#8217;t really program, full-time users of BI tools, whatever.</li>
<li>The <strong>number and training of your &#8220;part-time analysts&#8221;</strong> &#8212; normal business users who can get something out of a dashboard, and perhaps even drill down into it.</li>
</ul>
<p>That should at least tell you which broad categories of analytics you want to engage in, and roughly how advanced in those areas you should try to be.</p>
<p><em>Basic business intelligence/dashboarding? Surely. Visualization-centric BI? If nothing else, it demos well. Basic predictive modeling? Hmm, are you sure nobody will want that? Advanced predictive modeling? Um, are you sure your users can handle that, or that the results will be worth the investment?</em></p>
<p>When I talk with users, there&#8217;s usually a data management problem in the mix too. In such cases, I quickly ask about <strong>data-related metrics</strong>, starting with database size, ingest volumes (batch, if relevant, but especially continuous), and simultaneous query load/concurrent user count. Similarly important are requirements for various kinds of <a href="http://www.dbms2.com/2009/09/10/analytic-speed-latency/">latency</a>, the big two being <strong>query response time</strong> and <strong>how long it takes for data to first be available for query. </strong>Less numeric questions in a similar vein boil down to &#8220;What kinds of requests will you make against the database, in what volume?&#8221;</p>
<p><em>And this loops back to the analytic-user inventory. Suppose you had a near-real-time dashboard &#8212; would anybody actually look at it minute to minute?</em></p>
<p>Specialized metrics I request when considering analytic DBMS include &#8220;How many columns are there in your widest table?&#8221; and &#8220;How many joins &#8212; or lines of SQL &#8212; are there in your most complex query?&#8221;, both of which are tools for assessing &#8220;Is your use case naturally columnar?&#8221;. Another, more general <strong>&#8220;natural structure of data&#8221;</strong> kind of consideration is what structure the data is in before it gets to the database being discussed; candidates include relational batch, XML stream, log file, and many more.</p>
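<p>The &#8220;widest table&#8221; question can usually be answered straight from the system catalog. Here is a minimal Python sketch against SQLite (standard-library <code>sqlite3</code>); the demo schema and the function name <code>column_counts</code> are hypothetical, and other DBMS expose the same information through their own catalogs rather than through SQLite&#8217;s <code>PRAGMA table_info</code>.</p>

```python
import sqlite3


def column_counts(conn):
    """Return (table_name, column_count) pairs, widest table first."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    counts = []
    for name in tables:
        # PRAGMA table_info yields one row per column of the named table.
        quoted = '"' + name.replace('"', '""') + '"'
        n_columns = len(conn.execute("PRAGMA table_info(" + quoted + ")").fetchall())
        counts.append((name, n_columns))
    return sorted(counts, key=lambda pair: pair[1], reverse=True)


# Hypothetical demo schema: one narrow table and one wide one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
wide_columns = ", ".join("metric_{} REAL".format(i) for i in range(40))
conn.execute("CREATE TABLE web_events (id INTEGER, " + wide_columns + ")")

print(column_counts(conn)[0])
```

<p>A very wide table whose typical queries touch only a handful of columns is one hint that a use case may be naturally columnar; the count alone does not settle the question.</p>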
<p>Also crucial are requirements for <strong><a href="http://www.dbms2.com/2010/05/01/ryw-read-your-writes-consistency/">consistency</a>, availability, </strong>and<strong> data integrity.</strong> Those tell you your needs in <strong>high availability </strong>and<strong> disaster recovery,</strong> and perhaps even how picky you have to be about your brands of hardware, software, or cloud/hosting provider. They also indicate how much you should care about relational or ACID properties, and where you should come down on <a href="http://www.dbms2.com/2010/03/12/some-nosql-links/">CAP Theorem</a> trade-offs.</p>
<p><em>I could go on even longer, but those seem like a pretty good set of initial questions with which to start discussions of data management, data integration, and analytic tools and architectures. What do you think I left out? And what do you think I could make substantially clearer by just adding a few more words? Any comments will be much appreciated.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/06/26/what-to-think-about-before-you-make-a-technology-decision/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Investigative analytics and derived data: Enzee Universe 2011 talk</title>
		<link>http://www.dbms2.com/2011/06/19/investigative-analytics-derived-data/</link>
		<comments>http://www.dbms2.com/2011/06/19/investigative-analytics-derived-data/#comments</comments>
		<pubDate>Sun, 19 Jun 2011 12:13:04 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[GIS and geospatial]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4747</guid>
		<description><![CDATA[I&#8217;ll be speaking Monday, June 20 at IBM Netezza&#8217;s Enzee Universe conference. Thus, as is my custom: I&#8217;m posting draft slides. I&#8217;m encouraging comment (especially in the short time window before I have to actually give the talk). I&#8217;m offering links below to more detail on various subjects covered in the talk. The talk concept [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ll be speaking Monday, June 20 at IBM Netezza&#8217;s <a href="http://www.netezza.com/userconference/abstract.html#620_1200">Enzee Universe</a> conference. Thus, as is my custom:</p>
<ul>
<li>I&#8217;m posting draft <a href="http://www.monash.com/uploads/Enzee-Universe-2011-Monash.ppt">slides</a>.</li>
<li>I&#8217;m encouraging comment (especially in the short time window before I have to actually give the talk).</li>
<li>I&#8217;m offering links below to more detail on various subjects covered in the talk.</li>
</ul>
<p>The talk concept started out as &#8220;advanced analytics&#8221; (as opposed to fast query, a subject amply covered in the rest of any Netezza event), as a lunch break in what is otherwise a detailed &#8220;best practices&#8221; session. So I suggested we constrain the subject by focusing on a specific application area &#8212; customer acquisition and retention, something of importance to almost any enterprise, and which exploits most areas of analytic technology. Then I actually prepared the slides &#8212; and guess what? The mix of subjects will be skewed somewhat more toward generalities than I first intended, specifically in the areas of <strong>investigative analytics </strong>and<strong> derived data. </strong>And, as always when I speak, I&#8217;ll try to raise consciousness about the issues of <a href="../../../../../2011/01/10/privacy-dangers-an-overview/">liberty and privacy</a>, our <a href="../../../../../2010/07/04/fair-data-use/">options as a society for addressing them</a>, and the crucial role we play as an industry in <a href="../../../../../2010/04/04/privacy-liberty-continued/">helping policymakers deal with these technologically-intense subjects</a>.</p>
<p>Slide 3 refers back to a post I made last December, saying there are <a href="../../../../../2011/01/03/the-six-useful-things-you-can-do-with-analytic-technology/">six useful things you can do with analytic technology</a>:</p>
<ul>
<li><strong>Operational BI/Analytically-infused operational apps:</strong> You can make an immediate decision.</li>
<li><strong>Planning and budgeting:</strong> You can plan in support of future decisions.</li>
<li><strong>Investigative analytics (multiple disciplines):</strong> You can research, investigate, and analyze in support of future decisions.</li>
<li><strong>Business intelligence:</strong> You can monitor what’s going on, to see when it is necessary to decide, plan, or investigate.</li>
<li><strong>More BI:</strong> You can communicate, to help other people and organizations do these same things.</li>
<li><strong>DBMS, ETL, and other &#8220;platform&#8221; technologies:</strong> You can provide support, in technology or data gathering, for one of the other functions.</li>
</ul>
<p>Slide 4 observes that <a href="http://www.dbms2.com/2011/03/03/investigative-analytics/">investigative analytics</a>:</p>
<ul>
<li>Is the most rapidly advancing of the six areas &#8230;</li>
<li>&#8230; because it most directly exploits performance &amp; scalability.</li>
</ul>
<p>Slide 5 gives my simplest overview of investigative analytics technology to date:  <span id="more-4747"></span></p>
<ul>
<li>Fast query
<ul>
<li>Persistent storage (any data volume)</li>
<li>RAM (10s-100s of gigabytes, or more)</li>
</ul>
</li>
<li>Fast analytics
<ul>
<li>Predictive modeling</li>
<li>Transformation/tagging</li>
<li>Graph</li>
</ul>
</li>
</ul>
<p>Slide 6 points out that this is all supported by cheap data creation and acquisition, specifically in the area of <a href="http://www.dbms2.com/2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a>, which gets the full benefit of Moore&#8217;s Law.</p>
<p>Slides 7-13 point out how the example problem domain involves lots of analytic tasks performed on lots of kinds of data. Specific examples cited include <a href="http://www.dbms2.com/2011/04/14/attensity-update/">text analytics</a> and <a href="http://www.dbms2.com/2009/08/21/social-network-analysis-aka-relationship-analytics/">graph/relationship analytics</a>.</p>
<p>Slide 14 contains the punch line, so I&#8217;ll quote it in full:</p>
<blockquote><p>Derived data</p>
<ul>
<li>You can’t keep re-analyzing all that in raw form …</li>
<li>&#8230; so don’t.</li>
</ul>
<p><em>If you have one takeaway from this session, let it be the utter importance of derived data. </em></p></blockquote>
<p>Slide 16 lists kinds of <a href="http://www.dbms2.com/2011/05/30/another-category-of-derived-data/">derived data</a> that are important in the single application of reducing telco churn:</p>
<ul>
<li>Normalized data
<ul>
<li>Parsed/sessionized logs</li>
<li>Text/sentiment highlights</li>
<li>Social network graph(s)</li>
<li>Web de-anonymization</li>
<li>Household matching</li>
</ul>
</li>
<li>Scores and buckets
<ul>
<li>Demographic</li>
<li>Psychographic</li>
<li>Offer hot buttons</li>
<li>(Dis)satisfaction</li>
<li>Credit/fraud risk</li>
<li>Lifetime customer value</li>
<li>Influence on others!</li>
</ul>
</li>
</ul>
<p>And finally, Slide 17 is my first pass at best practices for dealing with derived data:</p>
<ul>
<li>Evolving data warehouse schema</li>
<li>Data marts
<ul>
<li>Physical or virtual</li>
<li>Inputs/outputs to “EDW”</li>
</ul>
</li>
<li>“Data science”
<ul>
<li>Research != production</li>
</ul>
</li>
<li>Multiple processing pipelines
<ul>
<li>Log parsing</li>
<li>Text</li>
<li>Predictive analytics</li>
<li>Generic ETL</li>
<li>Streaming “ETL”</li>
</ul>
</li>
</ul>
<p>That last list looks like a starting point for a whole set of interesting future posts.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/06/19/investigative-analytics-derived-data/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Metaphors amok</title>
		<link>http://www.dbms2.com/2011/06/15/metaphors-amok/</link>
		<comments>http://www.dbms2.com/2011/06/15/metaphors-amok/#comments</comments>
		<pubDate>Wed, 15 Jun 2011 07:26:30 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Fun stuff]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Humor]]></category>
		<category><![CDATA[MapReduce]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4708</guid>
		<description><![CDATA[It all started when I disputed James Kobielus&#8217; blogged claim that Hadoop is the nucleus of the next-generation cloud EDW. Jim posted again to reiterate the claim, only this time he wrote that all EDW vendors [will soon] bring Hadoop into the heart of their architectures. (All emphasis mine.) That did it. I tweeted, in [...]]]></description>
			<content:encoded><![CDATA[<p>It all started when I disputed James Kobielus&#8217; blogged claim that <a href="http://www.dbms2.com/2011/06/05/hadoop-confusion-from-forrester-research/">Hadoop is the <strong>nucleus</strong> of the next-generation cloud EDW</a>. Jim posted again to reiterate the claim, only this time he wrote that <a href="http://blogs.forrester.com/james_kobielus/11-06-08-hadoop_future_of_enterprise_data_warehousing_are_you_kidding">all EDW vendors [will soon] bring Hadoop into the <strong>heart</strong> of their architectures</a>. (All emphasis mine.)</p>
<p>That did it. I tweeted, in succession:</p>
<ul>
<li>Actually, I vote for Hadoop as the <strong>lungs</strong> of the EDW &#8212; first place of entry for essential nutrients.</li>
<li>Data integration can be the <strong>heart</strong> of the EDW, pumping stuff around. RDBMS/analytic platform can be the <strong>brain.</strong></li>
<li>iPad-based dashboards that may engender envy, but which actually are only used occasionally and briefly &#8230; well, you get the picture.*</li>
</ul>
<p><em>*Woody Allen said in </em>Sleeper<em> that the brain was his </em>second<em>-favorite organ.</em></p>
<p>Of course, that <strong>body</strong> of work was quickly challenged. Responses included:  <span id="more-4708"></span></p>
<ul>
<li><a href="http://twitter.com/#!/SethGrimes/status/80677365713350656">Re: your metaphor, oxygen is used in the combustion process that turns nutrients into energy, so Hadoop is more <strong>mouth</strong>-ish</a>* (Seth Grimes)</li>
<li><a href="http://twitter.com/#!/JAdP/status/80685143047671808">Data Quality/Governance can be the <strong>liver,</strong> filtering out toxins</a> (Joseph di Paolantonio)</li>
<li><a href="http://twitter.com/#!/NeilRaden/status/80685914497630209">what&#8217;s nice is that BI (DSS) used to be the <strong>colon,</strong> but now it&#8217;s the Krebs Cycle</a> (Neil Raden)</li>
</ul>
<p><em>*<a href="http://www.monash.com/barlow.html">Linda</a> agrees with me that oxygen is a nutrient, and she&#8217;s taught both physiology and English at the college level.</em></p>
<p>But seriously, folks &#8212; I disagree with Jim&#8217;s follow-up post a lot less than I do with the first one. He&#8217;s still overstating the case, of course, and he still seems confused about how some of the pieces of technology work. Even so, I agree that Hadoop is likely to play an important role in many enterprises&#8217; analytic ecosystems, for both its <a href="http://www.dbms2.com/2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a> and its parallel analytics capabilities.</p>
<p>Meanwhile, in unrelated puns, Chris Kanaracus and I had something of a <a href="http://www.pcworld.com/businesscenter/article/230032/midmarket_cios_have_analytics_on_the_brain.html">pizza party</a>:</p>
<blockquote><p>&#8220;Too many [midsized] businesses&#8217; ideas for <strong>slicing</strong> data don&#8217;t get far past the <strong>pie</strong>-chart level,&#8221; said analyst Curt Monash of Monash Research. &#8220;&#8216;Big data&#8217; isn&#8217;t even the issue for them; they could get much more value than they do now just from <strong>personal-sized</strong> data sets.&#8221;</p>
<p>That&#8217;s not the case with Northeast pizza chain Papa Gino&#8217;s, which is using IBM analytics software to crunch business data in many ways, said Martha Lieber, director of business systems for the 280-restaurant company, which also includes a number of D&#8217;Angelo sandwich shops.</p>
<p>This work has resulted in some <strong>delicious</strong> insights &#8230;</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/06/15/metaphors-amok/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hadoop confusion from Forrester Research</title>
		<link>http://www.dbms2.com/2011/06/05/hadoop-confusion-from-forrester-research/</link>
		<comments>http://www.dbms2.com/2011/06/05/hadoop-confusion-from-forrester-research/#comments</comments>
		<pubDate>Sun, 05 Jun 2011 23:12:57 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data integration and middleware]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4621</guid>
		<description><![CDATA[Jim Kobielus started a recent post Most Hadoop-related inquiries from Forrester customers come to me. These have moved well beyond the “what exactly is Hadoop?” phase to the stage where the dominant query is “which vendors offer robust Hadoop solutions?” What I tell Forrester customers is that, yes, Hadoop is real, but that it’s still [...]]]></description>
			<content:encoded><![CDATA[<p>Jim Kobielus started <a href="http://blogs.forrester.com/james_kobielus/11-06-03-hadoop_is_it_soup_yet">a recent post</a>:</p>
<blockquote><p>Most Hadoop-related inquiries from Forrester customers come to me. These have moved well beyond the “what exactly is Hadoop?” phase to the stage where the dominant query is “which vendors offer robust Hadoop solutions?”</p>
<p>What I tell Forrester customers is that, yes, Hadoop is real, but that it’s still quite immature.</p></blockquote>
<p>So far, so good. But I disagree with almost everything Jim wrote after that.</p>
<p>Jim&#8217;s thesis seems to be that Hadoop will only be mature when a significant fraction of analytic DBMS vendors have own-branded versions of Hadoop alongside their DBMS, possibly via acquisition. Based on this, he calls for a formal, presumably vendor-driven Hadoop standardization effort, evidently for the whole Hadoop stack. He also says that</p>
<blockquote><p>Hadoop is the nucleus of the next-generation cloud EDW, but that promise is still 3-5 years from fruition</p></blockquote>
<p>where by &#8220;cloud&#8221; I presume Jim means first and foremost &#8220;private cloud.&#8221;</p>
<p><strong>I don&#8217;t think any of that matches Hadoop&#8217;s actual strengths and weaknesses, </strong>whether now or in the 3-7 year future. My reasoning starts:</p>
<ul>
<li>Hadoop is well on its way to being a surviving <a href="http://www.dbms2.com/2011/06/04/dirty-data-stored-dirt-cheap/">data-storage-plus-processing system</a> &#8212; like an analytic DBMS or DBMS-imitating data integration tool &#8230;</li>
<li>&#8230; but Hadoop is best-suited for somewhat different use cases than those technologies are, and the gap won&#8217;t close as long as the others remain a moving target.</li>
<li>I don&#8217;t think MapReduce is going to fail altogether; it&#8217;s too well-suited for too many use cases.</li>
<li>Hadoop (as opposed to general MapReduce) has too much momentum to fizzle, unless perhaps it is supplanted by one or more embrace-and-extend MapReduce-plus systems that do a lot more than it does.</li>
<li>The way for Hadoop to avoid being a MapReduce afterthought is to evolve sufficiently quickly itself; ponderous standardization efforts are quite beside the point.</li>
</ul>
<p>As for the rest of Jim&#8217;s claim &#8212; I see <strong>three main candidates for the &#8220;nucleus of the next-generation enterprise data warehouse,&#8221;</strong> each with better claims than Hadoop:</p>
<ul>
<li><strong>Relational DBMS,</strong> much like today. (E.g., Teradata, DB2, Exadata or their successors.) This is the case in which robustness of the central data store matters most.</li>
<li>Grand cosmic <strong>data integration tools.</strong> (The descendants of Informatica PowerCenter, et al.) This is the case in which the logic of data relationships can safely be separated from physical storage.</li>
<li><strong>Nothing.</strong> (The architecture could have several strong members, none of which is truly the &#8220;nucleus.&#8221;) This is the case in which new ways keep being invented to extract high value from data, outrunning what grandly centralized solutions can adapt to. I think this is the most likely case of all.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/06/05/hadoop-confusion-from-forrester-research/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
	</channel>
</rss>

