<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; Investment research and trading</title>
	<atom:link href="http://www.dbms2.com/category/applications/algorithmic-trading-investment-research/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Thu, 09 Feb 2012 09:21:51 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>Agile predictive analytics &#8211; the heart of the matter</title>
		<link>http://www.dbms2.com/2011/11/28/agile-predictive-analytics-the-heart-of-the-matter/</link>
		<comments>http://www.dbms2.com/2011/11/28/agile-predictive-analytics-the-heart-of-the-matter/#comments</comments>
		<pubDate>Mon, 28 Nov 2011 19:40:26 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[SAS Institute]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5746</guid>
		<description><![CDATA[I&#8217;ve already suggested that several apparent issues in predictive analytic agility can be dismissed by straightforwardly applying best-of-breed technology, for example in analytic data management. At first blush, the same could be said about the actual analysis, which comprises: Data preparation, which is tedious unless you do a good job of automating it. Running the [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve already suggested that several apparent issues in <a href="http://www.dbms2.com/2011/11/28/agile-predictive-analytics-the-easy-parts/">predictive analytic agility</a> can be dismissed by straightforwardly applying best-of-breed technology, for example in analytic data management. At first blush, the same could be said about the actual analysis, which comprises:</p>
<ul>
<li>Data preparation, which is tedious unless you do a good job of automating it.</li>
<li>Running the actual algorithms.</li>
</ul>
<p>Numerous statistical software vendors (or open source projects) help you with the second part; some make strong claims in the first area as well (e.g., my clients at KXEN). Even so, large enterprises typically have statistical silos, commonly featuring expensive annual SAS licenses and seemingly slow-moving SAS programmers.</p>
<p>As I see it, the predictive analytics workflow goes something like this<span id="more-5746"></span>:</p>
<ul>
<li>Business-knowledgeable people develop a theory as to what kinds of information and segmentation could be valuable in making better business micro-decisions.</li>
<li>Statistics-knowledgeable people determine a structure for modeling that reflects this theory.</li>
<li>Statistics-knowledgeable people tweak the model over time, within a fixed general structure, as new data comes in.</li>
<li>(Optional) Somebody sees to acquiring whatever data is needed that the organization doesn&#8217;t already have (and won&#8217;t get in the ordinary course of ongoing business).</li>
</ul>
<p>The optional last part can be a purchase of third-party information (relatively fast and easy) or the development of a business process (and if necessary associated software) to capture the information (not always so easy). But even if that&#8217;s taken care of, or not present, we have at least two hand-offs where agility can be lost:</p>
<ul>
<li>Businesspeople may throw a request &#8220;over the wall&#8221; to the statisticians, who then work on it as their schedule permits.</li>
<li>Once created, a model may be so set in stone that even small changes are as hard as building a new model from scratch.</li>
</ul>
<p>The second problem can be solved by the statisticians themselves, without outside involvement. Model research and model refinement should be separate processes. You can recheck your clustering on one schedule, but recalibrate your regressions against each cluster more frequently. If that all sounds forbiddingly difficult, perhaps your model recalibration process needs another level of automation.</p>
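<p>As a rough illustration of keeping model research and model refinement on separate schedules, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names (<code>assign_cluster</code>, <code>fit_line</code>, <code>recalibrate</code>) are hypothetical, the cluster centers stand in for whatever segmentation the statisticians settled on during model research, and a one-variable least-squares line stands in for the real regressions.</p>

```python
from statistics import mean


def assign_cluster(x, centers):
    """Return the index of the nearest center (a stand-in for a real clustering model)."""
    return min(range(len(centers)), key=lambda i: abs(x - centers[i]))


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x on (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def recalibrate(points, centers):
    """Frequent job: refit one regression per cluster; the centers are left untouched."""
    groups = {i: [] for i in range(len(centers))}
    for x, y in points:
        groups[assign_cluster(x, centers)].append((x, y))
    # Skip clusters that are empty or have only one distinct x value (no line to fit).
    return {
        i: fit_line(g)
        for i, g in groups.items()
        if len({x for x, _ in g}) not in (0, 1)
    }


# Hypothetical data: the centers come from the slow "research" cycle,
# the points from the most recent batch of data.
centers = [1.0, 10.0]
points = [(0, 1), (1, 3), (2, 5), (9, 0), (10, 10), (11, 20)]
model = recalibrate(points, centers)
```

<p>In this sketch, <code>recalibrate</code> could run as often as new data arrives, while <code>centers</code> would be revisited only on the slower research schedule. That split is the point, not the particular algorithms.</p>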
<p>So I&#8217;ve finally gotten to the point of saying what may have been obvious from the start: <strong>The only excusable impediment to predictive analytic agility is the hand-off from the people who know the business to the people who know the math.</strong> So let&#8217;s examine ways that difficulty can be resolved.</p>
<p>At big internet companies, the usual answer is something like</p>
<blockquote><p>Hey, it&#8217;s just data. From web logs. And network event logs. The data scientists know how to handle that.</p></blockquote>
<p>In financial trading firms, the answer is more</p>
<blockquote><p>The traders and analysts work closely together. Very closely. In fact, when the traders rip out their phones and throw them across the room, the analysts need to duck to avoid getting clobbered.</p></blockquote>
<p>In credit card or telecom marketing or insurance actuarial organizations, the answer may be</p>
<blockquote><p>Don&#8217;t worry; the stats geeks have been at this for a long time; they really do understand our business.</p></blockquote>
<p>All three approaches work.</p>
<p>But what about conventional enterprises, where line-of-business people may not be as math-savvy as internet developers or financial traders, and where the math experts may not have the business issues down cold? My flippant answer is that businesspeople should know some math too.* My more serious answer is that <strong>the &#8220;business analyst&#8221; role should be expanded</strong> beyond BI and planning <strong>to include lightweight predictive analytics as well.</strong></p>
<p><em>*I wasn&#8217;t being entirely flippant, of course. Statistics is even being taught in high school these days. And when I got a PhD in game theory, 2/3 of my thesis committee was at the Harvard Business School.</em></p>
<p>For example, at retailers:</p>
<ul>
<li>Market basket analysis is pretty simplistic (it only looks at small subsets of a basket at a time).</li>
<li>Seasonality is tricky. (Weather and so on can skew it.)</li>
<li>Each store or region can be its own universe.</li>
<li>Some of the results of analytics are rather coarse-grained &#8212; e.g., merchandise adjacencies &#8212; so precision in statistical analysis may not matter much anyway.</li>
</ul>
<p>And so truly rigorous statistical analysis may be both unfeasible and unnecessary; a lot of business-informed seat-of-the-pants reasoning needs to be mixed in. Consequently, there&#8217;s a lot to be said for pushing at least some retail predictive analytics pretty close to the merchandising department(s).</p>
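<p>To make the &#8220;small subsets of a basket&#8221; point concrete, here is a minimal Python sketch of pairwise market basket counting. It is an illustration under stated assumptions, not any vendor&#8217;s method: the function names and the sample baskets are hypothetical, and only pairs of items are considered.</p>

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so each pair has one canonical form.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


def confidence(counts, baskets, a, b):
    """Estimate the share of baskets containing a that also contain b."""
    with_a = sum(1 for basket in baskets if a in basket)
    pair = tuple(sorted((a, b)))
    return counts[pair] / with_a if with_a else 0.0


# Hypothetical sample data.
baskets = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["milk", "eggs"],
    ["bread"],
]
counts = pair_counts(baskets)
bread_to_milk = confidence(counts, baskets, "bread", "milk")
```

<p>Nothing here knows about seasonality, weather, or store-to-store differences, which is one reason business judgment has to be layered on top of such counts.</p>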
<p>Similar stories could be told in many other industries and pursuits, including but emphatically not limited to:</p>
<ul>
<li>Event marketing.</li>
<li>College admissions.</li>
<li>Political campaigning.</li>
<li>Field maintenance at utility companies.</li>
<li>Price-setting (across many industries).</li>
</ul>
<p>In each case, it&#8217;s easy to see how statistical and predictive analytic techniques could add real value to the business. But it&#8217;s hard to imagine how the enterprise could support the kind of large, experienced, business-knowledgeable analytic operation one might find in hedge fund investing or telecom churn analysis. And absent that, it&#8217;s tough to see why the only people doing predictive analytics for the organization should sit in some silo of statistical expertise.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/28/agile-predictive-analytics-the-heart-of-the-matter/feed/</wfw:commentRss>
		<slash:comments>19</slash:comments>
		</item>
		<item>
		<title>Agile predictive analytics &#8212; the &#8220;easy&#8221; parts</title>
		<link>http://www.dbms2.com/2011/11/28/agile-predictive-analytics-the-easy-parts/</link>
		<comments>http://www.dbms2.com/2011/11/28/agile-predictive-analytics-the-easy-parts/#comments</comments>
		<pubDate>Mon, 28 Nov 2011 19:38:58 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5747</guid>
		<description><![CDATA[I&#8217;m hearing a lot these days about agile predictive analytics, albeit rarely in those exact terms. The general idea is unassailable, in that it boils down to using data as quickly as reasonably possible. But discussing particulars is hard, for several reasons: Pundits tend to sketch castles in the air. Vendors tend to confuse part [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m hearing a lot these days about <strong>agile predictive analytics</strong>, albeit rarely in those exact terms. The general idea is unassailable, in that it boils down to <strong>using data as quickly as reasonably possible.</strong> But discussing particulars is hard, for several reasons:</p>
<ul>
<li><a href="http://www.column2.com/2011/11/agile-predictive-process-platforms-for-business-agility-with-jameskobielus/">Pundits tend to sketch castles in the air</a>.</li>
<li>Vendors tend to confuse part of the story &#8212; generally the part they happen to offer <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  &#8212; with the whole.</li>
<li>Different use cases give rise to different kinds of issues.</li>
</ul>
<p>At least three of the generic arguments for agility apply to predictive analytics:</p>
<ul>
<li>Doing the correct thing soon is usually better than doing the same correct thing later.</li>
<li>If something doesn&#8217;t take much time to do, hopefully it doesn&#8217;t cost that much (labor and so on) either.</li>
<li>It&#8217;s hard to get new stuff completely right on the first try. Often, the best strategy is to come close fast, then fix what&#8217;s still not ideal.</li>
</ul>
<p>But the reasons to want agile predictive analytics don&#8217;t stop there.</p>
<p><span id="more-5747"></span>Not only is it hard to get stuff right on the first try for a given information set, but the available information can also change quickly. For example:</p>
<ul>
<li>If you&#8217;re a consumer marketer, consumer tastes can change quickly, due to news (of many different kinds), seasonal trends, and so on. The most recent data you have contain information unavailable in your historical data sets. Also &#8230;</li>
<li>&#8230; if you change your offers, prices, ad placement, ad text, ad appearance, call center scripts, or anything else, you immediately gain new information that isn&#8217;t well-reflected in your previous models.</li>
<li>If you&#8217;re in capital markets, and you figure something out, probably so will rival investors. So whatever you knew three weeks ago may already be partially obsolete.</li>
</ul>
<p>What&#8217;s more, often you deliberately don&#8217;t want to test, model, or tune all your variables at once. First you determine whether the ad text should be &#8220;Would you be so kind as to allow us to supply you with our wares?&#8221; or &#8220;Buy it, dude!&#8221;; only afterwards do you decide whether the color scheme should rely on red or green.</p>
<p>With that as backdrop, how can you make your predictive analytics more agile? Let&#8217;s start by breaking predictive analytics into four pieces:</p>
<ul>
<li><a href="http://www.dbms2.com/2011/11/28/terminology-data-mustering/">Data mustering</a> for the analysts.</li>
<li>Actual analysis.</li>
<li>Data mustering for deployment.</li>
<li>Deployment.</li>
</ul>
<p><strong>Only the second of those has much excuse for being an agility bottleneck;</strong> the other three are well addressed by technology you can buy (or straightforwardly build) today.</p>
<p>The deployment part of the story can be pretty simple, at least technically &#8212; spit out some PMML (Predictive Modeling Markup Language), and if you&#8217;re deploying to a DBMS with good enough PMML support, you&#8217;re good to go. Any vendor who doesn&#8217;t offer that degree of simplicity had better be working toward it fast. That said, your applications that are infused with predictive analytics need to be modular enough to accommodate model changes; if not, some refactoring lies ahead. And the same can be said for the work processes that surround them.</p>
<p>The data mustering parts should be pretty straightforward too. Setting up a relational data mart tuned for <a href="http://www.dbms2.com/2011/03/03/investigative-analytics/">investigative analytics</a> isn&#8217;t all that hard or costly (perhaps unless your database is enormously large), and the same actually goes for a Hadoop cluster. Beyond that, if you can model and deploy from the same database, that&#8217;s great; if not, you have an ETL (Extract/Transform/Load) need. I guess you could have data quality/MDM (Master Data Management) issues as well, but offhand I&#8217;m not seeing why you wouldn&#8217;t push their solutions back to analysis time. And any decent analytic technology stack can give sub-hour latency; <a href="http://www.dbms2.com/2009/09/10/analytic-speed-latency/">while that may not suffice from all standpoints</a>, it&#8217;s plenty fast enough for analysis-time agility.</p>
<p>With those preliminaries out of the way, now let&#8217;s turn to <a href="http://www.dbms2.com/2011/11/28/agile-predictive-analytics-the-heart-of-the-matter/">the heart of the agile predictive analytics challenge</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/28/agile-predictive-analytics-the-easy-parts/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>Terminology: Data mustering</title>
		<link>http://www.dbms2.com/2011/11/28/terminology-data-mustering/</link>
		<comments>http://www.dbms2.com/2011/11/28/terminology-data-mustering/#comments</comments>
		<pubDate>Mon, 28 Nov 2011 19:10:11 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Complex event processing (CEP)]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Sybase]]></category>
		<category><![CDATA[Teradata]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5736</guid>
		<description><![CDATA[I find myself in need of a word or phrase that means bring data together from various sources so that it&#8217;s ready to be used, where the use can be analysis or operations. The first words I thought of were &#8220;aggregation&#8221; and &#8220;collection,&#8221; but they both have other meanings in IT. Even &#8220;data marshalling&#8221; has [...]]]></description>
			<content:encoded><![CDATA[<p>I find myself in need of a word or phrase that means <strong>bring data together from various sources so that it&#8217;s ready to be used,</strong> where the use can be analysis or operations. The first words I thought of were &#8220;aggregation&#8221; and &#8220;collection,&#8221; but they both have other meanings in IT. Even &#8220;data marshalling&#8221; has a specific meaning different from what I want. So instead, I&#8217;ll go with <strong>data mustering.</strong></p>
<p>I mean for the term &#8220;data mustering&#8221; to encompass at least three scenarios:</p>
<ul>
<li>Integrated (relational) data warehouse.</li>
<li>Big bit bucket.</li>
<li>Big bit stream.</li>
</ul>
<p>Let me explain what I mean by each.  <span id="more-5736"></span></p>
<p><strong>&#8220;Integrated data warehouse&#8221;</strong> is a phrase Teradata has started using for enterprise data warehouses that, <a href="../../../../../2010/04/12/enterprise-data-warehouse-edw-myt/">like approximately every other EDW in the entire history of data warehousing</a>, aren&#8217;t truly enterprise-wide. In other words, it means &#8220;not just a data mart&#8221;. <a href="http://www.strategicmessaging.com/no-market-categorization-is-ever-precise/2011/03/01/">No category name is perfect</a>, but I think that one works reasonably well.</p>
<p>I previously described the <strong><a href="../../../../../2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a></strong> use case as</p>
<blockquote><p>Users take a whole lot of data, often <a href="../../../../../2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a> in logs of different kinds, and dump it into one place, managed by Hadoop, at open-source pricing.</p></blockquote>
<p>and quickly added</p>
<blockquote><p>Of course, there are various outfits who’d like to sell you not-so-cheap bit buckets. Contending technologies include <a href="../../../../../2011/06/02/why-you-would-want-an-appliance-and-when-you-wouldnt/">Hadoop appliances</a> (which I don’t believe in), <a href="../../../../../2009/10/18/technical-introduction-to-splunk/">Splunk</a> (which in many use cases I do), and <a href="../../../../../2010/11/29/marklogic-and-its-document-dbms/">MarkLogic</a> (ditto, but often the cases are different from Splunk’s). Cloudera and IBM, among other vendors, would also like to sell you some proprietary software to go with your standard Apache Hadoop code.</p></blockquote>
<p>I think I&#8217;ll stand pat on that explanation. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>By analogy, a <strong>big bit stream</strong> is various streams of data, assembled in the custody of a streaming engine. Sybase told me Wednesday that this scenario appears in both of the traditional markets for CEP/streaming &#8212; national intelligence, where it is a major use of streaming, and capital markets, in some use cases. That&#8217;s consistent with what I&#8217;ve heard from other CEP/streaming vendors as well.</p>
<p>As for where I got the word &#8220;mustering&#8221; &#8212; it&#8217;s a military term, for when you assemble your troops and their gear either for inspection or for actual use. The main modern usage I know of the word is as part of the phrase &#8220;pass muster&#8221;, which originally referred to the concept that the person being paid to put a regiment together should from time to time demonstrate that the regiment physically existed in the form that regimental records seemed to show.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/28/terminology-data-mustering/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Some big-vendor execution questions, and why they matter</title>
		<link>http://www.dbms2.com/2011/11/21/big-vendor-execution-analytics/</link>
		<comments>http://www.dbms2.com/2011/11/21/big-vendor-execution-analytics/#comments</comments>
		<pubDate>Mon, 21 Nov 2011 11:01:20 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Cognos]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[HP and Neoview]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[In-memory DBMS]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Memory-centric data management]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[SAP AG]]></category>
		<category><![CDATA[Vertica Systems]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5704</guid>
		<description><![CDATA[When I drafted a list of key analytics-sector issues in honor of look-ahead season, the first item was &#8220;execution of various big vendors&#8217; ambitious initiatives&#8221;.  By &#8220;execute&#8221; I mean mainly: &#8220;Deliver products that really meet customers&#8217; desires and needs.&#8221; &#8220;Successfully convince them that you&#8217;re doing so &#8230;&#8221; &#8220;&#8230; at an attractive overall cost.&#8221; Vendors mentioned [...]]]></description>
			<content:encoded><![CDATA[<p>When I drafted a list of key analytics-sector issues in honor of <a href="http://www.dbms2.com/2011/11/21/analytic-trends-in-2012-qa/">look-ahead season</a>, the first item was &#8220;execution of various big vendors&#8217; ambitious initiatives&#8221;.  By &#8220;execute&#8221; I mean mainly:</p>
<ul>
<li>&#8220;Deliver products that really meet customers&#8217; desires and needs.&#8221;</li>
<li>&#8220;Successfully convince them that you&#8217;re doing so &#8230;&#8221;</li>
<li>&#8220;&#8230; at an attractive overall cost.&#8221;</li>
</ul>
<p>Vendors mentioned here are Oracle, SAP, HP, and IBM. Anybody smaller got left out due to the length of this post. Among the bigger omissions were:</p>
<ul>
<li>salesforce.com (multiple subjects).</li>
<li><a href="../../../../../2011/04/21/sas-hpa-does-make-sense-after-all/">SAS HPA</a>.</li>
<li><a href="../../../../../2011/08/21/hadoop-evolution/">The evolution of Hadoop</a>.</li>
</ul>
<p><span id="more-5704"></span><strong>A (lingering) issue for SAP and Oracle alike</strong></p>
<p>As I noted in January of this year, <a href="../../../../../2011/01/03/the-six-useful-things-you-can-do-with-analytic-technology/">integration of business intelligence into operational apps is making very slow progress</a>. Even so, it&#8217;s a huge part of the apparent strategy at SAP and Oracle alike, as well it should be. Much of the benefit from automating routine desk work has already happened. The areas ripest for exploitation are the ones where analytics are part of the equation.</p>
<p>Given the lack of tangible progress, why do I think this is a genuine area of Oracle and SAP emphasis? Three reasons of many are:</p>
<ul>
<li>Why else did SAP buy Business Objects?</li>
<li>If they&#8217;re not trying to <a href="../../../../../2011/03/30/short-request-and-analytic-processing/">integrate operational apps and analytics</a>, why else does SAP&#8217;s emphasis on HANA make sense?</li>
<li>Without business intelligence in the picture, how does Oracle&#8217;s integrated-stack story promise any direct user benefits?*</li>
</ul>
<p><em>*As opposed to IT concerns &#8212; integration, administration, TCO (Total Cost of Ownership), etc.</em></p>
<p>After so many years of disappointment, I&#8217;m not going to forecast 2012 as a pivotal year for <strong>the integration of business intelligence into operational applications.</strong> But if either SAP or Oracle ever does get a significant BI/operational app integration lead over the other, that could be a major competitive advantage in those application market segments that are still up for grabs. It is also an opportunity for both vendors to gain BI market share in their respective application customer bases.</p>
<p><strong>A more urgent issue for SAP</strong></p>
<p>SAP has put huge amounts of credibility on the line for HANA, the integration of two different and not particularly mature in-memory database technologies. So far, it is difficult to find evidence that HANA is robust enough for widespread adoption. Whether or not SAP can fix that is a huge open question, which could have significant impact on the course of several technology areas: applications, business intelligence, in-memory DBMS, and maybe even hardware.</p>
<p>Based on current information, which is admittedly partial, I&#8217;m a short-term pessimist on HANA. Longer-term, I&#8217;m on record as saying that <a href="../../../../../2011/05/23/databases-ram/">traditional databases will eventually wind up in RAM</a>. SAP will surely get that technology right some day, whether or not the way it does so has anything to do with present-day HANA code.</p>
<p><strong>Four more issues for Oracle</strong></p>
<p>Oracle&#8217;s ambitions are near-endless, and so, therefore, is its list of execution challenges. Four in the analytics area that I find particularly interesting are:</p>
<ul>
<li><strong>True hybrid columnar DBMS.</strong> <a href="../../../../../2011/09/22/teradata-columnar-compression/">I was guessing that Oracle, like Teradata, would announce true hybrid columnar the week of Oracle OpenWorld</a>. I was wrong. But if Oracle can&#8217;t bring out true hybrid columnar DBMS functionality relatively soon, Exadata will lose credibility as a competitor to more specialized analytic DBMS.</li>
<li><strong>Oracle Exalytics.</strong> With Exalytics in the mix, Oracle&#8217;s technology stack has HANA-like potential. But will Exalytics even ship in 2012? (I think so.) Will it be good for much in the first release? (I&#8217;m skeptical.)</li>
<li><strong>Oracle&#8217;s Big Data Appliance</strong>. I&#8217;m skeptical both about <a href="../../../../../2011/10/20/more-notes-on-oracle-nosql/">Oracle&#8217;s NoSQL product</a> &#8212; <a href="http://www.infoworld.com/d/data-explosion/first-look-oracle-nosql-database-179107">a favorable InfoWorld review</a> notwithstanding &#8212; and <a href="../../../../../2011/09/23/hadoop-appliances/">Hadoop appliances</a>. But if I&#8217;m wrong, and Oracle can successfully embrace/extend the new non-relational paradigms, then it really might regain control over the evolution of data management.</li>
<li><strong><a href="../../../../../2011/10/18/oracle-is-buying-endeca/">Oracle&#8217;s Endeca acquisition</a></strong> &#8212; will Oracle prove me wrong and integrate Endeca effectively into its overall analytic product line? If it does, we might finally see effective text (and eventually speech) navigation of enterprise software. (But as with all Oracle issues cited here, this is something that probably won&#8217;t amount to much in 2012 even if it does later go well.)</li>
</ul>
<p><strong>Three issues for IBM</strong></p>
<p>Like Oracle, IBM is a huge company with many ambitions and hence many execution challenges. The biggest of those is surely: <strong>How effective can IBM be at selling outside its existing customer base?</strong> I don&#8217;t hear as much competitively about IBM DataStage, IBM SPSS or now IBM Netezza as I did when their vendors were independent companies. Even Cognos may not be much of an exception to the rule, although it has its own large customer base outside of IBM&#8217;s traditional one. (To lesser extents, the same is of course true of Netezza and numerous other IBM acquisitions.)</p>
<p>Another general issue for IBM is <strong>substantively integrating its various product lines,</strong> at least to the extent that makes sense. DB2/Netezza integration sounds good, but even that is more a matter of product marketing (the admirable part of that discipline) than of actual technology. Other integrations (e.g. Cognos/DB2 in various bundles) have tended toward the dubious side.*</p>
<p><em>*I&#8217;m still waiting for IBM to get back to me with examples of how Cognos/DB2 joint tuning amounts to anything. It&#8217;s been more than a year, so I&#8217;m glad I didn&#8217;t hold my breath.</em></p>
<p>In a somewhat narrower vein, I wonder: <strong><a href="../../../../../2011/11/10/cep-streaming-catchup/">Will IBM be able to gain traction for InfoSphere Streams</a>? </strong>And if so, when and where will the traction be?</p>
<p><strong>Will HP screw up Vertica?</strong></p>
<p>Vertica has a very attractive product offering. It&#8217;s perhaps <a href="../../../../../2011/06/20/columnar-dbms-vendor-customer-metrics/">the most scalable analytic DBMS outside of Teradata</a>, running on the hardware of your reasonable choice.  It&#8217;s also the one I recommend most often to clients in the 1-50 terabyte range.</p>
<p>So far HP doesn&#8217;t seem to have done much to leadfoot Vertica. (About all I&#8217;ve heard from competitors is that Vertica seems to have faded somewhat in the financial services market, and there could be multiple explanations if that is indeed true.) But if HP Vertica does somehow manage to botch things, opportunities will open up for a range of columnar analytic DBMS competitors.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/21/big-vendor-execution-analytics/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>StreamBase catchup</title>
		<link>http://www.dbms2.com/2011/11/10/streambase-catchup/</link>
		<comments>http://www.dbms2.com/2011/11/10/streambase-catchup/#comments</comments>
		<pubDate>Fri, 11 Nov 2011 03:31:44 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Complex event processing (CEP)]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[StreamBase]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5630</guid>
		<description><![CDATA[While I was cryptic in my general CEP/streaming catchup, I&#8217;ll say a bit more regarding StreamBase in particular. At the highest level, non-technically: StreamBase once planned to conquer the world. However, StreamBase really only sold effectively in the financial trading and intelligence markets. StreamBase retrenched, focusing almost exclusively on the financial trading market. With StreamBase [...]]]></description>
			<content:encoded><![CDATA[<p>While I was cryptic in my general <a href="http://www.dbms2.com/2011/11/10/cep-streaming-catchup/">CEP/streaming catchup</a>, I&#8217;ll say a bit more regarding StreamBase in particular. At the highest level, non-technically:</p>
<ul>
<li>StreamBase once planned to conquer the world.</li>
<li>However, StreamBase really only sold effectively in the financial trading and intelligence markets.</li>
<li>StreamBase retrenched, focusing almost exclusively on the financial trading market.</li>
<li>With <a href="http://www.dbms2.com/2011/11/10/streambase-liveview-push-based-real-time-bi/">StreamBase LiveView</a>, StreamBase is expanding from embedded <a href="../../../../../2011/11/08/terminology-operational-analytics/">operational analytics</a> to do (also operational) business intelligence as well.</li>
<li>StreamBase is hopeful that, perhaps starting with Version 2 or so, LiveView will be successful outside the financial trading market.</li>
</ul>
<p><span id="more-5630"></span><em>Not coincidental to these shifts in focus, StreamBase was our client, then stopped being one for a while, and now is a client again.</em></p>
<p>StreamBase (the product set) consists primarily of three things (LiveView aside):</p>
<ul>
<li>A development environment, whose output is in &#8230;</li>
<li>&#8230; a visual programming language called EventFlow &#8230;</li>
<li>&#8230; which is compiled and executed by StreamBase&#8217;s execution layers.</li>
</ul>
<p>One important set of ancillary products is StreamBase&#8217;s connectors to various data sources &#8212; StreamBase offers about 125 of its own, a number that approaches 200 when <a href="../../../../../2010/02/16/quick-thoughts-on-the-streambase-component-exchange/">community contributions</a> are included.</p>
<p>StreamBase has a second programming language called StreamSQL, but that&#8217;s rarely used except for embedding in or connecting to third-party software. EventFlow and StreamSQL compile to nearly identical byte code. (The main difference seems to be that as a practical matter you&#8217;ll name things a bit differently in the two languages, focusing on verbs in EventFlow and nouns in StreamSQL.)</p>
<p>StreamBase says that in the financial trading market, great performance out of the box equates to better time-to-value, since you are spared time you&#8217;d otherwise have to spend tuning the system. Implicit in that is a claim &#8212; which competitors might dispute <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  &#8212; that StreamBase has great <a href="../../../../../2009/05/21/notes-on-cep-performance/">performance</a>. StreamBase fondly thinks that having a domain-specific language gives it a leg up in achieving great compiler optimization. (The same would presumably apply to StreamBase&#8217;s competitors, but only if they have optimizing compilers themselves.)</p>
<p>One point that&#8217;s a little unusual for me these days is that StreamBase favors big SMP (Symmetric MultiProcessing) boxes over blade-based scale-out. 16+ cores and 256 gigabytes of RAM are not uncommon. Clusters commonly include 4-8 machines, but rarely more; the largest StreamBase cluster evidently contains 36 machines.</p>
<p>And with that I&#8217;ll turn to StreamBase&#8217;s newest offering, <a href="http://www.dbms2.com/2011/11/10/streambase-liveview-push-based-real-time-bi/">LiveView</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/10/streambase-catchup/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>IBM is buying parallelization expert Platform Computing</title>
		<link>http://www.dbms2.com/2011/10/11/ibm-is-buying-parallelization-expert-platform-computing/</link>
		<comments>http://www.dbms2.com/2011/10/11/ibm-is-buying-parallelization-expert-platform-computing/#comments</comments>
		<pubDate>Tue, 11 Oct 2011 16:13:05 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Scientific research]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5473</guid>
		<description><![CDATA[IBM is acquiring Platform Computing, a company with which I had one briefing, last August. Quick background includes:  Platform Computing started ~20 years ago. Platform Computing claimed close to $100 million in revenue and &#62;500 people. (This is Platform Computing&#8217;s most famous splash to date.) Platform Computing technology underlies SAS Institute&#8217;s preferred method of parallelization, [...]]]></description>
			<content:encoded><![CDATA[<p>IBM is acquiring Platform Computing, a company with which I had one briefing, last August. Quick background includes:  <span id="more-5473"></span></p>
<ul>
<li>Platform Computing started ~20 years ago.</li>
<li>Platform Computing claimed close to $100 million in revenue and &gt;500 people.</li>
<li><strong>(This is Platform Computing&#8217;s most famous splash to date.)</strong> Platform Computing technology underlies SAS Institute&#8217;s preferred method of parallelization, which may variously be called:
<ul>
<li>SAS Grid Manager (the more or less official brand name).</li>
<li><a href="../../../../../2011/04/21/sas-hpa-does-make-sense-after-all/">SAS HPA</a> (High Performance Analytics), sort of an alternate brand name.</li>
<li>MPI (Message Passing Interface), the industry&#8217;s name for the underlying semantics/syntax/API.</li>
</ul>
</li>
<li>Platform Computing&#8217;s original business was scientific grid computing.</li>
<li>Platform Computing&#8217;s second major business was its &#8220;Symphony&#8221; product line. According to Platform Computing, Symphony:
<ul>
<li>Debuted 6-7 years ago.</li>
<li>Is more commercially oriented.</li>
<li>Is what supports SAS HPA.</li>
<li>SAS aside, has been sold to Wall Street and so on.</li>
<li>Is sometimes used in conjunction with <a href="../../../../../2011/08/25/renaming-cep-or-not/">CEP/streaming</a>, mainly for backtesting.</li>
<li>Can be used to build global (parallel) persistent memory for R.</li>
</ul>
</li>
<li><strong>(This is probably why IBM is buying Platform Computing.)</strong> Platform Computing has a new MapReduce offering that:
<ul>
<li>Is based on Symphony.</li>
<li>Shipped last July, except that early access was a couple months before that.</li>
<li>Is focused on:
<ul>
<li>Lowering the latency of MapReduce.</li>
<li>Consolidating multiple MapReduce use cases into one high(er)-utilization cluster.</li>
<li>Offering workload management in support of those goals.</li>
<li>Reliability, availability, predictability, puppies, kittens, and apple pie.</li>
</ul>
</li>
<li>Is most specifically a MapReduce run-time engine, with other stuff beyond that.</li>
</ul>
</li>
</ul>
<p>Unfortunately, I&#8217;m not precisely clear as to how tied this offering is to Hadoop, although using it with Hadoop is at least the base case. Platform Computing did say:</p>
<ul>
<li>It can support multiple virtual Hadoop clusters, which can be grown or shrunk at will.</li>
<li>Non-Hadoop workloads can be mixed in.</li>
</ul>
<p>Platform Computing said that key technical benefits of this offering included:</p>
<ul>
<li><strong>1-3 seconds to start a job, vs. 40-50 in generic Hadoop.</strong></li>
<li>Automatic recovery of JobTracker nodes.</li>
<li>Failover for NameNodes.</li>
<li>Workload management that:
<ul>
<li>Manages all of CPU, I/O, and RAM (this is quickly becoming an industry standard level of capability, although I&#8217;m judging more by the standards of the analytic DBMS world).</li>
<li>Monitors but doesn&#8217;t actively manage network resources.</li>
<li>Can reprioritize jobs that are in flight. (Also an industry-standard capability.)</li>
</ul>
</li>
</ul>
<p>This conflation of scientific, commercial analytic, streaming, and MapReduce is right in IBM&#8217;s philosophical wheelhouse. I base that comment on, among other factors:</p>
<ul>
<li>How IBM positions &#8220;Big Insights&#8221;.</li>
<li>IBM&#8217;s &#8220;smart consolidation&#8221; picture/pitch (which I really should get around to posting).</li>
<li>The fuss IBM makes about Watson, Blue Gene, and so on.</li>
</ul>
<p>The IBM acquisition probably obviates a lot of Platform Computing&#8217;s previous business comments, but at the time they included:</p>
<ul>
<li>POCs (Proofs of Concept):
<ul>
<li>Mainly in financial services, government, and telecom.</li>
<li>At both existing customers and new prospects.</li>
<li>Typically running 30-50 nodes, 2-50 terabytes.* The smallest databases evidently tended to be at financial services firms.</li>
</ul>
</li>
<li>Pricing that was starting out:
<ul>
<li>Perpetual license: $3450/server, 21% annual maintenance after the first year.</li>
<li>Subscription: $2070/server annually, or $3070 with HDFS support bundled in.</li>
</ul>
</li>
</ul>
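<p>The two pricing options above cross over fairly quickly. Here is a rough sketch of the per-server arithmetic in Python. It assumes that maintenance is charged on the full $3450 list price starting in year 2, that prices stay flat, that there is no discounting, and that the HDFS-bundled subscription is ignored. None of those assumptions came from Platform Computing; they are just the simplest reading of the figures.</p>

```python
# Cumulative per-server cost under the two quoted pricing options.
# Assumptions (not from the vendor): maintenance is 21% of the list
# license price, charged every year after the first; prices stay flat.
PERPETUAL_LICENSE = 3450.0      # one-time fee per server
MAINTENANCE_RATE = 0.21         # annual, after the first year
SUBSCRIPTION_PER_YEAR = 2070.0  # annual fee per server, without HDFS support


def perpetual_cost(years: int) -> float:
    """License fee plus maintenance for every year after the first."""
    return PERPETUAL_LICENSE + MAINTENANCE_RATE * PERPETUAL_LICENSE * max(years - 1, 0)


def subscription_cost(years: int) -> float:
    """Flat annual subscription, summed over the given number of years."""
    return SUBSCRIPTION_PER_YEAR * years


def break_even_year(max_years: int = 20):
    """First whole year in which the subscription has cost more than the perpetual license."""
    for years in range(1, max_years + 1):
        if subscription_cost(years) > perpetual_cost(years):
            return years
    return None


if __name__ == "__main__":
    for years in range(1, 6):
        print(years, round(perpetual_cost(years), 2), round(subscription_cost(years), 2))
    print("break-even year:", break_even_year())
```

<p>Under those assumptions the subscription is cheaper through year 2 (about $4,140 cumulative vs. about $4,175), and the perpetual license pulls ahead from year 3 on.</p>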
<p><em><strong>*1 terabyte or less per node</strong> is probably the lowest data-per-node figure I&#8217;ve heard for anything Hadoop-like &#8212; even below Hadapt, and well below what <a href="../../../../../2011/07/06/hadoop-hardware-and-compression/">Cloudera and Hortonworks</a> usually see.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/10/11/ibm-is-buying-parallelization-expert-platform-computing/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Hadoop notes</title>
		<link>http://www.dbms2.com/2011/09/12/hadoop-notes/</link>
		<comments>http://www.dbms2.com/2011/09/12/hadoop-notes/#comments</comments>
		<pubDate>Mon, 12 Sep 2011 09:03:52 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Health care]]></category>
		<category><![CDATA[Hortonworks]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapR]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5218</guid>
		<description><![CDATA[I visited California recently, and chatted with numerous companies involved in Hadoop &#8212; Cloudera, Hortonworks, MapR, DataStax, Datameer, and more. I&#8217;ll defer further Hadoop technical discussions for now &#8212; my target to restart them is later this month &#8212; but that still leaves some other issues to discuss, namely adoption and partnering. The total number [...]]]></description>
			<content:encoded><![CDATA[<p>I visited California recently, and chatted with numerous companies involved in Hadoop &#8212; Cloudera, Hortonworks, MapR, DataStax, Datameer, and more. I&#8217;ll defer further <a href="../../../../../2011/08/21/hadoop-evolution/">Hadoop technical discussions</a> for now &#8212; my target to restart them is later this month &#8212; but that still leaves some other issues to discuss, namely adoption and partnering.</p>
<p>The total number of enterprises in the world paying subscription and license fees that they would regard as being for &#8220;Hadoop or something Hadoop-related&#8221; probably is not much over 100 right now, but I&#8217;d expect to see pretty rapid growth. Beyond that, let&#8217;s divide customers into three groups:</p>
<ul>
<li>Internet businesses.</li>
<li>Traditional enterprises&#8217; internet operations.</li>
<li>Traditional enterprises&#8217; other operations.</li>
</ul>
<p>Hadoop vendors, in different mixes, claim to be doing well in all three segments. Even so, almost all use cases involve some kind of <a href="../../../../../2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a>, with one exception being a credit card vendor crunching a large database of transaction details. Multiple kinds of machine-generated data come into play &#8212; web/network/mobile device logs, financial trade data, scientific/experimental data, and more. In particular, pharmaceutical research got some mentions, which makes sense, in that it&#8217;s one area of scientific research that actually enjoys fat for-profit research budgets.</p>
<p><span id="more-5218"></span>On the partnering side, I heard things about a Hortonworks conference call that do not seem to have been contradicted by my visit to Hortonworks. Namely, Hortonworks promised prospective partners, such as analytic DBMS vendors, hardware vendors, or large system integrators, that it wouldn&#8217;t compete with them, in that Hortonworks pledges not to introduce its own products for at least two years. This is presumably targeted most directly at <a href="../../../../../2010/10/10/partnering-with-cloudera/">Cloudera</a>, which has lots of partners, but also some <a href="../../../../../2010/06/30/cloudera-enterprise-hadoop-evolution/">proprietary code</a> of its own. MapR, I&#8217;d think, would be the #2 target, but that&#8217;s just speculation.</p>
<p>The other big part of <a href="../../../../../2011/07/10/cloudera-and-hortonworks/">Hortonworks&#8217; story</a> is the claim that it holds the axe in Apache Hadoop development. Nobody doubts that a large fraction of the work on Hadoop&#8217;s core projects was done by Yahoo employees. Many of those indeed moved to Hortonworks; others left Yahoo earlier; Hadoop creator Doug Cutting is actually at Cloudera. So just how dominant Hortonworks really is in core Hadoop development is a bit unclear. Meanwhile, Cloudera people seem to be leading a number of Hadoop companion or sub-projects, including the first two I can think of that relate to Hadoop integration or connectivity, namely Sqoop and Flume. So I&#8217;m not persuaded that the &#8220;we know this stuff better&#8221; part of the Hortonworks partnering story really holds up.</p>
<p>What I am persuaded of is that the Hadoop platform competition is a good thing. Whichever vendors and projects win will be healthier from having had to outcompete worthy alternatives.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/09/12/hadoop-notes/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Petabyte-scale Hadoop clusters (dozens of them)</title>
		<link>http://www.dbms2.com/2011/07/06/petabyte-hadoop-clusters/</link>
		<comments>http://www.dbms2.com/2011/07/06/petabyte-hadoop-clusters/#comments</comments>
		<pubDate>Wed, 06 Jul 2011 05:15:21 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4886</guid>
		<description><![CDATA[I recently learned that there are 7 Vertica clusters with a petabyte (or more) each of user data. So I asked around about other petabyte-scale clusters. It turns out that there are several dozen such clusters (at least) running Hadoop. Cloudera can identify 22 CDH (Cloudera Distribution [of] Hadoop) clusters holding one petabyte or more [...]]]></description>
			<content:encoded><![CDATA[<p>I recently learned that there are <a href="../../../../../2011/06/20/columnar-dbms-vendor-customer-metrics/">7 Vertica clusters with a petabyte</a> (or more) each of user data. So I asked around about other petabyte-scale clusters. It turns out that there are several dozen such clusters (at least) running Hadoop.</p>
<p>Cloudera can identify 22 CDH (Cloudera Distribution [of] Hadoop) clusters holding one petabyte or more of user data each, at 16 different organizations. This does not count Facebook or Yahoo, who are huge Hadoop users but not, I gather, running CDH. Meanwhile, Eric Baldeschwieler of Hortonworks tells me that Yahoo&#8217;s latest stated figures are:</p>
<ul>
<li>42,000 Hadoop nodes &#8230;</li>
<li>&#8230; holding 180-200 petabytes of data.</li>
</ul>
<p><span id="more-4886"></span>That works out near the low end of the range I came up with for Yahoo&#8217;s newest gear, namely <a href="http://www.dbms2.com/2011/07/06/hadoop-hardware-and-compression/">36-90 TB/node</a>. Yahoo&#8217;s biggest clusters are a little over 4,000 nodes (a limitation that&#8217;s getting worked on), and Yahoo has over 20 clusters in total.</p>
<p>Based on those numbers, it would seem that 10 or more of Yahoo&#8217;s Hadoop clusters are probably in the petabyte range. Facebook no doubt has a few petabyte-scale Hadoop clusters as well. So we&#8217;re probably over 3 dozen petabyte+ Hadoop clusters, just counting Yahoo, Facebook, and CDH users. There surely are others too, running Apache Hadoop without Cloudera&#8217;s help.</p>
<p>We also have some more information about the scale of Hadoop usage, and the markets it is being used in, because Omer Trajman of Cloudera kindly wrote the following &#8212; lightly edited as usual &#8212; for quotation:</p>
<blockquote><p>The number of Petabyte+ Hadoop clusters expanded dramatically over the past year, with our recent count reaching 22 in production (in addition to the well-known clusters at Yahoo! and Facebook). Just as our poll back at Hadoop World 2010 showed the average cluster size at just over 60 nodes, today it tops 200. While mean is not the same as median (most clusters are under 30 nodes), there are some beefy ones pulling up that average. Outside of the well-known large clusters at Yahoo and Facebook, we count today 16 organizations running PB+ clusters running CDH across a diverse number of industries including online advertising, retail, government, financial services, online publishing, web analytics and academic research. We expect to see many more in the coming years, as Hadoop gets easier to use and more accessible to a wide variety of enterprise organizations.</p></blockquote>
<p>Omer went on to add:</p>
<blockquote><p>The biggest number of PB clusters are in the advertising space. I often tell people that every ad you see on the internet touched at least one Hadoop cluster (or the Google equivalent).</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/06/petabyte-hadoop-clusters/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Eight kinds of analytic database (Part 2)</title>
		<link>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/</link>
		<comments>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/#comments</comments>
		<pubDate>Tue, 05 Jul 2011 08:18:18 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Archiving and information preservation]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Buying processes]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Complex event processing (CEP)]]></category>
		<category><![CDATA[Data mart outsourcing]]></category>
		<category><![CDATA[Data types]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MOLAP]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Rainstor]]></category>
		<category><![CDATA[SAND Technology]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[SenSage]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4867</guid>
		<description><![CDATA[In Part 1 of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I&#8217;ll cover four more kinds of analytic database &#8212; even newer, for the most part, with a use case/product short list [...]]]></description>
			<content:encoded><![CDATA[<p>In <a href="http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/">Part 1</a> of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I&#8217;ll cover four more kinds of analytic database &#8212; even newer, for the most part, with a use case/product short list match that is even less clear.  <span id="more-4867"></span></p>
<p><strong><em>Bit bucket</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included: </em>Logs, other technical/external</li>
<li><em>Likely use styles:</em> Staging/ETL, investigative</li>
<li><em>Canonical example: </em>Log files in a Hadoop cluster</li>
<li><em>Stresses:</em> TCO, scale-out, transform/big-query performance, ETL functionality</li>
</ul>
<p>With the explosion of <a href="../../../../../2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a> has come the need for a place to put it all, sometimes called the <a href="../../../../../2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a>. This is like the investigative data mart for big databases, but more <a href="../../../../../2011/05/17/poly-structured-database/">poly-structured</a>. In some cases it is focused on data staging and transformation; but it can also be used for analysis in place.</p>
<p>The list of candidate technologies to run your bit bucket starts with Hadoop and Splunk.</p>
<p><strong><em>Archival data store</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included: </em>Operational, CDR (call detail record), security log</li>
<li><em>Likely use styles:</em> Archival, reporting (for compliance), possibly also investigative</li>
<li><em>Examples:</em> Any long-term detailed historical store</li>
<li><em>Stresses: </em>TCO, compression, scale-out, performance (if multi-use)</li>
</ul>
<p>Analytic DBMS vendors have been insulting each other with the claim &#8220;that&#8217;s just an archival data store,&#8221; dating back at least to the first time Greenplum was deployed on an underpowered Sun Thumper system. Perhaps only <a href="../../../../../2010/06/11/rainstor-update/">Rainstor</a> truly embraces the archival positioning, and I&#8217;ve become pretty dubious about their technical claims and their company alike.</p>
<p>Still, there&#8217;s a legitimate need for data stores, especially relational analytic DBMS, that:</p>
<ul>
<li>Store data cheaply, with high rates of compression.</li>
<li>Have decent performance if you do want to query the data.</li>
<li>May have archiving/compliance-specific features as well.</li>
</ul>
<p>Along with Rainstor, SAND and SenSage have at least partially targeted that use case. In addition, appliance vendors such as Teradata and Netezza try to have an archive-oriented product version in their lineups.</p>
<p><strong><em>Outsourced data mart</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All</li>
<li><em>Likely use styles:</em> Traditional BI, investigative analytics, staging/ETL</li>
<li><em>Examples:</em> Advertising tracking, SaaS CRM</li>
<li><em>Stresses:</em> Performance, TCO, reliability, concurrency</li>
</ul>
<p>Much of what happens in analytic database management can also be outsourced. Some applications that run via SaaS (Software as a Service) are analytic. I&#8217;ve had three different clients whose main business is picking marketing targets in various vertical segments; others who wanted to add analytics to what were historically OLTP applications; and others yet who just offered online business intelligence. Also, if your fundamental business is gathering data and reselling it to a variety of user organizations, that&#8217;s an analytic data management challenge. The possibilities expand from there.</p>
<p>Data outsourcers are in the IT business, and so their IT development is &#8212; hopefully! &#8212; more serious and less politically encumbered than at many conventional enterprises. Thus, legacy systems and master data management issues are commonly less prevalent, or at least more aggressively disposed of. The same, up to a point, goes for vendor politics.* <a href="../../../../../2011/06/26/what-to-think-about-before-you-make-a-technology-decision/">Multitenancy</a> is commonly an issue, as is running in the cloud.</p>
<p><em>*Even so, there&#8217;s often That Guy who doesn&#8217;t want to migrate away from Oracle, no matter what.<strong> </strong></em></p>
<p>Vertica gets the nod in a number of these cases; it&#8217;s cloud-friendly, and often the problem is naturally columnar. Other columnar products can be good choices too, with added brownie points for Infobright if the shop is MySQL-oriented anyway. Running Netezza or other appliances makes sense mainly if you&#8217;re pretty sure you want to keep operating your own data centers, but some data outsourcers are just fine with that assumption.</p>
<p><strong><em>Operational analytic(s) server</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> Customer-centric, log, financial trade</li>
<li><em>Likely use styles:</em> Advanced operational analytics</li>
<li><em>Examples:</em>
<ul>
<li>Lower latency: Web or call-center personalization, anti-fraud</li>
<li>Higher latency: Customer profiling, Basel 3 risk analysis</li>
</ul>
</li>
<li><em>Stresses:</em> Performance, reliability, analytic functionality, perhaps concurrency</li>
</ul>
<p>Even with eight different choices, I need a &#8220;catch-all&#8221; category; this is it.</p>
<p>Suppose you want to do reasonably sophisticated analytics, then use the results in operations. This is the classical challenge in <a href="../../../../../2011/03/30/short-request-and-analytic-processing/">integrating short-request and analytic processing</a>. There are multiple ways to tackle it, embodying different trade-offs in cost, convenience, or analytic accuracy. If the platform on which you want to run your investigative analytics also has the reliability and concurrency appropriate for mission-critical operations, you&#8217;re set. Otherwise, you may want to pipe <a href="../../../../../2010/11/29/data-that-is-derived-augmented-enhanced-adjusted-or-cooked/">derived data</a> into a more &#8220;industrial-strength&#8221; DBMS, ideally the one that runs your operational apps anyway.</p>
<p>Another option is to integrate a limited amount of analytics immediately into your short-request processing system. For example, as bad as they are at the kinds of queries that require joins, NoSQL systems are often fast at simple aggregations. As MapReduce/NoSQL integrations mature, that option may not require pumping the data anywhere else for deeper analytics; even if it does, at least you&#8217;re starting out with the data in a convenient bit bucket.</p>
<p>Streaming/CEP-centric architectures could come into play as well. And it goes on from there. The possibilities in this last category are just too varied to generalize about.</p>
<p><em>So did I get them all? Or are there yet other analytic data management use cases that don&#8217;t fit into my eight categories?</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Eight kinds of analytic database (Part 1)</title>
		<link>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/</link>
		<comments>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/#comments</comments>
		<pubDate>Tue, 05 Jul 2011 08:17:44 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Benchmarks and POCs]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Buying processes]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[Infobright]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MOLAP]]></category>
		<category><![CDATA[Microsoft and SQL*Server]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[ParAccel]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Pricing]]></category>
		<category><![CDATA[QlikTech and QlikView]]></category>
		<category><![CDATA[SAND Technology]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Sybase]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[Workload management]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4868</guid>
		<description><![CDATA[Analytic data management technology has blossomed, leading to many questions along the lines of &#8220;So which products should I use for which category of problem?&#8221; The old EDW/data mart dichotomy is hopelessly outdated for that purpose, and adding a third category for &#8220;big data&#8221; is little help. Let&#8217;s try eight categories instead. While no categorization [...]]]></description>
			<content:encoded><![CDATA[<p>Analytic data management technology has blossomed, leading to many questions along the lines of &#8220;So which products should I use for which category of problem?&#8221; The old EDW/data mart dichotomy is hopelessly outdated for that purpose, and adding a third category for &#8220;big data&#8221; is little help.</p>
<p>Let&#8217;s try eight categories instead. While <a href="http://www.strategicmessaging.com/no-market-categorization-is-ever-precise/2011/03/01/">no categorization is ever perfect</a>, these each have at least some degree of technical homogeneity. Figuring out which types of analytic database you have or need &#8212; and in most cases you&#8217;ll need several &#8212; is a great early step in your analytic technology planning.  <span id="more-4868"></span></p>
<p><strong><em>Enterprise data warehouse</em></strong> (Full or partial)</p>
<ul>
<li><em>Kinds of data likely to be included:</em> All, but especially operational</li>
<li><em>Likely use styles:</em> All</li>
<li><em>Canonical example:</em> Central EDW for a big enterprise</li>
<li><em>Stresses:</em> Concurrency, reliability, workload management</li>
</ul>
<p>The enterprise data warehouse (EDW) ideal says that you copy all your data into one place, and drive all decision-making from there. <a href="../../../../../2011/06/21/its-official-the-grand-central-edw-will-never-happen/">Full EDWs are pipedreams</a>. Still, a partial EDW makes sense for most large enterprises, and many indeed already have one. The first product lines to consider for classical EDWs are Teradata, DB2, Exadata, and maybe Microsoft SQL Server, especially if you&#8217;re going to stress concurrency and/or operational use cases.</p>
<p><strong><em>Traditional data mart</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All</li>
<li><em>Likely use styles:</em> Business intelligence, budgeting/consolidation, investigative</li>
<li><em>Examples:</em> Reporting servers, planning/consolidation servers, anything MOLAP, etc.</li>
<li><em>Stresses:</em> Performance, concurrency, TCO</li>
</ul>
<p>Whether or not you have something like an enterprise data warehouse, it&#8217;s common to have lighter-weight data marts as well. A traditional data mart might drive reports and dashboards. Or it might be specialized for budgeting, planning, and/or consolidation. Some <a href="../../../../../2011/03/03/investigative-analytics/">investigative analytics</a> may be in the mix as well.</p>
<p>Any DBMS that can support an EDW can also support a data mart, but it may not be the most cost-effective way to do so. Columnar DBMS might have more attractive performance and TCO (Total Cost of Ownership); the same goes for Netezza. Some of the columnar products &#8212; e.g. Sybase IQ and <a href="../../../../../2011/06/20/vertica-release-5/">Vertica</a> &#8212; have excellent track records in concurrent usage as well. <a href="../../../../../2011/05/29/when-to-use-relational-database-management-system/">Ted Codd</a> pushed what amounts to MOLAP (Multidimensional OnLine Analytic Processing) systems for these use cases. But relational DBMS commonly do a better job, which is one reason most major MOLAP products have wound up at RDBMS companies.</p>
<p><strong><em>Investigative data mart &#8212; agile</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All, especially customer-centric</li>
<li><em>Likely use styles:</em> Investigative</li>
<li><em>Canonical example:</em> A few analysts getting a few TB to examine</li>
<li><em>Stresses:</em> Ease of setup/load, ease of admin, price/performance</li>
</ul>
<p>Besides the traditional data mart, there are at least two other kinds. Both are focused on investigative analytics, but they&#8217;re differentiated by database size.</p>
<p>If you have just a few analysts,* looking at no more than a few terabytes of data (perhaps even just some gigabytes) &#8212; and if that data is &#8220;single-subject&#8221; and fairly homogeneous &#8212; your watchwords should be &#8220;cheap&#8221;, &#8220;easy&#8221;, and &#8220;fast&#8221;. You don&#8217;t need to invest in much hardware, expensive software, or much administrative effort (the analysts can be their own DBAs), nor should you endure much set-up time. Just grab a product, grab some data, and start running queries (or extracts into the statistical tool of your choice).</p>
<p><em>*If you have dozens or even hundreds of analysts hitting the same database, you&#8217;re probably back to the more concurrency-oriented scenarios outlined above.</em></p>
<p>Infobright is often cost-effective among columnar analytic DBMS. Other vendors might cut you a price break as well. If you have multiple terabytes of data, don&#8217;t rule out Netezza&#8217;s lowest-end products (even if they&#8217;d really rather sell you something bigger). Or, if you&#8217;re in the sub-terabyte range, maybe you can get by with an in-memory BI tool such as QlikView, and not do anything special on the DBMS side at all.</p>
<p><strong><em>Investigative data mart &#8212; big</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All, especially customer-centric, logs, financial trade, scientific</li>
<li><em>Likely use styles:</em> Investigative</li>
<li><em>Canonical example:</em> Single-subject 20 TB &#8211; 20 PB relational database</li>
<li><em>Stresses:</em> Performance, scale-out, analytic functionality</li>
</ul>
<p>But if you&#8217;re looking at tens of terabytes of relational data, or even more, you really do have a &#8220;big data&#8221; problem. Performance and scalability are major challenges, usually best addressed by MPP (Massively Parallel Processing) systems, such as Netezza, Vertica, Aster Data, ParAccel, Teradata, or Greenplum. Performance POCs (Proofs Of Concept) are a big part of the buying process. Vendor price negotiations are crucial too.</p>
<p><em>Actually, in the low tens of terabytes you might be able to get away with a shared-disk system that has excellent compression &#8212; e.g., columnar products like Sybase IQ, Infobright, or SAND, rather than just the MPP columnar products Vertica and ParAccel.</em></p>
<p>Assuming you have affordable, scalable query performance, the competitive differentiator can switch to additional analytic functionality. Aster, Netezza, ParAccel, Vertica, and Greenplum either offer full <a href="../../../../../2011/02/24/analytic-platforms/">analytic platforms</a>, or seem to be on the path to doing so. Teradata, which now owns Aster Data, offers substantial built-in analytic capability in its traditional products as well, and the same goes for Sybase IQ.</p>
<p><em>Continued in <a href="http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/">Part 2</a>, where we cover some of the more difficult use cases.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>

