<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; IBM and DB2</title>
	<atom:link href="http://www.dbms2.com/category/products-and-vendors/ibm-and-db2/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Wed, 08 Feb 2012 17:17:32 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>Microsoft SQL Server 2012 and enterprise database choices in general</title>
		<link>http://www.dbms2.com/2012/01/24/microsoft-sql-server-2012/</link>
		<comments>http://www.dbms2.com/2012/01/24/microsoft-sql-server-2012/#comments</comments>
		<pubDate>Tue, 24 Jan 2012 14:42:34 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[Microsoft and SQL*Server]]></category>
		<category><![CDATA[Mid-range]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Oracle]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5859</guid>
		<description><![CDATA[Microsoft is launching SQL Server 2012 on March 7. An IM chat with a reporter resulted, and went something like this. Reporter: [Care to comment]? CAM: SQL Server is an adequate product if you don&#8217;t mind being locked into the Microsoft stack. For example, the ColumnStore feature is very partial, given that it can&#8217;t be [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.sqlserverlaunch.com/ww/Home">Microsoft is launching SQL Server 2012 on March 7</a>. An IM chat with a reporter resulted, and went something like this.</p>
<p><strong>Reporter: [Care to comment]?</strong><br />
<strong>CAM:</strong> SQL Server is an adequate product if you don&#8217;t mind being locked into the Microsoft stack. For example, the ColumnStore feature is very partial, given that <a href="http://msdn.microsoft.com/en-us/library/gg492088%28v=sql.110%29.aspx#Update">it can&#8217;t be updated</a>; but Oracle doesn&#8217;t have columnar storage at all.</p>
<p><strong>Reporter: Is the lock-in overall worse than IBM DB2, Oracle?</strong><br />
<strong>CAM:</strong> Microsoft locks you into an operating system, so yes.</p>
<p><strong>Reporter: Is this release something larger Oracle or IBM shops could consider as a lower-cost alternative in a co-habitation scenario, in the event they&#8217;re mulling whether to buy more Oracle or IBM licenses?</strong><br />
<strong>CAM:</strong> If they have a strong Microsoft-stack investment already, sure. Otherwise, why?</p>
<p><strong>Reporter: [How about] just cost?</strong><br />
<strong>CAM:</strong> DB2 works just as well to keep Oracle honest as SQL Server does, and without a major operating system commitment. For analytic databases you want an analytic DBMS or appliance anyway.</p>
<p>Best is to have one major vendor of OLTP/general-purpose DBMS, a web DBMS, a DBMS for disposable projects (that may be the same as one of the first two), plus however many different analytic data stores you need to get the job done.</p>
<p>By &#8220;web DBMS&#8221; I mean MySQL, NewSQL, or NoSQL. Actually, you might need more than one product in that area.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/01/24/microsoft-sql-server-2012/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Some big-vendor execution questions, and why they matter</title>
		<link>http://www.dbms2.com/2011/11/21/big-vendor-execution-analytics/</link>
		<comments>http://www.dbms2.com/2011/11/21/big-vendor-execution-analytics/#comments</comments>
		<pubDate>Mon, 21 Nov 2011 11:01:20 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Cognos]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[HP and Neoview]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[In-memory DBMS]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Memory-centric data management]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[SAP AG]]></category>
		<category><![CDATA[Vertica Systems]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5704</guid>
		<description><![CDATA[When I drafted a list of key analytics-sector issues in honor of look-ahead season, the first item was &#8220;execution of various big vendors&#8217; ambitious initiatives&#8221;.  By &#8220;execute&#8221; I mean mainly: &#8220;Deliver products that really meet customers&#8217; desires and needs.&#8221; &#8220;Successfully convince them that you&#8217;re doing so &#8230;&#8221; &#8220;&#8230; at an attractive overall cost.&#8221; Vendors mentioned [...]]]></description>
			<content:encoded><![CDATA[<p>When I drafted a list of key analytics-sector issues in honor of <a href="http://www.dbms2.com/2011/11/21/analytic-trends-in-2012-qa/">look-ahead season</a>, the first item was &#8220;execution of various big vendors&#8217; ambitious initiatives&#8221;.  By &#8220;execute&#8221; I mean mainly:</p>
<ul>
<li>&#8220;Deliver products that really meet customers&#8217; desires and needs.&#8221;</li>
<li>&#8220;Successfully convince them that you&#8217;re doing so &#8230;&#8221;</li>
<li>&#8220;&#8230; at an attractive overall cost.&#8221;</li>
</ul>
<p>Vendors mentioned here are Oracle, SAP, HP, and IBM. Anybody smaller got left out due to the length of this post. Among the bigger omissions were:</p>
<ul>
<li>salesforce.com (multiple subjects).</li>
<li><a href="../../../../../2011/04/21/sas-hpa-does-make-sense-after-all/">SAS HPA</a>.</li>
<li><a href="../../../../../2011/08/21/hadoop-evolution/">The evolution of Hadoop</a>.</li>
</ul>
<p><span id="more-5704"></span><strong>A (lingering) issue for SAP and Oracle alike</strong></p>
<p>As I noted in January of this year, <a href="../../../../../2011/01/03/the-six-useful-things-you-can-do-with-analytic-technology/">integration of business intelligence into operational apps is making very slow progress</a>. Even so, it&#8217;s a huge part of the apparent strategy at SAP and Oracle alike, as well it should be. Much of the benefit from automating routine desk work has already happened. The areas ripest for exploitation are the ones where analytics are part of the equation.</p>
<p>Given the lack of tangible progress, why do I think this is a genuine area of Oracle and SAP emphasis? Three reasons of many are:</p>
<ul>
<li>Why else did SAP buy Business Objects?</li>
<li>If they&#8217;re not trying to <a href="../../../../../2011/03/30/short-request-and-analytic-processing/">integrate operational apps and analytics</a>, why else does SAP&#8217;s emphasis on HANA make sense?</li>
<li>Without business intelligence in the picture, how does Oracle&#8217;s integrated-stack story promise any direct user benefits?*</li>
</ul>
<p><em>*As opposed to IT concerns &#8212; integration, administration, TCO (Total Cost of Ownership), etc.</em></p>
<p>After so many years of disappointment, I&#8217;m not going to forecast 2012 as a pivotal year for <strong>the integration of business intelligence into operational applications.</strong> But if one of SAP or Oracle ever does get a significant BI/operational app integration advantage over the other, it could be a major competitive advantage in those application market segments that are still up for grabs. It also is an opportunity for both vendors to gain BI market share in their respective application customer bases.</p>
<p><strong>A more urgent issue for SAP</strong></p>
<p>SAP has put huge amounts of credibility on the line for HANA, the integration of two different and not particularly mature in-memory database technologies. So far, it is difficult to find evidence that HANA is robust enough for widespread adoption. Whether or not SAP can fix that is a huge open question, which could have significant impact on the course of several technology areas: applications, business intelligence, in-memory DBMS, and maybe even hardware.</p>
<p>Based on current information, which is admittedly partial, I&#8217;m a short-term pessimist on HANA. Longer-term, I&#8217;m on record as saying that <a href="../../../../../2011/05/23/databases-ram/">traditional databases will eventually wind up in RAM</a>. SAP will surely get that technology right some day, whether or not the way it does so has anything to do with present-day HANA code.</p>
<p><strong>Four more issues for Oracle </strong></p>
<p>Oracle&#8217;s ambitions are near-endless, and so also therefore is its list of execution challenges. Four in the analytics area that I find particularly interesting are:</p>
<ul>
<li><strong>True hybrid columnar DBMS.</strong> <a href="../../../../../2011/09/22/teradata-columnar-compression/">I was guessing that Oracle, like Teradata, would announce true hybrid columnar the week of Oracle OpenWorld</a>. I was wrong. But if Oracle can&#8217;t bring out true hybrid columnar DBMS functionality relatively soon, Exadata will lose credibility as a competitor to more specialized analytic DBMS.</li>
<li><strong>Oracle Exalytics.</strong> With Exalytics in the mix, Oracle&#8217;s technology stack has HANA-like potential. But will Exalytics even ship in 2012? (I think so.) Will it be good for much in the first release? (I&#8217;m skeptical.)</li>
<li><strong>Oracle&#8217;s Big Data Appliance</strong>. I&#8217;m skeptical both about <a href="../../../../../2011/10/20/more-notes-on-oracle-nosql/">Oracle&#8217;s NoSQL product</a> &#8212; <a href="http://www.infoworld.com/d/data-explosion/first-look-oracle-nosql-database-179107">a favorable InfoWorld review</a> notwithstanding &#8212; and <a href="../../../../../2011/09/23/hadoop-appliances/">Hadoop appliances</a>. But if I&#8217;m wrong, and Oracle can successfully embrace/extend the new non-relational paradigms, then it really might regain control over the evolution of data management.</li>
<li><strong><a href="../../../../../2011/10/18/oracle-is-buying-endeca/">Oracle&#8217;s Endeca acquisition</a></strong> &#8212; will Oracle prove me wrong and integrate Endeca effectively into its overall analytic product line? If it does, we might finally see effective text (and eventually speech) navigation of enterprise software. (But as with all Oracle issues cited here, this is something that probably won&#8217;t amount to much in 2012 even if it does later go well.)</li>
</ul>
<p><strong>Three issues for IBM</strong></p>
<p>Like Oracle, IBM is a huge company with many ambitions and hence many execution challenges. The biggest of those is surely: <strong>How effective can IBM be at selling outside its existing customer base?</strong> I don&#8217;t hear as much competitively about IBM DataStage, IBM SPSS or now IBM Netezza as I did when their vendors were independent companies. Even Cognos may not be much of an exception to the rule, although it has its own large customer base outside of IBM&#8217;s traditional one. (To lesser extents, the same is of course true of Netezza and numerous other IBM acquisitions.)</p>
<p>Another general issue for IBM is <strong>substantively integrating its various product lines,</strong> at least to the extent that makes sense. DB2/Netezza integration sounds good, but even that is more a matter of product marketing (the admirable part of that discipline) than of actual technology. Other integrations (e.g. Cognos/DB2 in various bundles) have tended toward the dubious side.*</p>
<p><em>*I&#8217;m still waiting for IBM to get back to me with examples of how Cognos/DB2 joint tuning amounts to anything. It&#8217;s been more than a year, so I&#8217;m glad I didn&#8217;t hold my breath.</em></p>
<p>In a somewhat narrower vein, I wonder: <strong><a href="../../../../../2011/11/10/cep-streaming-catchup/">Will IBM be able to gain traction for InfoSphere Streams</a>? </strong>And if so, when and where will the traction be?</p>
<p><strong>Will HP screw up Vertica?</strong></p>
<p>Vertica has a very attractive product offering. It&#8217;s perhaps <a href="../../../../../2011/06/20/columnar-dbms-vendor-customer-metrics/">the most scalable analytic DBMS outside of Teradata</a>, running on the hardware of your reasonable choice.  It&#8217;s also the one I recommend most often to clients in the 1-50 terabyte range.</p>
<p>So far HP doesn&#8217;t seem to have done much to leadfoot Vertica. (About all I&#8217;ve heard from competitors is that Vertica seems to have faded somewhat in the financial services market, and there could be multiple explanations if that is indeed true.) But if HP Vertica does somehow manage to botch things, opportunities will open up for a range of columnar analytic DBMS competitors.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/21/big-vendor-execution-analytics/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Very brief CEP/streaming catchup</title>
		<link>http://www.dbms2.com/2011/11/10/cep-streaming-catchup/</link>
		<comments>http://www.dbms2.com/2011/11/10/cep-streaming-catchup/#comments</comments>
		<pubDate>Fri, 11 Nov 2011 03:29:37 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Complex event processing (CEP)]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[StreamBase]]></category>
		<category><![CDATA[Sybase]]></category>
		<category><![CDATA[Truviso]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5632</guid>
		<description><![CDATA[When I agreed to launch the StreamBase LiveView product via DBMS 2, I planned to catch up on the whole CEP/streaming area first. Due to the power and internet outages last week, that didn&#8217;t entirely happen. So I&#8217;ll do a bit of that now, albeit more cryptically than I hoped and intended. The upshot of [...]]]></description>
			<content:encoded><![CDATA[<p>When I agreed to launch the StreamBase LiveView product via <em>DBMS 2,</em> I planned to catch up on the whole CEP/streaming area first. Due to the power and internet outages last week, that didn&#8217;t entirely happen. So I&#8217;ll do a bit of that now, albeit more cryptically than I hoped and intended.</p>
<ul>
<li>The upshot of my <a href="../../../../../2011/08/25/renaming-cep-or-not/">what to call CEP thread</a> in August was that &#8220;streaming&#8221; and &#8220;event processing&#8221; are not the same concept, but it so happens that they have the most traction where they intersect. That said, I both observe and endorse an apparent shift from &#8220;event&#8221; to &#8220;stream&#8221; as the core of the terminology, in <a href="../../../../../2008/03/19/what-to-call-cep/">a reversal of my opinion of several years ago</a>.</li>
<li>IBM continues to throw a lot of resources at its <a href="../../../../../2009/05/13/ibm-system-s-infosphere-streams-processing/">System S/InfoSphere Streams</a> product, but I haven&#8217;t heard yet of much marketplace success. That said, I believe IBM is still pretty serious about Streams, as one would expect from an effort whose code name so cheekily references <a href="http://www.softwarememories.com/2008/10/02/a-bit-of-db2-history-per-ibm/">System R</a>. In particular, Streams shows up prominently on IBM&#8217;s top-level analytic architecture slide.</li>
<li>Sybase recently released its ESP (Event Stream Processor) 5.0, which it says is the full merger of the Aleri and Coral8 predecessors. You can still get Sybase ESP without buying into the full <a href="../../../../../2010/02/05/sybase-aleri-rap/">Sybase RAP</a> stack, and Sybase has no plans to change that.</li>
<li>Sybase has discontinued all <a href="../../../../../2009/03/25/aleri-update/">the business intelligence types of products Aleri and Coral8 were developing</a>. Rather, Sybase is OEMing Panopticon, which it reports has been well received. Other than the discontinuation of the BI efforts, there seem to be few Aleri or Coral8 features missing from the merged Sybase ESP product.</li>
<li>Truviso continues to be <a href="../../../../../2010/05/04/truviso-evidently-reinvents-itself/">out of the picture</a>.</li>
<li>I have more to say about <a href="http://www.dbms2.com/2011/11/10/streambase-catchup/">StreamBase</a> separately.</li>
<li>I have more to say about Sybase and IBM, which I&#8217;ll get to when I can.</li>
<li>I have nothing new on Progress Apama. I also know little about any of the open source efforts.</li>
</ul>
<p>Meanwhile, if you want to see technically nitty-gritty posts about the CEP/streaming area, you may want to look at <a href="../../../../../category/memory-centric-data-management/event-stream-processing/page/4/">my CEP/streaming coverage circa 2007-9</a>, based on conversations with (among others) <a href="../../../../../2007/06/18/mike-stonebraker-on-financial-stream-processing/">Mike Stonebraker</a>, <a href="../../../../../2007/08/03/a-deeper-dive-into-apama/">John Bates</a>, and <a href="../../../../../2007/08/10/the-essence-of-cep-according-to-coral8/">Mark Tsimelzon</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/10/cep-streaming-catchup/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Transparent relational OLTP scale-out</title>
		<link>http://www.dbms2.com/2011/10/23/transparent-relational-oltp-scale-out/</link>
		<comments>http://www.dbms2.com/2011/10/23/transparent-relational-oltp-scale-out/#comments</comments>
		<pubDate>Mon, 24 Oct 2011 04:19:09 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Clustering]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Schooner Information Technology]]></category>
		<category><![CDATA[dbShards and CodeFutures]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5521</guid>
		<description><![CDATA[There’s a perception that, if you want (relatively) worry-free database scale-out, you need a non-relational/NoSQL strategy. That perception is false. In the analytic case it’s completely ridiculous, as has been demonstrated by Teradata, Vertica, Netezza, and various other MPP (Massively Parallel Processing) analytic DBMS vendors. And now it’s false for short-request/OLTP (OnLine Transaction Processing) use [...]]]></description>
			<content:encoded><![CDATA[<p>There’s a perception that, if you want (relatively) worry-free database scale-out, you need a non-relational/NoSQL strategy. That perception is false. In the analytic case it’s completely ridiculous, as has been demonstrated by <a href="../../../../../2011/09/24/confusion-about-teradatas-big-customers/">Teradata</a>, <a href="../../../../../2011/06/20/columnar-dbms-vendor-customer-metrics/">Vertica</a>, Netezza, and various other MPP (Massively Parallel Processing) analytic DBMS vendors. And now it’s false for <a href="../../../../../2011/03/02/short-request-processing/">short-request</a>/OLTP (OnLine Transaction Processing) use cases as well.</p>
<p>My favorite relational OLTP scale-out choice these days is <a href="http://www.dbms2.com/2011/10/23/schooner-pivots-further/">the SchoonerSQL/dbShards partnership</a>. Schooner Information Technology (SchoonerSQL) and Code Futures (dbShards) are young, small companies, but I’m not too concerned about that, because the APIs they want you to write to are just MySQL’s. The main scenarios in which I can see them failing are ones in which they are competitively leapfrogged, either by other small competitors (e.g. ScaleBase, Akiban, TokuDB, or ScaleDB) or by Oracle/MySQL itself. While that could suck for my clients Schooner and Code Futures, it would still provide users relying on MySQL scale-out with one or more good product alternatives.</p>
<p>Relying on non-MySQL NewSQL startups, by way of contrast, would leave me somewhat more concerned. (However, if their code is open sourced, you have at least some vendor-failure protection.) And big-vendor scale-out offerings, such as Oracle RAC or <a href="../../../../../2011/05/06/db2-oltp-scale-out-purescale/">DB2 pureScale</a>, may be more complex to deploy and administer than the MySQL and NewSQL alternatives.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/10/23/transparent-relational-oltp-scale-out/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>IBM is buying parallelization expert Platform Computing</title>
		<link>http://www.dbms2.com/2011/10/11/ibm-is-buying-parallelization-expert-platform-computing/</link>
		<comments>http://www.dbms2.com/2011/10/11/ibm-is-buying-parallelization-expert-platform-computing/#comments</comments>
		<pubDate>Tue, 11 Oct 2011 16:13:05 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Scientific research]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5473</guid>
		<description><![CDATA[IBM is acquiring Platform Computing, a company with which I had one briefing, last August. Quick background includes:  Platform Computing started ~20 years ago. Platform Computing claimed close to $100 million in revenue and &#62;500 people. (This is Platform Computing&#8217;s most famous splash to date.) Platform Computing technology underlies SAS Institute&#8217;s preferred method of parallelization, [...]]]></description>
			<content:encoded><![CDATA[<p>IBM is acquiring Platform Computing, a company with which I had one briefing, last August. Quick background includes:  <span id="more-5473"></span></p>
<ul>
<li>Platform Computing started ~20 years ago.</li>
<li>Platform Computing claimed close to $100 million in revenue and &gt;500 people.</li>
<li><strong>(This is Platform Computing&#8217;s most famous splash to date.)</strong> Platform Computing technology underlies SAS Institute&#8217;s preferred method of parallelization, which may variously be called:
<ul>
<li>SAS Grid Manager (the more or less official brand name).</li>
<li><a href="../../../../../2011/04/21/sas-hpa-does-make-sense-after-all/">SAS HPA</a> (High Performance Analytics), sort of an alternate brand name.</li>
<li>MPI (Message Passing Interface), the industry&#8217;s name for the underlying semantics/syntax/API.</li>
</ul>
</li>
<li>Platform Computing&#8217;s original business was scientific grid computing.</li>
<li>Platform Computing&#8217;s second major business was its &#8220;Symphony&#8221; product line. According to Platform Computing, Symphony:
<ul>
<li>Debuted 6-7 years ago.</li>
<li>Is more commercially oriented.</li>
<li>Is what supports SAS HPA.</li>
<li>SAS aside, has been sold to Wall Street and so on.</li>
<li>Is sometimes used in conjunction with <a href="../../../../../2011/08/25/renaming-cep-or-not/">CEP/streaming</a>, mainly for backtesting.</li>
<li>Can be used to build global (parallel) persistent memory for R.</li>
</ul>
</li>
<li><strong>(This is probably why IBM is buying Platform Computing.)</strong> Platform Computing has a new MapReduce offering that:
<ul>
<li>Is based on Symphony.</li>
<li>Shipped last July, except that early access was a couple months before that.</li>
<li>Is focused on:
<ul>
<li>Lowering the latency of MapReduce.</li>
<li>Consolidating multiple MapReduce use cases into one high(er)-utilization cluster.</li>
<li>Offering workload management in support of those goals.</li>
<li>Reliability, availability, predictability, puppies, kittens, and apple pie.</li>
</ul>
</li>
</ul>
</li>
<li>That MapReduce offering is most specifically a MapReduce run-time engine, with other stuff beyond that.</li>
</ul>
<p>Unfortunately, I&#8217;m not clear exactly how tied this offering is to Hadoop, although using it with Hadoop is at least the base case. Platform Computing did say:</p>
<ul>
<li>It can support multiple virtual Hadoop clusters, which can be grown or shrunk at will.</li>
<li>Non-Hadoop workloads can be mixed in.</li>
</ul>
<p>Platform Computing said that key technical benefits of this offering included:</p>
<ul>
<li><strong>1-3 seconds to start a job, vs. 40-50 in generic Hadoop.</strong></li>
<li>Automatic recovery of JobTracker nodes.</li>
<li>Failover for NameNodes.</li>
<li>Workload management that:
<ul>
<li>Manages all of CPU, I/O, and RAM (this is quickly becoming an industry standard level of capability, although I&#8217;m judging more by the standards of the analytic DBMS world).</li>
<li>Monitors but doesn&#8217;t actively manage network resources.</li>
<li>Can reprioritize jobs that are in flight. (Also an industry-standard capability.)</li>
</ul>
</li>
</ul>
<p>This conflation of scientific, commercial analytic, streaming, and MapReduce is right in IBM&#8217;s philosophical wheelhouse. I base that comment on, among other factors:</p>
<ul>
<li>How IBM positions &#8220;Big Insights&#8221;.</li>
<li>IBM&#8217;s &#8220;smart consolidation&#8221; picture/pitch (which I really should get around to posting).</li>
<li>The fuss IBM makes about Watson, Blue Gene, and so on.</li>
</ul>
<p>The IBM acquisition probably obviates a lot of Platform Computing&#8217;s previous business comments, but at the time they included:</p>
<ul>
<li>POCs (Proofs of Concept):
<ul>
<li>Mainly in financial services, government, and telecom.</li>
<li>At both existing customers and new prospects.</li>
<li>Typically running 30-50 nodes, 2-50 terabytes.* The smallest databases evidently tended to be at financial services firms.</li>
</ul>
</li>
<li>Pricing that was starting out:
<ul>
<li>Perpetual license: $3450/server, 21% annual maintenance after the first year.</li>
<li>Subscription: $2070/server annually, or $3070 with HDFS support bundled in.</li>
</ul>
</li>
</ul>
<p><em><strong>*1 terabyte or less per node</strong> is probably the lowest data-per-node figure I&#8217;ve heard for anything Hadoop-like &#8212; even below Hadapt, and well below what <a href="../../../../../2011/07/06/hadoop-hardware-and-compression/">Cloudera and Hortonworks</a> usually see.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/10/11/ibm-is-buying-parallelization-expert-platform-computing/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>The Ted Codd guarantee</title>
		<link>http://www.dbms2.com/2011/07/31/the-ted-codd-guarantee/</link>
		<comments>http://www.dbms2.com/2011/07/31/the-ted-codd-guarantee/#comments</comments>
		<pubDate>Sun, 31 Jul 2011 22:44:21 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[MOLAP]]></category>
		<category><![CDATA[NoSQL]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5044</guid>
		<description><![CDATA[I write a lot about whether or not to use relational DBMS. For example: In May I surveyed relational vs. non-relational pros and cons at some length. Last November I mused about when it might be OK to do without joins. The question is implicit in a variety of posts about, say, document-oriented or object-oriented [...]]]></description>
			<content:encoded><![CDATA[<p>I write a lot about whether or not to use relational DBMS. For example:</p>
<ul>
<li>In May I surveyed <a href="../../../../../2011/05/29/when-to-use-relational-database-management-system/">relational vs. non-relational pros and cons</a> at some length.</li>
<li>Last November I mused about <a href="../../../../../2010/11/29/document-database-without-joins/">when it might be OK to do without joins</a>.</li>
<li>The question is implicit in a variety of posts about, say, <a href="../../../../../2011/02/07/notes-on-document-oriented-nosql/">document-oriented</a> or <a href="../../../../../2011/05/21/object-oriented-database-management-systems-oodbms/">object-oriented</a> DBMS.</li>
</ul>
<p>Before going further in that vein, I&#8217;d like to do a quick review of what E. F. &#8220;Ted&#8221; Codd was getting at with the relational model in the first place.  <span id="more-5044"></span></p>
<p>The first sentence of Codd&#8217;s famous 1970 <a href="http://www.seas.upenn.edu/%7Ezives/03f/cis550/codd.pdf">paper introducing the relational database concept</a> reads:</p>
<blockquote><p>Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation).</p></blockquote>
<p>In modern terms, that means <strong>&#8220;all you have to know to use the database is its logical schema; you don&#8217;t need to know anything about its physical representation.&#8221;</strong></p>
<p>Over the next 15 years, Codd&#8217;s thinking &#8212; and his employer IBM&#8217;s technology &#8212; evolved to the point that Codd proposed <a href="http://www.cse.ohio-state.edu/%7Esgomori/570/coddsrules.html">12 rules for a relational DBMS</a>, the three most fundamental of which are:</p>
<blockquote><p><em><strong>Foundation Rule</strong><br />
</em>A relational database management system must manage its stored data using only its relational capabilities.</p>
<p><em><strong>Information Rule</strong><br />
</em>All information in the database should be represented in one and only one way &#8212; as values in a table.</p>
<p><em><strong>Guaranteed Access Rule</strong><br />
</em>Each and every datum (atomic value) is guaranteed to be logically accessible by resorting to a combination of table name, primary key value and column name.</p></blockquote>
<p>I.e., Codd was positively asserting that <strong>a database should have a fixed logical schema,</strong> in a <strong>tabular form.</strong> The clear implication was that programmers could or should be able to write anything they wanted to against that schema, without database performance being unduly compromised.</p>
<p>Of course, things never quite worked out that way. For most of the history of tabular DBMS, the best-performing <a href="http://www.dbms2.com/2011/03/30/short-request-and-analytic-processing/">short-request and analytic DBMS</a> have been designed quite differently from each other.* Non-relational systems &#8212; from IBM&#8217;s own IMS to various object-oriented DBMS &#8212; outperformed relational DBMS on particular applications. Designers of high-performance applications were sensitive to the database&#8217;s physical design, sometimes even going to the extreme of <a href="../../../../../2011/02/24/transparent-sharding/">non-transparent sharding</a>. But on the whole, it was generally agreed that programming against a fixed logical schema is a good thing.</p>
<p><em>*Codd acknowledged this himself by promoting multidimensional OLAP over traditional RDBMS. (I regard the multidimensional/relational divide as a distinction without significant difference; it&#8217;s all just fixed-logical-schema tabular processing with different data manipulation languages.)</em></p>
<p>In my next post, I&#8217;ll return to the subject of <a href="http://www.dbms2.com/2011/07/31/dynamic-fixed-schema-databases/">why fixed schemas might not always be such a good idea after all</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/31/the-ted-codd-guarantee/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Cloudera and Hortonworks</title>
		<link>http://www.dbms2.com/2011/07/10/cloudera-and-hortonworks/</link>
		<comments>http://www.dbms2.com/2011/07/10/cloudera-and-hortonworks/#comments</comments>
		<pubDate>Mon, 11 Jul 2011 03:13:36 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Hortonworks]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4939</guid>
		<description><![CDATA[My clients at Cloudera have been around for a while, in effect positioned as &#8220;the Hadoop company.&#8221; Their business, in a nutshell, consists of: Packaging up a Cloudera distribution of Apache Hadoop. This distribution doesn&#8217;t have proprietary code; it&#8217;s just packaged by Cloudera from Apache projects (with a decent minority of the code happening to [...]]]></description>
			<content:encoded><![CDATA[<p>My clients at Cloudera have been around for a while, in effect positioned as &#8220;the Hadoop company.&#8221; Their business, in a nutshell, consists of:</p>
<ul>
<li>Packaging up <strong>a Cloudera distribution of Apache Hadoop.</strong> This distribution doesn&#8217;t have proprietary code; it&#8217;s just packaged by Cloudera from Apache projects (with a decent minority of the code happening to have been contributed by Cloudera engineers).</li>
<li>Paid subscription <strong>support for Apache Hadoop</strong> and, in connection with that &#8230;</li>
<li>&#8230;  <strong>proprietary software</strong> that all support customers automatically get. There are two points to this proprietary software:
<ul>
<li>It adds value for the customer.</li>
<li>It makes Cloudera&#8217;s support job easier.</li>
</ul>
</li>
<li><strong>Professional services</strong> around Hadoop.</li>
<li><strong>Training and conferences</strong> around Hadoop, which probably don&#8217;t generate all that much money, but are great marketing in terms of visibility, thought leadership, and lead generation.</li>
</ul>
<p><strong>Hortonworks</strong> spun out of Yahoo last week, with parts of the Cloudera business model, namely Hadoop support, training, and I guess conferences. Hortonworks emphatically rules out professional services, and says that it will contribute all code back to Apache Hadoop. Hortonworks does grudgingly admit that it might get into the proprietary software business at some point &#8212; but evidently hopes that day will never actually come.</p>
<p><span id="more-4939"></span>Hortonworks&#8217; two main initial marketing messages &#8212; and there&#8217;s some synergy between these &#8212; boil down to:</p>
<ul>
<li>Open source purism</li>
<li>&#8220;We have most of the Hadoop developers, so we&#8217;re better&#8221;*</li>
</ul>
<p>Frankly, the open source purism part sounds like doubletalk to me, in that Hortonworks has trouble articulating what supposedly-less-pure Cloudera does wrong that Hortonworks will do better. However, I&#8217;ve been hearing for a long time that Yahoo&#8217;s MapReduce developers feel very strongly about open source, so perhaps this is in part an emotional issue for them. More substantively, it fits well with the pro-Hortonworks story I&#8217;ve outlined below.</p>
<p><em>*&#8220;We have most of the Hadoop developers&#8221; seems fairly defensible, give or take dueling definitions of &#8220;committer,&#8221; &#8220;core developer,&#8221; &#8220;patch&#8221; or for that matter &#8220;Hadoop.&#8221;</em></p>
<p>The other branch of the Hortonworks marketing message can be lampooned as &#8220;We&#8217;re the right folks to identify your bugs, since we&#8217;re probably the ones who put them there in the first place.&#8221; More darkly, that pitch could be &#8220;If you want the bugs fixed that bother you, we&#8217;re the ones who have control over whether or not that happens.&#8221; Well, maybe. But I also see Cloudera having a couple years&#8217; experience supporting Hadoop, as well as shipping some code that perhaps makes Hadoop more supportable.</p>
<p>That&#8217;s the skeptical view. <strong>A more favorable view of Hortonworks&#8217; prospects </strong>would go something like this:</p>
<ul>
<li>One version of Apache Hadoop is plenty.</li>
<li>Cloudera (and arguably other Hadoop platform software vendors) sell capabilities that will soon be eclipsed by core Apache Hadoop. Folks should just please wait.</li>
<li>Now that Hortonworks is an independent company focused on the task, it will speedily solve the packaging problems that have made Cloudera&#8217;s Hadoop distribution (perceived to be) necessary.</li>
<li>Yahoo and IBM both back Hortonworks&#8217; approach. That&#8217;s got to count for something.</li>
<li>Apache Hadoop will be quickly enhanced, and Hortonworks will be driving the enhancements. Hortonworks simply is the top Hadoop authority.</li>
</ul>
<p>We&#8217;ll see. Cloudera&#8217;s been around for a couple years, has smart people, and by definition has no technical inferiority to Hortonworks (since it has access to all Hortonworks&#8217; code). What&#8217;s more, it will be a long time before Hadoop technology is so mature that there&#8217;s nothing left to do; add-on software should long prove to be useful. As for &#8220;We&#8217;re purer about open source than the other guys&#8221; &#8212; well, I&#8217;m dubious that that will turn out to be a great marketing message.</p>
<p>And so I think Cloudera is the early favorite in the competition. But perhaps Hadoop users will be able to play Cloudera and Hortonworks off  against each other in price negotiations. Perhaps, notwithstanding <a href="../../../../../2011/06/02/why-you-would-want-an-appliance-and-when-you-wouldnt/">my skepticism about Hadoop appliances</a>, some hardware vendors will play them against each other for appliance partnerships.</p>
<p>Meanwhile, whatever else happens, I&#8217;m pretty psyched about <a href="http://www.dbms2.com/2011/07/10/hadoop-futures-and-enhancements/">some enhancements the Hortonworks folks plan to lead for Hadoop</a>.</p>
<p><strong><em>Related links</em></strong></p>
<ul>
<li>A <a href="http://www.monash.com/uploads/Hortonworks-Apache-Hadoop-July-2011.pptx">Hortonworks/Apache Hadoop slide deck</a> Hortonworks graciously allowed me to post</li>
<li>Cloudera&#8217;s post about its recent <a href="http://www.cloudera.com/blog/2011/07/the-only-full-lifecycle-management-for-apache-hadoop-introducing-cloudera-enterprise-3-5-and-scm-express/">3.5 release of Cloudera Enterprise</a></li>
<li>Pros and cons of <a href="http://www.softwarememories.com/2011/07/10/when-professional-services-and-software-mix/">professional services efforts at young software companies</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/10/cloudera-and-hortonworks/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Eight kinds of analytic database (Part 1)</title>
		<link>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/</link>
		<comments>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/#comments</comments>
		<pubDate>Tue, 05 Jul 2011 08:17:44 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Benchmarks and POCs]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Buying processes]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[Infobright]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MOLAP]]></category>
		<category><![CDATA[Microsoft and SQL*Server]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[ParAccel]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Pricing]]></category>
		<category><![CDATA[QlikTech and QlikView]]></category>
		<category><![CDATA[SAND Technology]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Sybase]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[Workload management]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4868</guid>
		<description><![CDATA[Analytic data management technology has blossomed, leading to many questions along the lines of &#8220;So which products should I use for which category of problem?&#8221; The old EDW/data mart dichotomy is hopelessly outdated for that purpose, and adding a third category for &#8220;big data&#8221; is little help. Let&#8217;s try eight categories instead. While no categorization [...]]]></description>
			<content:encoded><![CDATA[<p>Analytic data management technology has blossomed, leading to many questions along the lines of &#8220;So which products should I use for which category of problem?&#8221; The old EDW/data mart dichotomy is hopelessly outdated for that purpose, and adding a third category for &#8220;big data&#8221; is little help.</p>
<p>Let&#8217;s try eight categories instead. While <a href="http://www.strategicmessaging.com/no-market-categorization-is-ever-precise/2011/03/01/">no categorization is ever perfect</a>, these each have at least some degree of technical homogeneity. Figuring out which types of analytic database you have or need &#8212; and in most cases you&#8217;ll need several &#8212; is a great early step in your analytic technology planning.  <span id="more-4868"></span></p>
<p><strong><em>Enterprise data warehouse</em></strong> (Full or partial)</p>
<ul>
<li><em>Kinds of data likely to be included:</em> All, but especially operational</li>
<li><em>Likely use styles:</em> All</li>
<li><em>Canonical example:</em> Central EDW for a big enterprise</li>
<li><em>Stresses:</em> Concurrency, reliability, workload management</li>
</ul>
<p>The enterprise data warehouse (EDW) ideal says that you copy all your data into one place, and drive all decision-making from there. <a href="../../../../../2011/06/21/its-official-the-grand-central-edw-will-never-happen/">Full EDWs are pipedreams</a>. Still, a partial EDW makes sense for most large enterprises, and many indeed already have one. The first product lines to consider for classical EDWs are Teradata, DB2, Exadata, and maybe Microsoft SQL Server, especially if you&#8217;re going to stress concurrency and/or operational use cases.</p>
<p><strong><em>Traditional data mart</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All</li>
<li><em>Likely use styles:</em> Business intelligence, budgeting/consolidation, investigative</li>
<li><em>Examples:</em> Reporting servers, planning/consolidation servers, anything MOLAP, etc.</li>
<li><em>Stresses:</em> Performance, concurrency, TCO</li>
</ul>
<p>Whether or not you have something like an enterprise data warehouse, it&#8217;s common to have lighter-weight data marts as well. A traditional data mart might drive reports and dashboards. Or it might be specialized for budgeting, planning, and/or consolidation.  Some <a href="../../../../../2011/03/03/investigative-analytics/">investigative analytics</a> may be in the mix as well.</p>
<p>Any DBMS that can support an EDW can also support a data mart, but it may not be the most cost-effective way to do so. Columnar DBMS might have more attractive performance and TCO (Total Cost of Ownership); the same goes for Netezza. Some of them &#8212; e.g. Sybase IQ and <a href="../../../../../2011/06/20/vertica-release-5/">Vertica</a> &#8212; have excellent track records in concurrent usage as well. <a href="../../../../../2011/05/29/when-to-use-relational-database-management-system/">Ted Codd</a> pushed what amounts to MOLAP (Multidimensional OnLine Analytic Processing) systems for these use cases. But relational DBMS commonly do a better job, which is one reason most major MOLAP products have wound up at RDBMS companies.</p>
<p><strong><em>Investigative data mart &#8212; agile</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All, especially customer-centric</li>
<li><em>Likely use styles</em>: Investigative</li>
<li><em>Canonical example:</em> A few analysts getting a few TB to examine</li>
<li><em>Stresses:</em> Ease of setup/load, ease of admin, price/performance</li>
</ul>
<p>Besides the traditional data mart, there are at least two other kinds. Both are focused on investigative analytics, but they&#8217;re differentiated by database size.</p>
<p>If you have just a few analysts,* looking at no more than a few terabytes of data (perhaps even just some gigabytes) &#8212; and if that data is &#8220;single-subject&#8221; and fairly homogeneous &#8212; your watchwords should be &#8220;cheap&#8221;, &#8220;easy&#8221;, and &#8220;fast&#8221;. You don&#8217;t need to invest in much hardware, in expensive software, or in much administrative effort (the analysts can be their own DBAs), nor should you endure much set-up time. Just grab a product, grab some data, and start running queries (or extracts into the statistical tool of your choice).</p>
<p><em>*If you have dozens or even hundreds of analysts hitting the same database, you&#8217;re probably back to the more concurrency-oriented scenarios outlined above.</em></p>
<p>Infobright is often cost-effective among columnar analytic DBMS. Other vendors might cut you a price break as well. If you have multiple terabytes of data, don&#8217;t rule out Netezza&#8217;s lowest-end products (even if they&#8217;d really rather sell you something bigger). Or, if you&#8217;re in the sub-terabyte range, maybe you can get by with an in-memory BI tool such as QlikView, and not do anything special on the DBMS side at all.</p>
<p><strong><em>Investigative data mart &#8212; big</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All, especially customer-centric, logs, financial trade, scientific</li>
<li><em>Likely use styles</em>: Investigative</li>
<li><em>Canonical example:</em> Single-subject 20 TB &#8211; 20 PB relational database</li>
<li><em>Stresses:</em> Performance, scale-out, analytic functionality</li>
</ul>
<p>But if you&#8217;re looking at tens of terabytes of relational data, or even more, you really do have a &#8220;big data&#8221; problem. Performance and scalability are major challenges, usually best addressed by MPP (Massively Parallel Processing) systems, such as Netezza, Vertica, Aster Data, ParAccel, Teradata, or Greenplum. Performance POCs (Proofs Of Concept) are a big part of the buying process. Vendor price negotiations are crucial too.</p>
<p><em>Actually, in the low tens of terabytes you might be able to get away with a shared-disk system that has excellent compression &#8212; e.g., columnar products like Sybase IQ, Infobright, or SAND, rather than just Vertica and ParAccel.</em></p>
<p>Assuming you have affordable, scalable query performance, the competitive differentiator can switch to additional analytic functionality. Aster, Netezza, ParAccel, Vertica, and Greenplum either offer full <a href="../../../../../2011/02/24/analytic-platforms/">analytic platforms</a>, or seem to be on the path to doing so. Teradata, which now owns Aster Data, offers substantial built-in analytic capability in its traditional products as well, and the same goes for Sybase IQ.</p>
<p><em>Continued in <a href="http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/">Part 2</a>,</em><em> where we cover some of the more difficult use cases.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>It&#8217;s official &#8212; the grand central EDW will never happen</title>
		<link>http://www.dbms2.com/2011/06/21/its-official-the-grand-central-edw-will-never-happen/</link>
		<comments>http://www.dbms2.com/2011/06/21/its-official-the-grand-central-edw-will-never-happen/#comments</comments>
		<pubDate>Wed, 22 Jun 2011 01:54:46 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[Netezza]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4804</guid>
		<description><![CDATA[I pointed out last year that the grand central enterprise data warehouse couldn&#8217;t happen; the post started: An enterprise data warehouse should: Manage data to high standards of accuracy, consistency, cleanliness, clarity, and security. Manage all the data in your organization. Pick ONE. IBM&#8217;s main theme at the Enzee Universe conference has been to say [...]]]></description>
			<content:encoded><![CDATA[<p>I pointed out last year that <a href="http://www.dbms2.com/2010/04/12/enterprise-data-warehouse-edw-myt/">the grand central enterprise data warehouse couldn&#8217;t happen</a>; the post started:</p>
<blockquote><p>An <strong>enterprise data warehouse</strong> should:</p>
<ul>
<li>Manage data to high standards of <strong>accuracy, consistency, cleanliness, clarity, and security.</strong></li>
<li>Manage <strong>all the data in your organization.</strong></li>
</ul>
<p><strong>Pick ONE.</strong></p></blockquote>
<p>IBM&#8217;s main theme at the Enzee Universe conference has been to say the same thing.</p>
<p>Merv Adrian&#8217;s talk at the same conference made it clear that Gartner feels the same way, as does he personally. Indeed, like me, he&#8217;s racked up multiple decades of industry experience without ever finding a single theoretically ideal grand central EDW.</p>
<p>Forrester Research has been a little less clear on the point, but generally seems to be on the correct side of the issue as well.</p>
<p>If somebody is still saying that one central enterprise data warehouse can hold all the information or data you need on which to base your business decisions, they&#8217;re probably not somebody you should be listening to very hard.</p>
<p>Is that clear, or should I hammer home the point even harder? <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_biggrin.gif' alt=':D' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/06/21/its-official-the-grand-central-edw-will-never-happen/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>What to do about &#8220;unstructured data&#8221;</title>
		<link>http://www.dbms2.com/2011/05/15/what-to-do-about-unstructured-data/</link>
		<comments>http://www.dbms2.com/2011/05/15/what-to-do-about-unstructured-data/#comments</comments>
		<pubDate>Sun, 15 May 2011 21:54:30 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Couchbase]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[MongoDB and 10gen]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4449</guid>
		<description><![CDATA[We hear much these days about unstructured or semi-structured (as opposed to structured) data. Those are misnomers, however, for at least two reasons. First, it&#8217;s not really the data that people think is un-, semi-, or fully structured; it&#8217;s databases.* Relational databases are highly structured, but the data within them is unstructured &#8212; just lists [...]]]></description>
			<content:encoded><![CDATA[<p>We hear much these days about <em>unstructured</em> or <em>semi-structured</em> (as opposed to <em>structured</em>) <em>data.</em> Those are misnomers, however, for at least two reasons. First, <strong>it&#8217;s not really the data that people think is un-, semi-, or fully structured; it&#8217;s databases.</strong>* Relational databases are highly structured, but the data within them is unstructured &#8212; just lists of numbers or character strings, whose only significance derives from the structure that the database imposes.</p>
<p><em>*Here I&#8217;m using the term &#8220;database&#8221; literally, rather than as a concise synonym for &#8220;database management system&#8221;. But see below.</em></p>
<p>Second, a more accurate distinction is<strong> not whether a database has one structure or none </strong>&#8211; it&#8217;s<strong> whether a database has one structure or many.</strong> The easiest way to see this is for databases that have clearly-defined schemas. A relational database has one schema (even if it is just the union of various unrelated sub-schemas); an XML database, however, can have as many schemas as it contains documents.</p>
<p>One small terminological problem is easily handled, namely that people don&#8217;t talk about true databases very often, at least when they&#8217;re discussing generalities; rather, they talk about data and DBMS.* So let&#8217;s talk of DBMS being &#8220;structured&#8221; singly or multiply or whatever, just as the databases they&#8217;re designed to manage are.</p>
<p><em>*And they refer to the DBMS as &#8220;databases,&#8221; because they don&#8217;t have much other use for the word. </em></p>
<p>All that said &#8212; I think that <strong>single vs. multiple database structures isn&#8217;t a bright-line binary distinction</strong>; rather, it&#8217;s a <strong>spectrum.</strong> For example:  <span id="more-4449"></span></p>
<ul>
<li>IMS is the most structured DBMS of all. The data in an IMS database is in a hierarchy, and that&#8217;s that.</li>
<li>CODASYL and other kinds of what used to be called <em>network</em> DBMS (before the word got so overloaded) &#8212; e.g. RDB, IDMS, or TOTAL &#8212; are/were almost as structured as IMS.</li>
<li>Relational databases were invented because their structure was more flexible than that of linked-list databases. The whole point of relational DBMS is that you can view the data in a multitude of ways. Still, I see classical relational databases as being toward the single-structure end of the spectrum. (I say &#8220;classical&#8221; because Oracle and DB2 actually can manage combinations of XML, text, and traditional relational tables, if you choose.)</li>
<li>A multivalue DBMS is a little more multi-structured than a relational one, because of how a field can be filled one or multiple times.</li>
<li><a href="../../../../../2010/10/06/ebay-followup-greenplum-out-teradata-10-petabytes-hadoop-has-some-value-and-more/">eBay Singularity</a> (as implemented on Teradata gear) has, in essence, two structures (that I know of). One structure is just the relational schema. The other is the structure you would get if each kind of name-value pair truly had its own column.</li>
<li>A Splunk collection of log data can reasonably be said to have a different structure for every type or source of log. It further can be said to have multiple structures in somewhat the same way that eBay Singularity does.</li>
<li>So-called <a href="../../../../../2011/02/07/notes-on-document-oriented-nosql/">document stores</a> can be very multi-structured. MongoDB, Couchbase, et al. let you have a different structure for every document, if you choose. The same goes for XML-based MarkLogic.</li>
<li>HBase and Cassandra are also very multi-structured. Theoretically, each record gets to decide which column sets it does or doesn&#8217;t fit into.</li>
</ul>
<p>As a general rule &#8212; the more structures a database can have at once, the easier it is to change those structures, even on the fly (e.g., by inserting yet another bit of self-describing data). Thus, I sometimes use the term <strong>polystructured </strong>instead of<strong> multi-structured </strong>or <strong>multistructured.</strong> Thoughts as to which term I should choose going forward would be much appreciated.</p>
<p>As for an actual definition &#8212; well, here&#8217;s something I drafted 3 1/2 years ago but never published:</p>
<blockquote><p>These problems with the relational paradigm are big enough to be worth coining a word for – polystructured. Polystructured data is data with structure that:</p>
<ul>
<li>Can be exploited to provide most of the benefits of a highly structured database (e.g., a tabular/relational one) &#8230;</li>
<li>&#8230; but cannot be described in the concise, consistent form such highly structured systems require.</li>
</ul>
<p>Specifically, we’ll call a database “polystructured” if it is characterized by at least two of the following:</p>
<ol>
<li>Data suitable for being queried by simple predicate-based matching (e.g., equality to certain values, falling within ranges, etc.)</li>
<li>(Other) data suitable for being queried by more complex matching (e.g., text search relevancy rankings)</li>
<li>Subsets that are more neatly structured than the whole.</li>
</ol>
<p>Equivalently, we’ll just say that <strong>polystructured data is data that has considerable structure, but whose structure is in some important way unpredictable.</strong></p></blockquote>
<p>NoSQL document or &#8220;column&#8221; stores would satisfy #1 and #3, as would Splunk. MarkLogic would satisfy all three criteria. #1 + #2 is sort of like what happens when text queries are allowed to go against (groups of) relational columns &#8230; and the vagueness with which I&#8217;m saying that makes me suspect that at least the unbolded/first definition doesn&#8217;t really fly.</p>
<p>Finally, here&#8217;s what led up to those definitions (the whole thing is from the introduction to a never-completed white paper). Please forgive any  anachronisms in it. A number of the points in it have also been addressed in posts  here; e.g.,</p>
<ul>
<li>In December, 2005 I expounded on <a href="http://www.dbms2.com/2005/12/09/relational-dbms-versus-text-data/">the  mismatch between text data and the relational model</a>.</li>
<li>In June, 2010 I elucidated <a href="http://www.dbms2.com/2010/06/08/profile-of-revealed-preferences/">the  variety of data that could go into an individual&#8217;s marketing-oriented  profile</a>.</li>
<li>In February, 2008 I predicted that <a href="../2008/02/15/non-relational-database-management/">flexible-schema   DBMS would gain share</a>.</li>
</ul>
<blockquote><p><strong>The case for polystructured data</strong></p>
<p>Traditional computer databases amount to sets of records. There usually are a limited number of record formats, with each instance of a particular format containing parallel kinds of information. Business transactions, web page visits, instrument readings &#8211; whatever the nature of the information, application designers stick it into the simplest structure they think makes sense.</p>
<p>These records are arranged into a variety of data structures.</p>
<ul>
<li>Log files are widely used, especially to track web site visits, in other networking uses, and for other kinds of instrument readings.</li>
<li>Computer user administration is commonly in LDAP (Lightweight Directory Access Protocol) format.</li>
<li>There are still a lot of installations of legacy “linked-list” DBMS (DataBase Management Systems) such as IBM&#8217;s IMS.</li>
<li>Some decision support applications use data in multidimensional arrays.</li>
</ul>
<p>Even so, most new business applications are written over relational DBMS, in the well-known rows-and-tables paradigm.</p>
<p>There are good reasons for the dominance of the relational model and of rows and tables.  (Strictly speaking, “relational” equates neither to “rows and tables” nor to “SQL”, but in practice the three concepts are closely linked.) In particular:</p>
<ul>
<li>Data integrity is (fairly) easy to ensure.</li>
<li>From some standpoints, relational databases are flexible; you can construct almost any kind of query, without having to do any kind of database reorganization (except perhaps for performance).</li>
<li>SQL programmers are easy to find.</li>
<li>There&#8217;s simply been much more engineering effort invested in making good relational DBMS than in any other kind.</li>
</ul>
<p>But the relational database paradigm also has some major drawbacks.  Three of the big ones are:</p>
<ul>
<li>Queries must have strictly match/fail answers; there&#8217;s no natural way for a relational DBMS to handle “somewhat relevant” hits.</li>
<li>Relational databases can get cumbersome when large fractions of the potential data happen to be missing. (Hence the decades-long debates about the problems with NULL values.)</li>
<li>While you have good flexibility in querying against any particular data structure, you do have to predefine your structure before you start accepting input.</li>
</ul>
<p>The last point is why you wind up with all those NULL values in the first place; if a kind of information can be in any record in a set, the database is set up to assume that it&#8217;s present in all of them. Or if you normalize your database so highly as to avert missing values, then you wind up with a huge number of tables, making queries (and updates) complicated from both the programmer&#8217;s and the machine&#8217;s standpoint.</p>
<p>Text apps suffer from RDBMS&#8217; inelegant handling of relevancy. What&#8217;s more, documents can have almost unlimited internal structures, in three senses:</p>
<ol>
<li>They can have chapters, sections, subsections,      sidebars, footnotes, and so on, in any combination.</li>
<li>Semantic references can link words, phrases,      sentences, and paragraphs in a near-infinite number of ways.</li>
<li>Documents can explicitly contain fielded data, such      as numbers, addresses, dates, or geo-encodings.</li>
</ol>
<p>Another group of apps that suffers from RDBMS&#8217; limitations is in the area of personalization and similar fine-grained marketing analysis. Analysis of web clicks throws away most kinds of path information. Analysis of written or verbal communication isn&#8217;t well-integrated with that of fielded data. Different customers and prospects give different kinds of contact information, and are “touched” by different marketing initiatives; current systems do a poor job of integrating all that scattered information.</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/05/15/what-to-do-about-unstructured-data/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
		</item>
	</channel>
</rss>

