<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS2 -- DataBase Management System Services &#187; Structured documents</title>
	<atom:link href="http://www.dbms2.com/category/datatype/native-xml-database/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Fri, 30 Jul 2010 15:51:32 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Advice for some non-clients</title>
		<link>http://www.dbms2.com/2010/07/30/advice-for-some-non-clients/</link>
		<comments>http://www.dbms2.com/2010/07/30/advice-for-some-non-clients/#comments</comments>
		<pubDate>Fri, 30 Jul 2010 14:35:52 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[HP and Neoview]]></category>
		<category><![CDATA[Information Builders]]></category>
		<category><![CDATA[Ingres]]></category>
		<category><![CDATA[Kalido]]></category>
		<category><![CDATA[Mark Logic]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Objectivity and Infinite Graph]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[SenSage]]></category>
		<category><![CDATA[Structured documents]]></category>
		<category><![CDATA[Tableau Software]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2699</guid>
		<description><![CDATA[Most of what I get paid for is in some form or other consulting. (The same would be true for many other analysts.) And so I can be a bit stingy with my advice toward non-clients. But my non-clients are a distinguished and powerful group, including in their number Oracle, IBM, Microsoft, and most of [...]]]></description>
			<content:encoded><![CDATA[<p>Most of what I get paid for is in some form or other consulting. (<a href="http://www.strategicmessaging.com/blurring-analyst-consultant-line/2010/07/28/" onclick="javascript:pageTracker._trackPageview('/www.strategicmessaging.com');">The same would be true for many other analysts</a>.) And so I can be a bit stingy with my advice toward non-clients. But my non-clients are a distinguished and powerful group, including in their number Oracle, IBM, Microsoft, and most of the BI vendors. So here&#8217;s a bit of advice for them too.</p>
<p><strong>Oracle. </strong>On the plus side, you guys have been making progress against your reputation for untruthfulness. Oh, I&#8217;ve dinged you for some <a href="http://www.dbms2.com/2008/09/30/oracle-crosses-the-line-on-integrity/" >past</a> <a href="http://www.dbms2.com/2008/06/28/response-to-rita-sallam-of-oracle/" >slip-ups</a>, but on the whole they&#8217;ve been no worse than other vendors&#8217;. But recently you pulled a doozy. The <a href="http://www.oracle.com/us/corporate/analystreports/infrastructure/index.html" onclick="javascript:pageTracker._trackPageview('/www.oracle.com');">analyst reports</a> section of your website fails to distinguish between unsponsored and sponsored work.* That is a horrible ethical stumble. Fix it fast. Then put processes in place to ensure nothing that dishonest happens again for a good long time.</p>
<p><em>*Merv Adrian&#8217;s &#8220;report&#8221; listed high on that page is actually a sponsored white paper. That Merv himself screwed up by not labeling it clearly as such in no way exonerates Oracle. Besides, I&#8217;m sure Merv won&#8217;t soon repeat the error &#8212; but for Oracle, this represents a whole pattern of behavior.</em></p>
<p><strong>Oracle.</strong> And while I&#8217;m at it, outright dishonesty isn&#8217;t your only unnecessary credibility problem. <a href="http://www.strategicmessaging.com/so-what-is-an-analyst-anyway/2010/07/25/" onclick="javascript:pageTracker._trackPageview('/www.strategicmessaging.com');">You&#8217;re also playing too many games in analyst relations</a>.</p>
<p><strong>HP.</strong> Neoview will never succeed. Admit it to yourselves. Go buy something that can.  <span id="more-2699"></span></p>
<p><strong>Smaller BI vendors.</strong> Analytic DBMS evaluations commonly include BI strategy and tool selection as well. If an analytic DBMS expert tells you he needs to learn more about your product line, don&#8217;t blow him off. In fact, you should be particularly embracing anybody who&#8217;s shown a fondness for small DBMS vendors; maybe he or his clients will like small BI vendors as well. That means (among others) you, <strong>Jaspersoft, Endeca, </strong>and <strong>Tableau.</strong></p>
<p><strong>Information Builders. </strong>Is there anything about your BI products that is in any way technologically differentiated? If so, you might want to mention some examples to somebody some time.</p>
<p><strong>Kalido.</strong> I&#8217;ve said this to you before, but it bears repeating &#8212; your positioning translates to &#8220;I-CASE for analytics,&#8221; and that&#8217;s not a good thing. If your product is not as cumbersome and entrapping as that sounds, you need to do a much better job of explaining why not.</p>
<p><strong>SenSage.</strong> You are what you are. Sell out while the selling is good. You don&#8217;t have the corporate personality to make it into the analytic DBMS mainstream on your own.</p>
<p><strong>Ingres. </strong>You need to be more engaged with analysts than you are. <a href="http://www.softwarememories.com/2010/07/25/ingres-history/" onclick="javascript:pageTracker._trackPageview('/www.softwarememories.com');">Ingres navel-gazed too much 25 years ago</a>, and evidently you haven&#8217;t outgrown it yet.</p>
<p><strong>TIBCO.</strong> You probably have a lot of cool analytic technology, but I don&#8217;t know of an influencer who has much relationship with or trust in you. Rethink how you&#8217;re approaching influencer relations top to bottom.</p>
<p><strong>Tableau.</strong> You had a lot of mindshare, but it&#8217;s fading. Do something.</p>
<p><strong>MarkLogic, graph DBMS vendors, etc.</strong> You&#8217;re clinging too hard to the NoSQL label. Nobody is out there deciding among Cassandra, neo4j, and MarkLogic. They might be deciding between MongoDB and MarkLogic, I guess, but if you admit to yourself that&#8217;s all it is you&#8217;ll probably change your messaging somewhat.</p>
<p><strong>Objectivity.</strong> Get real about marketing. Infinite Graph is a cool opportunity. But I didn&#8217;t even ping you for a meeting when I&#8217;m in your area next week, because I wouldn&#8217;t have known who to reach out to.</p>
<p><strong>Everybody (especially Objectivity).</strong> &#8220;First X deployed in the cloud&#8221; is almost surely an inaccurate claim. Don&#8217;t make it. And by the way, even if it were true, it probably wouldn&#8217;t be interesting.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/07/30/advice-for-some-non-clients/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Toward a NoSQL taxonomy</title>
		<link>http://www.dbms2.com/2010/03/14/nosql-taxonomy/</link>
		<comments>http://www.dbms2.com/2010/03/14/nosql-taxonomy/#comments</comments>
		<pubDate>Sun, 14 Mar 2010 23:24:45 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Structured documents]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1708</guid>
		<description><![CDATA[I talked Friday with Dwight Merriman, founder of 10gen (the MongoDB company). He more or less convinced me of his definition of NoSQL systems, which in my adaptation goes:
NoSQL = HVSP (High Volume Simple Processing) without joins or explicit transactions
Within that realm, Dwight offered a two-part taxonomy of NoSQL systems, according to their data model [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">I talked Friday with Dwight Merriman, founder of 10gen (the MongoDB company). He more or less convinced me of his definition of NoSQL systems, which in my adaptation goes:</p>
<p style="margin-bottom: 0in;"><strong>NoSQL = <a href="http://www.dbms2.com/2010/03/13/the-naming-of-the-foo/" >HVSP (High Volume Simple Processing)</a> without joins or explicit transactions</strong></p>
<p style="margin-bottom: 0in;">Within that realm, Dwight offered a two-part taxonomy of NoSQL systems, according to their data model and replication/sharding strategy. I&#8217;d be happier, however, with at least three parts to the taxonomy:</p>
<ul>
<li>How data looks logically on a 	single node</li>
<li>How data is stored physically on a 	single node</li>
<li>How data is distributed, 	replicated, and reconciled across multiple nodes, and whether 	applications have to be aware of how the data is partitioned among 	nodes/shards.<span id="more-1708"></span></li>
</ul>
<p style="margin-bottom: 0in;">After talking with Dwight, and also with Cassandra project chair Jonathan Ellis, I feel I&#8217;m doing decently in understanding the first of those three areas. But there&#8217;s a long way yet to go on the other two.</p>
<p style="margin-bottom: 0in;">In Dwight&#8217;s opinion, as I understand it, NoSQL data models come in four general kinds.</p>
<ul>
<li><em><strong>Key-value stores,</strong></em><em> more or less pure.</em> I.e., they store keys+BLOBs (Binary Large 	OBjects), except that the “Large” part of “BLOB” may not 	come into play.</li>
<li><em><strong>Table-oriented,</strong></em><em> more or less. </em>The major examples here are Google&#8217;s BigTable, and 	Cassandra.</li>
<li><em><strong>Document-oriented,</strong></em><em> where a “document” is more like XML than free text. </em>MongoDB 	and CouchDB are the big examples here.</li>
<li><strong><em>Graph-oriented.</em> </strong><span style="font-weight: normal;">To 	date, this is the smallest area of the four. I&#8217;m reserving judgment 	as to whether I agree it&#8217;s properly included in HVSP and NoSQL.</span></li>
</ul>
<p style="margin-bottom: 0in;">As Dwight sees it, JSON (JavaScript Object Notation) is the emerging markup standard for the document-oriented data models, and to some extent the BLOB part of key-value models as well. Reasons seem to include:</p>
<ul>
<li>JSON is something web developers 	are likely to know anyway.</li>
<li>JSON, unlike XML, is schema-less. 	In the NoSQL world, that&#8217;s perceived as a good thing.</li>
<li>Perhaps for both these reasons, 	JSON is perceived as easier to use than XML.</li>
</ul>
<p style="margin-bottom: 0in;">Except as noted, I&#8217;m not aware of anything that solidly contradicts the above.</p>
<p style="margin-bottom: 0in;">Dwight went on to say that there are two main NoSQL replication/sharding models, in line with the seminal papers to which I <a href="http://www.dbms2.com/2010/03/12/some-nosql-links/" >previously linked</a>:</p>
<ul>
<li><em>Based on or resembling </em><em><strong>Dynamo.</strong></em> The core idea here is accepting <strong>eventual consistency</strong> among 	nodes as being good enough, even if that means you sometimes read 	dirty data. The benefit is that <strong>you never are blocked from 	writing.</strong> By way of contrast, systems that enforce true 	inter-node consistency (think of a two-phase commit) can shut you 	down from writing if consistency guarantees aren&#8217;t being confirmed 	in a timely manner. Thus, in a Dynamo-like scheme you write data to 	multiple nodes, via <strong>consistent hashing;</strong> then when the time 	comes you read one or more nodes, and hope that what you&#8217;re getting 	back is a correct result.</li>
<li><em>Based on or resembling </em><em><strong>BigTable.</strong></em> In this model you&#8217;re trying to keep the 	nodes fully consistent in the usual way, e.g. by synchronous 	replication. Indeed, what&#8217;s being kept consistent is both data 	itself, and metadata about the data&#8217;s location. Details surely vary 	a lot from implementation to implementation.</li>
</ul>
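<p style="margin-bottom: 0in;">To make the consistent hashing idea a bit more concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not how Dynamo, Cassandra, or any other real system implements it: the node names, the use of MD5, the replica count of 2, and the absence of virtual nodes are all illustrative choices of mine. Each key is hashed onto a ring, and it is written to the next few nodes clockwise from that point.</p>
<pre>
```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a point on the hash ring (MD5 is an arbitrary choice here)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring; node names and replica count are illustrative."""

    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        # Sorted (position, node) pairs form the ring.
        self.ring = sorted((ring_position(n), n) for n in nodes)

    def nodes_for(self, key):
        """Walk clockwise from the key's position, collecting the next nodes."""
        positions = [p for p, _ in self.ring]
        start = bisect.bisect(positions, ring_position(key))
        count = min(self.replicas, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]


ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
owners = ring.nodes_for("user:42")  # the nodes this key would be written to

# Adding a node only reassigns keys that land in the new node's arc of the ring.
bigger = HashRing(["node-a", "node-b", "node-c", "node-d", "node-e"])
keys = [f"user:{i}" for i in range(1000)]
moved = sum(ring.nodes_for(k)[0] != bigger.nodes_for(k)[0] for k in keys)
print(owners, moved)
```
</pre>
<p style="margin-bottom: 0in;">The point of the second half is the property that makes this scheme attractive for sharding: when a node joins, a key either keeps its first owner or moves to the new node, so most data stays where it was.</p>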
<p style="margin-bottom: 0in;">I&#8217;m fuzzier on this stuff than on the data models, because to date nobody has ever explained to me how an actual live system (MongoDB, Cassandra, whatever) implements its replication strategy. Also, while I think that in both these models applications are allowed to be ignorant of the replication/sharding strategy, I&#8217;m not as sure of that as I&#8217;d like to be.</p>
<p style="margin-bottom: 0in;">If we stop here, we already have something useful. MongoDB has a document data model, and is in the BigTable-like replication camp, at least at first. Cassandra has a table-like data model, and is on the Dynamo-like eventual consistency side. But to say those are the only differences that matter would be like saying that all shared-disk RDBMS (e.g., Oracle and Sybase IQ) are essentially alike. That, of course, would be nonsense.</p>
<p style="margin-bottom: 0in;">So a third dimension needed in this taxonomy is how the systems actually bang data on and off of disk (or silicon, as the case may be). I don&#8217;t yet have an overview of that. I know something of how Cassandra does it, and will write about same in a future post, but that&#8217;s about it. So please stay tuned.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/03/14/nosql-taxonomy/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>This and that</title>
		<link>http://www.dbms2.com/2009/12/29/this-and-that/</link>
		<comments>http://www.dbms2.com/2009/12/29/this-and-that/#comments</comments>
		<pubDate>Tue, 29 Dec 2009 09:14:46 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Complex event processing (CEP)]]></category>
		<category><![CDATA[Mark Logic]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Structured documents]]></category>
		<category><![CDATA[Theory and architecture]]></category>
		<category><![CDATA[Vertica Systems]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1348</guid>
		<description><![CDATA[I have various subjects backed up that I don&#8217;t really want to write about at traditional blog-post length.  Here are a few of them.
Vertica offers a post on its 3.5 release, with a riff on the popular theme &#8220;We&#8217;ve fixed some weaknesses in our prior versions that we didn&#8217;t previously say we had.&#8221; More important, [...]]]></description>
			<content:encoded><![CDATA[<p>I have various subjects backed up that I don&#8217;t really want to write about at traditional blog-post length.  Here are a few of them.<span id="more-1348"></span></p>
<p><strong>Vertica</strong> offers a post on <a href="http://databasecolumn.vertica.com/database-innovation/vertica-3-5-flexstoretm-the-next-generation-of-column-stores/" onclick="javascript:pageTracker._trackPageview('/databasecolumn.vertica.com');">its 3.5 release</a>, with a riff on the popular theme &#8220;We&#8217;ve fixed some weaknesses in our prior versions that we didn&#8217;t previously say we had.&#8221; More important, Vertica is pretty clear on the virtues of its <a href="http://www.dbms2.com/2009/08/04/pax-analytica-row-and-column-stores-begin-to-come-together/" >hybrid columnar architecture</a>.</p>
<p>Speaking of which &#8212; <strong>Oracle is going true hybrid columnar</strong> as well. I don&#8217;t have details or timing, however.</p>
<p>Dave Kellogg of <strong>Mark Logic</strong> wrote in, amused, to point out <a href="http://www.oracle.com/technology/tech/xml/xmldb/Current/marklogicserver_4.1_v1.0.pdf" onclick="javascript:pageTracker._trackPageview('/www.oracle.com');" target="_blank">Oracle&#8217;s anti-MarkLogic collateral.</a> The very first charge Oracle levies is that MarkLogic goes beyond the emerging XQuery standard to add functionality. Considering Oracle&#8217;s approach to SQL standards, I tend to share Dave&#8217;s amusement.</p>
<p>Bill Conniff of <a href="http://www.xponentsoftware.com/" onclick="javascript:pageTracker._trackPageview('/www.xponentsoftware.com');">Xponent LLC</a> wrote in to tell of a vastly cheaper and less functional approach to <strong>XML management,</strong> apparently geared to looking at very large XML files one at a time.</p>
<p><strong>Cayuga</strong> is a Cornell research project in complex event processing (CEP). There&#8217;s a <a href="http://www.cs.cornell.edu/bigreddata/cayuga/" onclick="javascript:pageTracker._trackPageview('/www.cs.cornell.edu');">Cayuga academic home page</a>, a Sourceforge page for some <a href="http://sourceforge.net/projects/cayuga/" onclick="javascript:pageTracker._trackPageview('/sourceforge.net');">open source Cayuga CEP code</a>, and so on. Minsheng Hong, writing from a Vertica email address, tipped me off some months ago. The basic idea seems to be to do <em>lots</em> of queries very quickly, rather than a smaller number of queries over and over again. Whether this is an advance in anything but open-sourceness over Apama or Aleri I couldn&#8217;t say, but I do think it&#8217;s a different focus than that of StreamBase or pre-Aleri Coral8.</p>
<p>And finally, editor Doug Henschen listed his <a href="http://intelligent-enterprise.informationweek.com/blog/archives/2009/12/intelligent_ent_2.html;jsessionid=0YRB5UUISPBXLQE1GHRSKH4ATMY32JVN" onclick="javascript:pageTracker._trackPageview('/intelligent-enterprise.informationweek.com');">15 favorite <em>Intelligent Enterprise</em> blog posts of 2009</a> &#8212; four each by Seth Grimes and Doug himself, three by Cindi Howson, two by me,* and one each by Mark Smith and Neil Raden.</p>
<p><em>*Doug selects up to three posts a month from here to republish.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/12/29/this-and-that/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Technical introduction to Splunk</title>
		<link>http://www.dbms2.com/2009/10/18/technical-introduction-to-splunk/</link>
		<comments>http://www.dbms2.com/2009/10/18/technical-introduction-to-splunk/#comments</comments>
		<pubDate>Sun, 18 Oct 2009 16:01:06 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Structured documents]]></category>
		<category><![CDATA[Text]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1124</guid>
		<description><![CDATA[As noted in my other introductory post, Splunk sells software called Splunk, which is used for log analysis. These can be logs of various kinds, but for the purpose of understanding Splunk technology, it&#8217;s probably OK to assume they&#8217;re clickstream/network event logs. In addition, Splunk seems to have some aspirations of having its software used [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">As noted in <a href="http://www.dbms2.com/2009/10/18/general-introduction-to-splunk/" >my other introductory post</a>, Splunk sells software called Splunk, which is used for log analysis. These can be logs of various kinds, but for the purpose of understanding Splunk technology, it&#8217;s probably OK to assume they&#8217;re clickstream/network event logs. In addition, Splunk seems to have some aspirations of having its software used for general schema-free analytics, but that&#8217;s in early days at best.</p>
<p style="margin-bottom: 0in;">Splunk&#8217;s core technology indexes text and XML files or streams, especially log files. Technical highlights of that part include:<span id="more-1124"></span></p>
<ul>
<li>Splunk software both reads logs 	and indexes them. The same code runs both on the nodes that do the 	indexing and on machines that simply emit logs. However, in the 	latter case indexing is turned off. Thus, Splunk does not portray 	its software as “agentless.” However, it asserts that its 	agent-like software runs without “material” overhead.</li>
<li>The fundamental thing that Splunk 	looks at is an increment to a log – i.e., whatever has been added 	to the log since Splunk last looked at it.</li>
<li>Splunk tries to figure out what 	the individual entries are in a section of log it looks at.  In 	particular:
<ul>
<li>Time stamps are a big clue in this 	“inferencing” process, but they are not the be-all and end-all.</li>
<li>Nor are line boundaries, if logs 	are naturally broken up into lines. (Splunk threw that latter 	comment in as a shot at SenSage.)</li>
</ul>
</li>
<li>I get the impression that most 	Splunk entity extraction is done at search time, not at indexing 	time. Splunk says that, if a &lt;name, value&gt; pair is clearly 	marked, its software does a good job of recognizing same. Beyond 	that, fields seem to be specified by users when they define 	searches.</li>
<li>Splunk has a simple ILM 	(Information Lifecycle management) story based on time. I didn&#8217;t 	probe for details.</li>
</ul>
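<p style="margin-bottom: 0in;">As a rough illustration of the entry-inference and search-time field extraction described in the list above, here is a small Python sketch. It is emphatically not Splunk code. The timestamp pattern, the sample log text, and the function names <code>split_entries</code> and <code>extract_fields</code> are hypothetical, chosen only to show why timestamps can be a better clue than line boundaries.</p>
<pre>
```python
import re

# Hypothetical timestamp shape (ISO-like); real logs vary widely.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")
# Clearly marked name=value pairs, pulled out only when a search asks for them.
NAME_VALUE = re.compile(r"(\w+)=(\S+)")


def split_entries(increment):
    """Group lines into entries; a line that starts with a timestamp opens a new entry."""
    entries = []
    for line in increment.splitlines():
        if TIMESTAMP.match(line) or not entries:
            entries.append(line)
        else:
            # Continuation line (e.g., a stack trace) belongs to the previous entry.
            entries[-1] += "\n" + line
    return entries


def extract_fields(entry):
    """Return the name=value pairs found in one entry as a dict."""
    return dict(NAME_VALUE.findall(entry))


# An "increment": whatever was appended to the log since the last look.
increment = (
    "2009-10-18 16:01:06 ERROR request failed\n"
    "  Traceback (most recent call last):\n"
    "    ValueError: bad input\n"
    "2009-10-18 16:01:07 INFO user=alice action=login\n"
)
entries = split_entries(increment)
print(len(entries), extract_fields(entries[1]))
```
</pre>
<p style="margin-bottom: 0in;">A purely line-based splitter would report four entries here. Keying on timestamps keeps the three-line error together as one entry, which is the distinction Splunk was drawing.</p>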
<p style="margin-bottom: 0in;">Given its text search engine, Splunk does – well, it does text searches. And it stores searches, so they can be used for alerting or reporting. Indeed, Splunk persists and presumably updates results to stored searches, in a rough analog to materialized views.</p>
<p style="margin-bottom: 0in;">Apparently, Splunk&#8217;s indexing is typically done via MapReduce jobs. I don&#8217;t know whether any actual Splunk searches are also done via MapReduce; surely they aren&#8217;t all, given the discussion of a near-real-time alerting engine and so on. Splunk fondly believes its MapReduce is an order of magnitude faster than SQL (I didn&#8217;t ask which SQL engines Splunk has in mind when it says this), and 5-10X faster than Hadoop. One efficiency trick is to look ahead and do Reduces in place where possible. This seems to be done automatically in the execution plan, à la Aster&#8217;s SQL-MapReduce, rather than having to be hand-coded. Splunk says its software can “easily” index 100-200 gigabytes of data per day on a commodity 8-core server, while maintaining an active search load, and that 300-400 gigabytes is doable.</p>
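<p style="margin-bottom: 0in;">I read the &#8220;Reduces in place&#8221; trick as something like local pre-aggregation, in which each node shrinks its own map output before anything is shipped elsewhere. That reading is my assumption, not something Splunk confirmed. Under that assumption, a minimal Python sketch looks like this, with a made-up log format and hypothetical function names.</p>
<pre>
```python
from collections import Counter


def map_and_combine(lines):
    """Map each log line to (status, 1) and reduce locally before anything is shuffled."""
    local = Counter()
    for line in lines:
        status = line.split()[-1]  # toy format: the last token is an HTTP status
        local[status] += 1
    return local


def reduce_all(partials):
    """Final reduce: merge the already-shrunken per-partition counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Two partitions stand in for two nodes, each holding its own slice of the log.
partitions = [
    ["GET /a 200", "GET /b 404", "GET /a 200"],
    ["GET /c 200", "GET /d 500"],
]
partials = [map_and_combine(p) for p in partitions]
totals = reduce_all(partials)
print(dict(totals))
```
</pre>
<p style="margin-bottom: 0in;">The saving is in what crosses the network. Each partition sends one small table of counts rather than one record per log line.</p>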
<p style="margin-bottom: 0in;">Splunk&#8217;s capabilities right now in tabular-style analytics seem to be limited to a command-line report builder, plus a GUI wizard that generates the command line. A few users have asked for support of third-party business intelligence tools, but Splunk hasn&#8217;t provided that yet. Nor can I find much evidence of ODBC/JDBC drivers for Splunk. But then, I have trouble understanding how Splunk could provide flexible and robust reporting unless it tokenized and indexed specific fields more aggressively than I think it now does.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/18/technical-introduction-to-splunk/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>How 30+ enterprises are using Hadoop</title>
		<link>http://www.dbms2.com/2009/10/10/enterprises-using-hadoo/</link>
		<comments>http://www.dbms2.com/2009/10/10/enterprises-using-hadoo/#comments</comments>
		<pubDate>Sat, 10 Oct 2009 10:19:29 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Application areas]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Data types]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Structured documents]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Text]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1073</guid>
		<description><![CDATA[MapReduce is definitely gaining traction, especially but by no means only in the form of Hadoop. In the aftermath of Hadoop World, Jeff Hammerbacher of Cloudera walked me quickly through 25 customers he pulled from Cloudera&#8217;s files. Facts and metrics ranged widely, of course:

Some are in heavy production with 	Hadoop, and closely engaged with Cloudera. [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">MapReduce is definitely gaining traction, especially but by no means only in the form of Hadoop. In the aftermath of <a href="http://www.dbms2.com/2009/10/01/mapreduce-tidbits/" >Hadoop World</a>, Jeff Hammerbacher of Cloudera walked me quickly through 25 customers he pulled from Cloudera&#8217;s files. Facts and metrics ranged widely, of course:</p>
<ul>
<li>Some are in heavy production with 	Hadoop, and closely engaged with Cloudera. Others are active Hadoop 	users but are very secretive. Yet others signed up for initial 	Hadoop training last week.</li>
<li>Some have Hadoop clusters in the 	thousands of nodes. Many have Hadoop clusters in the 50-100 node 	range. Others are just prototyping Hadoop use. And one seems to be 	&#8220;OEMing&#8221; a small Hadoop cluster in each piece of equipment 	sold.</li>
<li>Many export data from Hadoop to a 	relational DBMS; many others just leave it in HDFS (Hadoop 	Distributed File System), e.g. with <a href="http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/" >Hive</a> as the query 	language, or in exactly one case Jaql.</li>
<li>Some are household names, in web 	businesses or otherwise. Others seem to be pretty obscure.</li>
<li>Industries include financial 	services, telecom (Asia only, and quite new), bioinformatics (and 	other research), intelligence, and lots of web and/or 	advertising/media.</li>
<li>Application areas mentioned &#8212; and 	these overlap in some cases &#8212; include:
<ul>
<li>Log and/or clickstream analysis of 	various kinds</li>
<li>Marketing analytics</li>
<li>Machine learning and/or 	sophisticated data mining</li>
<li>Image processing</li>
<li>Processing of XML messages</li>
<li>Web crawling and/or text 	processing</li>
<li>General archiving, including of 	relational/tabular data, e.g. for compliance</li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;"><span id="more-1073"></span>We went over this list so quickly that we didn&#8217;t go into much detail on any one user. But one example that stood out was of an ad serving firm that had an &#8220;aggregation pipeline&#8221; consisting of 70-80 MapReduce jobs.</p>
<p style="margin-bottom: 0in;">I also talked again yesterday with Omer Trajman of Vertica, who surprised me by indicating that a high single-digit number of Vertica&#8217;s customers were in production with Hadoop, i.e., over 10% of Vertica&#8217;s production customers. (Vertica recently made its 100th sale, and of course not all those buyers are in production yet.) <a href="http://www.dbms2.com/2009/08/04/verticas-version-of-mapreduce-integration/" >Vertica/Hadoop</a> usage seems to have started in Vertica&#8217;s financial services stronghold (specifically in financial trading), with web analytics and the like coming on afterwards. Based on current prototyping efforts, Omer expects bioinformatics to be the third production market for Vertica/Hadoop, with telecommunications coming in fourth.</p>
<p style="margin-bottom: 0in;">Unsurprisingly, the general Vertica/Hadoop usage model seems to be:</p>
<ul>
<li>Do something to the data in Hadoop</li>
<li>Dump it into Vertica to be queried</li>
</ul>
<p style="margin-bottom: 0in;">What I did find surprising is that the data often isn&#8217;t reduced by this analysis, but rather exploded in size.  E.g., a complete store of mortgage trading data might be a few terabytes in size, but Hadoop-based post processing can increase that by 1 or 2 orders of magnitude. (Analogies to the importance and magnitude of <em>&#8220;cooked&#8221; data</em> in scientific data processing come to mind.)</p>
<p style="margin-bottom: 0in;">And finally, I talked to Aster a few days ago about the usage of its nCluster/Hadoop connector. Aster characterized Aster/Hadoop users&#8217; Hadoop usage as being of the batch/ETL variety, which is the classic use case one concedes to Hadoop even if one believes that MapReduce should commonly be done right in the DBMS.</p>
<p style="margin-bottom: 0in;"><em><strong>Related link</strong></em></p>
<ul>
<li><a href="../2008/08/26/known-applications-of-mapreduce/">An 	August, 2008 round-up of MapReduce applications</a>.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/10/enterprises-using-hadoo/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>HadoopDB</title>
		<link>http://www.dbms2.com/2009/09/13/hadoopdb/</link>
		<comments>http://www.dbms2.com/2009/09/13/hadoopdb/#comments</comments>
		<pubDate>Sun, 13 Sep 2009 04:59:39 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Data types]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Structured documents]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=890</guid>
		<description><![CDATA[Despite a thoughtful heads-up from Daniel Abadi at the time of his original posting about HadoopDB, I&#8217;m just getting around to writing about it now.  HadoopDB is a research project carried out by a couple of Abadi&#8217;s students.  Further research is definitely planned. But it seems too early to say that HadoopDB will [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Despite a thoughtful heads-up from Daniel Abadi at the time of <a href="http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-longer.html" onclick="javascript:pageTracker._trackPageview('/dbmsmusings.blogspot.com');">his original posting about HadoopDB</a>, I&#8217;m just getting around to writing about it now.  HadoopDB is a research project carried out by a couple of Abadi&#8217;s students.  Further research is definitely planned. But it seems too early to say that HadoopDB will ever get past the &#8220;research and oh by the way the code is open sourced&#8221; stage and become a real code line &#8212; whether commercialized, open source, or both.</p>
<p style="margin-bottom: 0in;">The basic idea of HadoopDB is to put copies of a DBMS at different nodes of a grid, and use Hadoop to parcel work among them. Major benefits when compared with massively parallel DBMS are said to be:</p>
<ul>
<li>Open/cheap/free</li>
<li><a href="http://www.dbms2.com/2009/09/13/fault-tolerant-queries/" >Query fault-tolerance</a></li>
<li><span style="font-style: normal;">The 	related concept of tolerating node degradation that isn&#8217;t an 	outright node failure.</span></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">HadoopDB has actually been built with PostgreSQL. That version achieved performance well below that of a commercial DBMS &#8220;DBX&#8221;, where X=2. Column-store guru Abadi has repeatedly signaled his intention to try out HadoopDB with </span><a href="http://www.dbms2.com/2009/08/04/vectorwise-ingres-and-monetdb/" >VectorWise</a><span style="font-style: normal;"> at the nodes instead.  (Recall that VectorWise is shared-everything.) It will be interesting to see how that configuration performs.</span></p>
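<p style="margin-bottom: 0in;">Here is a minimal sketch of that basic idea, under loud assumptions: in-memory SQLite databases stand in for the per-node DBMS, a thread pool stands in for Hadoop, and the <code>sales</code> table and all function names are hypothetical. It is not HadoopDB code. It only shows the shape of the approach: push SQL down to each node, then merge the partial answers.</p>

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def make_node(rows):
    """Create one 'node': an independent in-memory SQLite database holding one partition."""
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


def map_query(conn):
    """Map step: run the SQL inside the node's local DBMS and return partial aggregates."""
    return conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    ).fetchall()


def reduce_partials(partials):
    """Reduce step: merge the per-node partial sums into one global result."""
    totals = {}
    for node_result in partials:
        for region, subtotal in node_result:
            totals[region] = totals.get(region, 0.0) + subtotal
    return totals


def run_query(nodes):
    """Parcel the same query out to every node in parallel, then combine the answers."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(map_query, nodes))
    return reduce_partials(partials)


# Two hypothetical nodes, each holding a different slice of the data.
nodes = [
    make_node([("east", 10.0), ("west", 5.0)]),
    make_node([("east", 2.5), ("north", 7.0)]),
]
print(run_query(nodes))
```

<p style="margin-bottom: 0in;">The point of the design is that each node does its own filtering and aggregation locally, so only small partial results cross the network. What the sketch leaves out entirely is the part HadoopDB cares most about, namely rescheduling work when a node fails or slows down.</p>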
<p style="margin-bottom: 0in;"><span style="font-style: normal;">The real opportunity for HadoopDB, however, in my opinion may lie elsewhere.<span id="more-890"></span> Rather than trying to compete with parallel relational DBMS, HadoopDB might do more good parallelizing more specialized kinds of database engines. How about, for example, a massively parallel XML manager to compete with MarkLogic? Or a massively parallel array processor other than the still-nascent </span><a href="http://www.dbms2.com/2009/09/12/xldb-scid/" >SciDB</a>? <span style="font-style: normal;">Or, even more to the point, something that parallelizes a yet-more-specialized scientific data management engine? That kind of area is where I suspect the potential for HadoopDB really lives.</span></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/09/13/hadoopdb/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>IBM&#8217;s Oracle emulation strategy reconsidered</title>
		<link>http://www.dbms2.com/2009/04/24/ibms-oracle-emulation-strategy-reconsidered/</link>
		<comments>http://www.dbms2.com/2009/04/24/ibms-oracle-emulation-strategy-reconsidered/#comments</comments>
		<pubDate>Sat, 25 Apr 2009 02:10:58 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data types]]></category>
		<category><![CDATA[Emulation, transparency, portability]]></category>
		<category><![CDATA[EnterpriseDB and Postgres Plus]]></category>
		<category><![CDATA[GIS and geospatial]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[Market share]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Pricing]]></category>
		<category><![CDATA[Structured documents]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=764</guid>
		<description><![CDATA[I&#8217;ve now had a chance to talk with IBM about its recently-announced Oracle emulation strategy for DB2. (This is for DB2 9.7, which I gather has been quasi-announced in April, will be re-announced in May, and will be re-re-announced as being in general availability in June.)
Key points include:

This really is more like Oracle 	emulation than [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">I&#8217;ve now had a chance to talk with IBM about its recently-announced Oracle emulation strategy for DB2. (This is for DB2 9.7, which I gather has been quasi-announced in April, will be re-announced in May, and will be re-re-announced as being in general availability in June.)</p>
<p style="margin-bottom: 0in;">Key points include:</p>
<ul>
<li>This really is more like Oracle 	<em><strong>emulation</strong></em> than it is <em>transparency,</em> a term I 	<a href="../2009/04/22/dbms-transparency-layers-never-seem-to-sell-well/">carelessly 	used</a> before.</li>
<li>IBM&#8217;s Oracle emulation effort is 	focused on two technological goals:
<ul>
<li>Making it easy for <strong>an Oracle 	application to be ported</strong> to DB2.</li>
<li>Making it easy for <strong>an Oracle 	developer to develop</strong> for DB2.</li>
</ul>
</li>
<li>The initial target market for 	DB2&#8217;s Oracle emulation is <strong>ISVs</strong> (Independent Software Vendors) 	much more than it is enterprises. IBM suggested there were a couple 	hundred early adopters, and those are primarily in the ISV area.</li>
</ul>
<p style="margin-bottom: 0in;">Because of Oracle&#8217;s market share, many ISVs focus on Oracle as the underlying database management system for their applications, whether or not they actually resell it along with their own software.  IBM proposed three reasons why such ISVs might want to support DB2:<span id="more-764"></span></p>
<ul>
<li><strong>Oracle is expensive.</strong> In 	particular, IBM suggested it is more flexible on licensing terms for 	resale than Oracle is.  I find that easy to believe.</li>
<li>Hey, there&#8217;s a <strong>DB2 market or 	installed base</strong> out there of some size &#8212; why not address it?</li>
<li>Acquisition-fueled expansion in 	applications<strong> makes Oracle a much bigger competitor to many ISVs </strong>(all around the world) than it used to be before.  That one makes 	all kinds of sense.</li>
</ul>
<p style="margin-bottom: 0in;">And by the way &#8212; if I wanted an Oracle-emulating DBMS, I&#8217;d feel a lot happier about doing business with IBM than I would with EnterpriseDB.</p>
<p style="margin-bottom: 0in;">IBM feels that DB2&#8217;s Oracle compatibility is a strict superset of <a href="../2008/07/07/enterprisedbf-oracle-compatibility/">EnterpriseDB&#8217;s</a>, which it presumably has carried over more or less in its entirety.  I didn&#8217;t press too hard for examples of what Oracle emulation DB2 offers and EnterpriseDB doesn&#8217;t, but IBM did say something about support for more programming languages.  IBM was clear on one broad area where DB2 does not offer Oracle emulation, which is the specifics of various kinds of datatype support or other specialized data access methods.  For example, IBM has its own syntax for querying text, geospatial, or XML data, and has not added support for Oracle&#8217;s alternative approaches.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/04/24/ibms-oracle-emulation-strategy-reconsidered/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Schema flexibility and XML data management</title>
		<link>http://www.dbms2.com/2008/10/05/schema-flexibility-and-xml-data-management/</link>
		<comments>http://www.dbms2.com/2008/10/05/schema-flexibility-and-xml-data-management/#comments</comments>
		<pubDate>Sun, 05 Oct 2008 12:53:36 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[Structured documents]]></category>
		<category><![CDATA[pureXML]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=594</guid>
		<description><![CDATA[Conor O&#8217;Mahony, marketing manager for IBM&#8217;s DB2 pureXML talks a lot about one of my favorite hobbyhorses &#8212; schema flexibility &#8212; as a reason to use an XML data model.  In a number of industries he sees use cases based around ongoing change in the information being managed:

Tax authorities change their rules 	and forms [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;"><a href="http://nativexmldatabase.com/" onclick="javascript:pageTracker._trackPageview('/nativexmldatabase.com');">Conor O&#8217;Mahony</a>, marketing manager for <a href="http://www.dbms2.com/2008/10/05/overview-of-ibm-db2-purexml/" >IBM&#8217;s DB2 pureXML</a> talks a lot about one of my favorite hobbyhorses &#8212; <strong>schema flexibility</strong> &#8212; as a reason to use an XML data model.  In a number of industries he sees use cases based around ongoing change in the information being managed:</p>
<ul>
<li>Tax authorities change their rules 	and forms every year, but don&#8217;t want to do total rewrites of their 	electronic submission and processing software.</li>
<li>The financial services industry 	keeps inventing new products, which don&#8217;t just have different terms 	and conditions, but may also have different <em>kinds</em> of terms 	and conditions.</li>
<li>The same, to some extent, goes for 	the travel industry, which also keeps adding different and different 	kinds of destinations.</li>
<li>The energy industry keeps adding 	new kinds of highly complex equipment it has to manage.</li>
</ul>
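<p style="margin-bottom: 0in;">To make "different <em>kinds</em> of terms and conditions" concrete, here is a small sketch using Python&#8217;s standard <code>xml.etree.ElementTree</code> module. The product names, element names, and helper functions are all hypothetical. The point is only that two documents with unrelated sets of terms can sit in the same collection and still be queried by path, with no shared column list to redesign.</p>

```python
import xml.etree.ElementTree as ET


def make_product(name, terms):
    """Build a product document whose terms are whatever elements this product needs."""
    product = ET.Element("product", {"name": name})
    terms_el = ET.SubElement(product, "terms")
    for tag, value in terms.items():
        ET.SubElement(terms_el, tag).text = str(value)
    return product


# Two hypothetical products with different kinds of terms.
catalog = [
    make_product("fixed-rate loan", {"rate": "5.0", "maturityYears": "10"}),
    make_product("barrier option", {"strike": "100", "barrier": "120", "knockOut": "true"}),
]


def term_names(product):
    """List the kinds of terms a given document actually carries."""
    return sorted(child.tag for child in product.find("terms"))


def products_with(catalog, tag):
    """Path query: names of the products that have the given kind of term at all."""
    return [p.get("name") for p in catalog if p.find("terms/" + tag) is not None]


for p in catalog:
    print(p.get("name"), term_names(p))
print(products_with(catalog, "barrier"))
```

<p style="margin-bottom: 0in;">Adding a third product with yet another kind of term would mean one more <code>make_product</code> call, not a schema change.</p>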
<p style="margin-bottom: 0in;">Conor also thinks market evidence shows that XML&#8217;s schema flexibility is important for data interchange.<span id="more-594"></span> For example, hospitals (especially in the US) have disparate medical records and billing systems, which can make information interchange a chore.</p>
<p style="margin-bottom: 0in;">The second suggestion is probably the less controversial of the two.  After all, everybody knows that data is very commonly exchanged in XML formats.  So if it gets persisted in XML format somewhere along the way, even relational purists shouldn&#8217;t much mind, as long as it eventually gets into what they regard as a more properly structured database.  (Besides &#8212; if the data is going on long, challenging, multi-stage journeys, then nobody should much blame it if it indeed wants to stop along the way somewhere and rest. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  )</p>
<p style="margin-bottom: 0in;">In the first group of examples, there&#8217;s usually also a kind of cooperation between native XML and other kinds of database managers.  Before those users had access to XML, they were getting by just fine using other database technology.  So XML can be used in conjunction with other systems, not as complete replacement.  Even so, it&#8217;s reasonable to consider scenarios in which <strong>XML is the primary data model of record, and relational/tabular copies of the information are secondary.</strong></p>
<p style="margin-bottom: 0in;">For example, an income tax authority wants to store your tax form in its entirety, so that it can check both your truthfulness and your arithmetic.  This is most naturally done in XML, although for many years it&#8217;s been done in relational or pre-relational technologies. The authority also wants to extract a limited amount of information from each taxpayer&#8217;s form for all sorts of aggregation and administrative purposes; that&#8217;s best done in a relational database. But the part that belongs in XML is the most fundamental.</p>
<p style="margin-bottom: 0in;">As another example, the core information of a derivatives transaction is:</p>
<ul>
<li>The derivatives contract 	(naturally stored in XML)</li>
<li>The actual purchase/sale 	information (traditionally stored in relational systems)</li>
<li>Account balances of various kinds 	altered by the transaction (a classic case where relational 	databases guarantee much-needed data integrity)</li>
</ul>
<p style="margin-bottom: 0in;">Here the majority of the basic record fits best in XML.  The minority that fits best in a relational system is small enough that a good XML DBMS can probably handle it as well.  Neither the superior OLTP performance nor data integrity safeguards of a relational DBMS are needed for the purchase/sale information.  They <em>are</em> needed for the general account management – but again, that&#8217;s a relatively secondary or (no pun intended!) derivative part of the overall database.</p>
<p style="margin-bottom: 0in;">So what we&#8217;re coming up with here is a strategy along the lines of:</p>
<ol>
<li><strong>Use XML for your system of 	record.</strong></li>
<li><strong>Spawn transactions in your 	relational/tabular data stores right away.</strong></li>
</ol>
<p style="margin-bottom: 0in;">And by the way, while I haven&#8217;t dwelled on this – those relational/tabular data stores could be data warehouses instead of or in addition to transactional systems.</p>
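<p style="margin-bottom: 0in;">The two-step strategy above can be sketched as follows, using the tax form example. Everything here is an assumption made for illustration: a single in-memory SQLite database stands in for both the XML store and the relational store, and the table names, element names, and functions are hypothetical. A real deployment would presumably use a native XML DBMS for the first write.</p>

```python
import sqlite3
import xml.etree.ElementTree as ET


def build_tax_form(taxpayer_id, year, lines):
    """Build a hypothetical tax form as an XML tree; 'lines' maps line names to amounts."""
    form = ET.Element("taxForm", {"taxpayer": taxpayer_id, "year": str(year)})
    for name, amount in lines.items():
        ET.SubElement(form, "line", {"name": name}).text = str(amount)
    return form


def record_form(conn, form):
    """Step 1: persist the whole document. Step 2: spawn a relational summary row right away."""
    xml_text = ET.tostring(form, encoding="unicode")
    total = sum(float(line.text) for line in form.findall("line"))
    key = (form.get("taxpayer"), int(form.get("year")))
    with conn:  # a single transaction covers both writes
        conn.execute("INSERT INTO forms_xml (taxpayer, year, doc) VALUES (?, ?, ?)",
                     key + (xml_text,))
        conn.execute("INSERT INTO form_summary (taxpayer, year, total) VALUES (?, ?, ?)",
                     key + (total,))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE forms_xml (taxpayer TEXT, year INTEGER, doc TEXT)")
conn.execute("CREATE TABLE form_summary (taxpayer TEXT, year INTEGER, total REAL)")

record_form(conn, build_tax_form("T-001", 2008, {"wages": 50000, "interest": 120.5}))
record_form(conn, build_tax_form("T-002", 2008, {"wages": 42000, "dividends": 300}))

# Aggregation runs against the tabular copy; the full forms stay available as XML.
print(conn.execute(
    "SELECT year, COUNT(*), SUM(total) FROM form_summary GROUP BY year"
).fetchall())
```

<p style="margin-bottom: 0in;">Note that the two forms carry different lines, and the summary table doesn&#8217;t care. The document is the record; the row is a derived convenience.</p>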
<p style="margin-bottom: 0in;">Obviously, there are two major classes of objections to this strategy (when it is contrasted with a traditional relational approach):</p>
<ul>
<li>Assertions that the extra programming effort needed to assure data integrity is so important as to outweigh all other considerations.</li>
<li>Assertions that the need for 	schema flexibility isn&#8217;t really that high, or at least wouldn&#8217;t be 	if the enterprises&#8217; database designers were sufficiently competent.</li>
</ul>
<p style="margin-bottom: 0in;">Well, we&#8217;ll see.  So far the customer uptake for the native XML approach is small but non-zero. And thus the issue is far from being decided.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2008/10/05/schema-flexibility-and-xml-data-management/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Vertical market XML standards</title>
		<link>http://www.dbms2.com/2008/10/05/vertical-market-xml-standards/</link>
		<comments>http://www.dbms2.com/2008/10/05/vertical-market-xml-standards/#comments</comments>
		<pubDate>Sun, 05 Oct 2008 12:43:31 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Application areas]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[Structured documents]]></category>
		<category><![CDATA[pureXML]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=593</guid>
		<description><![CDATA[Tracking the alphabet soup of vertical market XML standards is hard.  So as a starting point, I&#8217;m splitting a list I got from IBM into a standalone post.
Among the most important or successful IBM pureXML-supported standards, in terms of downloads and other evidence of customer interest, are:

UNIFI (UNIversal Financial Industry message scheme). According to [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Tracking the alphabet soup of vertical market XML standards is hard.  So as a starting point, I&#8217;m splitting a list I got from IBM into a standalone post.</p>
<p style="margin-bottom: 0in;">Among the most important or successful <a href="http://www.dbms2.com/2008/10/05/overview-of-ibm-db2-purexml/" >IBM pureXML</a>-<a href="http://www.alphaworks.ibm.com/tech/purexml" onclick="javascript:pageTracker._trackPageview('/www.alphaworks.ibm.com');">supported</a> standards, in terms of downloads and other evidence of customer interest, are:<span id="more-593"></span></p>
<ul>
<li><em>UNIFI (UNIversal Financial Industry message scheme). </em><span style="font-style: normal;">According to ISO 20022, it&#8217;s a 	standard that any electronic fund transfer must be accompanied by a 	UNIFI document. This standard seems important at least in SEPA (the 	Single European Payment Area).  UNIFI seems to be a major pureXML 	use case.</span></li>
<li><em>FpML (Financial Products Markup Language)</em>.  FpML is 	used in the derivatives market, to actually create contracts and 	hence derivative securities.</li>
<li><em>ACORD (Association for Cooperative Operations, Research and Development).</em> When one discusses XML industry standards, ACORD is usually one of the first to come up.  It is used in the life insurance agency, also for contract definition/creation. The top use case is agents, although I speculate it&#8217;s used among investment managers and reinsurers as well.</li>
<li><em>STAR (Standards for Technology in Automotive Retail).</em> STAR is used by car manufacturers and retailers to exchange 	information – about cars, no doubt, and probably some other things 	as well.</li>
<li><em>HR-XML,</em> which is used by human resource departments 	for resumes and other kinds of employee information.</li>
<li><em>OTA (Open Travel Alliance). </em><span style="font-style: normal;">This 	is an XML schema used by usual-suspect players in the travel 	business – hotel chains, car renters, travel agents, and probably 	anybody else who deals in travel reservations.</span></li>
</ul>
<p>I&#8217;ve heard at least about the financial services ones from other 	XML database vendors as well.  I&#8217;m a little surprised that nothing 	from the health area made the top of the list.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2008/10/05/vertical-market-xml-standards/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Overview of IBM DB2 pureXML</title>
		<link>http://www.dbms2.com/2008/10/05/overview-of-ibm-db2-purexml/</link>
		<comments>http://www.dbms2.com/2008/10/05/overview-of-ibm-db2-purexml/#comments</comments>
		<pubDate>Sun, 05 Oct 2008 12:34:51 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[Structured documents]]></category>
		<category><![CDATA[pureXML]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=592</guid>
		<description><![CDATA[On August 29, I had a great call with IBM about DB2 pureXML (most of the IBM side of the talking was done by Conor O&#8217;Mahony and Qi Jin).  I&#8217;m finally getting around to writing it up now. (The world of tabular data warehousing has kept me just a wee bit busy &#8230;)
As I [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">On August 29, I had a great call with IBM about DB2 pureXML (most of the IBM side of the talking was done by Conor O&#8217;Mahony and Qi Jin).  I&#8217;m <em>finally</em> getting around to writing it up now. (The world of tabular data warehousing has kept me just a wee bit busy &#8230;)</p>
<p style="margin-bottom: 0in;">As I write it, I see there are a considerable number of holes, but that&#8217;s the way it seems to go when researching XML storage.  I&#8217;m also writing up a September call from which I finally figured out (I think) the essence of <a href="http://www.dbms2.com/2008/10/05/marklogic-architecture-deep-dive/" >how MarkLogic Server works</a> – but only after five months of trying. It turns out that MarkLogic works rather differently from DB2 pureXML.  Not coincidentally, IBM and Mark Logic focus on rather different use cases for native XML storage.</p>
<p style="margin-bottom: 0in;">What I understand so far about the basic DB2 pureXML architecture goes like this:<span id="more-592"></span></p>
<ul>
<li><strong>DB2 pureXML stores XML in “true 	hierarchical format.” </strong> Based on all the discussion of 	indexing, I&#8217;m guessing that the way it does this is somewhat similar 	to that in <em>MarkLogic</em>.</li>
<li>Unlike MarkLogic, DB2 pureXML 	gives you the choice of what tags to index on.</li>
<li>In a big difference from MarkLogic, <strong>text search on DB2 pureXML involves two separate indexes</strong> – XML and text (the latter being of the usual inverted-list variety). You can text-search both contents and tags, with the usual CONTAINS semantics.</li>
<li>PureXML has <strong>a data store 	separate from the rest of DB2&#8217;s,</strong> notwithstanding IBM&#8217;s 	references to XML “columns.”  DB2&#8217;s general datatype 	extensibility framework is not used; I don&#8217;t wholly understand why.</li>
<li>I neglected to ask how well DB2 	backup, management tools, and so on extend to DB2 pureXML.</li>
<li>You can talk to DB2 pureXML via two data manipulation languages: <strong>SQL/XML, </strong><span>and </span><strong>XQuery.</strong> Both are compiled down to the same run-time instructions.  IBM said there&#8217;s an abstraction layer sitting over both the relational store and the XML store that allows for this.  I don&#8217;t totally understand what that means, since presumably the SQL/XML starts out by being sent to DB2&#8217;s parser.</li>
</ul>
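<p style="margin-bottom: 0in;">To illustrate the "two separate indexes" point, here is a toy sketch in Python. It is emphatically not how DB2 pureXML is implemented, which I don&#8217;t know in that detail. The class, its method names, and the sample documents are all hypothetical. It just shows a structural index built only on the tags you choose, next to an inverted-list text index that covers both contents and tag names.</p>

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_doc(doc_id, fields):
    """Build a small hypothetical document with one child element per field."""
    doc = ET.Element("claim", {"id": doc_id})
    for tag, text in fields.items():
        ET.SubElement(doc, tag).text = text
    return doc


class TwoIndexStore:
    """Toy store: a value index on chosen tags plus a separate inverted text index."""

    def __init__(self, indexed_tags):
        self.indexed_tags = set(indexed_tags)  # the tags the designer chose to index
        self.value_index = defaultdict(set)    # (tag, value) -> document ids
        self.text_index = defaultdict(set)     # word -> document ids

    def add(self, doc):
        doc_id = doc.get("id")
        for el in doc:
            text = el.text or ""
            if el.tag in self.indexed_tags:
                self.value_index[(el.tag, text)].add(doc_id)
            # The inverted list covers the words of the content and the tag name itself.
            for word in text.lower().split() + [el.tag.lower()]:
                self.text_index[word].add(doc_id)

    def by_value(self, tag, value):
        """Structural lookup; only possible on tags that were chosen for indexing."""
        if tag not in self.indexed_tags:
            raise KeyError("no structural index on " + tag)
        return sorted(self.value_index.get((tag, value), set()))

    def contains(self, word):
        """Text lookup with simple single-word CONTAINS-style semantics."""
        return sorted(self.text_index.get(word.lower(), set()))


store = TwoIndexStore(indexed_tags=["status"])
store.add(build_doc("c1", {"status": "open", "note": "water damage in basement"}))
store.add(build_doc("c2", {"status": "closed", "note": "water heater replaced", "adjuster": "Lee"}))

print(store.by_value("status", "open"))
print(store.contains("water"))
print(store.contains("adjuster"))
```

<p style="margin-bottom: 0in;">The trade-off the sketch exposes is the obvious one: choosing which tags to index keeps the structural index small, at the cost of having to know your queries in advance.</p>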
<p style="margin-bottom: 0in;">A big part of IBM&#8217;s XML business strategy is to support various (typically vertical market) XML standards. IBM has implemented support for these standards and made it freely downloadable. What does “support” mean? It surely starts with a DTD (Document Type Definition), and apparently also includes <a href="http://www.alphaworks.ibm.com/tech/purexml" onclick="javascript:pageTracker._trackPageview('/www.alphaworks.ibm.com');">mappings to generic web services interfaces</a>.  It turns out that there are a lot of them, so I&#8217;m listing some in a <a href="http://www.dbms2.com/2008/10/05/vertical-market-xml-standards/" >separate post</a>.</p>
<p style="margin-bottom: 0in;">More generally, it seems that the sales and uses for IBM pureXML are concentrated in two main (overlapping) cases:</p>
<ul>
<li><strong>When XML was going to be used 	anyway.</strong> One big example of this is the case of the 	standards-based industry data interchanges.   Another example is 	when pureXML, albeit disk-based, acts as a kind of quasi-cache or 	mini-MDM hub (Master Data Management) for WebSphere-based enterprise 	application integration (EAI).  IBM reports that DB2 pureXML has 	been sold as an intermediate EAI data store at least once each in 	banking, retailing, health care, and insurance.</li>
<li><strong>When schema flexibility is of 	great importance.</strong></li>
</ul>
<p style="margin-bottom: 0in;">Experience teaches me that schema flexibility is a subject that can attract considerable flames, in the general vein of “Omigod! The relational model is perfect because it&#8217;s mathematically proven to ensure referential integrity!!” So I&#8217;ll split out the main discussion of that into yet another <em>separate post,</em> and keep going.</p>
<p style="margin-bottom: 0in;">IBM actually breaks out the pureXML use cases into four main groups:</p>
<ol>
<li><span style="font-style: normal;"><strong>Transactional.</strong></span><em> </em> This comprises the transactional logging of information that 	just happens to be XML, such as in financial services.</li>
<li><span style="font-style: normal;"><strong>Forms-oriented.</strong></span> This comprises, for example, the tax authority use case.</li>
<li><span style="font-style: normal;"><strong>Service 	bus acceleration.</strong></span> That&#8217;s a fancy phrase to cover both 	the standards-based interchanges and the other EAI-related uses.</li>
<li><span style="font-style: normal;"><strong>Event-driven 	data warehousing. </strong></span> This one was kind of blurry to me.  	What I think it means is that if you have transactional data in XML, 	and you want to use it in near-real-time business intelligence, DB2 	pureXML can help you with that.</li>
</ol>
<p style="margin-bottom: 0in;">#1, 3, and 4 seem to fit into my “When XML was going to be used anyway” category.  Part of “Schema flexibility” matches #2; I&#8217;m not clear on where in IBM&#8217;s four buckets the rest of schema flexibility goes.</p>
<p style="margin-bottom: 0in;">Finally, I asked directly in what areas there were significant numbers of DB2 pureXML customers.  IBM offered two examples.  One was financial services in general &#8212; especially in North America, notwithstanding the importance of the UNIFI standard in Europe.  The other was health care data interchange outside the United States &#8212; especially in China, where regional and national centers are being established to more closely oversee local hospitals.</p>
<p style="margin-bottom: 0in;"><em><strong>Related links</strong></em></p>
<ul>
<li>IBM kindly gave me permission to 	make available the <a href="http://www.monash.com/uploads/DB2-pureXML-Aug-2008.pdf" onclick="javascript:pageTracker._trackPageview('/www.monash.com');">slide presentation</a> from our August 29 briefing.  	The last page has a large number of links to further IBM pureXML 	resources.</li>
<li>Conor O&#8217;Mahony has a <a href="http://nativexmldatabase.com/" onclick="javascript:pageTracker._trackPageview('/nativexmldatabase.com');">good blog</a>.</li>
<li>As noted above, I am putting up 	separate posts on <a href="http://www.dbms2.com/2008/10/05/vertical-market-xml-standards/" >standards-based data interchange</a> and <a href="http://www.dbms2.com/2008/10/05/schema-flexibility-and-xml-data-management/" >schema 	flexibility</a>.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2008/10/05/overview-of-ibm-db2-purexml/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
	</channel>
</rss>
