<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS2 -- DataBase Management System Services &#187; Text</title>
	<atom:link href="http://www.dbms2.com/category/datatype/text-search/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Fri, 19 Mar 2010 15:49:58 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>A framework for thinking about data warehouse growth</title>
		<link>http://www.dbms2.com/2009/12/07/data-warehouse-volume-growth/</link>
		<comments>http://www.dbms2.com/2009/12/07/data-warehouse-volume-growth/#comments</comments>
		<pubDate>Mon, 07 Dec 2009 13:50:47 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Application areas]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Storage]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Text]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1278</guid>
		<description><![CDATA[There are only three ways that the amount of data stored in data warehouses can grow:

The same kinds of data are 	stored as before, with more being added over time.
The same kinds of data are stored 	as before, but in more detail.
New kinds of data are 	stored.

The first of those three ways doesn&#8217;t lead to [...]]]></description>
			<content:encoded><![CDATA[<p>There are only three ways that the amount of data stored in data warehouses can grow:</p>
<ul>
<li><strong>The same kinds of data are 	stored as before, </strong>with more being added over time.</li>
<li>The same kinds of data are stored 	as before, but in <strong>more detail.</strong></li>
<li><strong>New kinds</strong> of data are 	stored.</li>
</ul>
<p style="margin-bottom: 0in;"><span id="more-1278"></span>The first of those three ways doesn&#8217;t lead to dramatic growth. If a data warehouse goes up from 5 years of data to 6, then its overall size will grow a little over 20%. (How little depends on what the underlying business growth is – i.e., on how many more business events you have next year than you had 3 years ago.) That&#8217;s almost certainly going to be handled well by whatever technology manages your data warehouse today, given that:</p>
<ul>
<li>Chips are still subject to 	something resembling Moore&#8217;s Law.</li>
<li>Disk capacity is still subject to 	Kryder&#8217;s Law, which is like Moore&#8217;s Law but with yet faster growth 	rates.</li>
<li>DBMS software gets more performant 	over time.</li>
</ul>
<p style="margin-bottom: 0in;">So <strong>the cost of managing your same-as-before data will go down every year,</strong> even as the volume of that data grows.</p>
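<p style="margin-bottom: 0in;">The “a little over 20%” figure for going from 5 years of data to 6 can be checked with a few lines of arithmetic. This is only a sketch: it assumes, purely for illustration, that each year&#8217;s data volume is a constant multiple of the previous year&#8217;s, and the function name and growth rates are hypothetical rather than measured.</p>

```python
def warehouse_growth(years_kept, yearly_growth):
    """Fractional growth in total size when one more year of data is added.

    Assumes (hypothetically) that year k holds (1 + yearly_growth) ** k
    units of data, with the oldest year being k = 0.
    """
    ratio = 1.0 + yearly_growth
    existing = sum(ratio ** k for k in range(years_kept))
    newest = ratio ** years_kept
    return newest / existing


if __name__ == "__main__":
    # Illustrative business growth rates only.
    for rate in (0.0, 0.05, 0.10):
        pct = 100 * warehouse_growth(5, rate)
        print(f"business growth {rate:.0%}: warehouse grows {pct:.1f}%")
```

<p style="margin-bottom: 0in;">With flat business volume the result is exactly 20%; any positive business growth pushes it somewhat higher, because the newest year is larger than the average of the years already stored.</p>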
<p style="margin-bottom: 0in;">True, <a href="../2005/11/13/breaking-the-disk-speed-barrier/">disk rotation speeds have only increased 12.5 times since the Eisenhower Administration</a>. But <a href="../2009/10/25/teradata-hardware-strategy-and-tactics/">solid-state drives (SSDs) are getting practical for data warehousing</a> fast, so even that bottleneck eventually will get swept away. And since what we&#8217;re discussing is, basically, the first and hence presumably highest-value data to be warehoused, it&#8217;s apt to wind up on SSDs before some other kinds of data warrant that treatment.  So it&#8217;s the two other factors that drive the greatest data warehouse growth.</p>
<p style="margin-bottom: 0in;">As costs go down, the wisdom of keeping <strong>detailed data</strong> goes up. I&#8217;d go so far as to say that <strong>every piece of data generated by a human being should be preserved and kept online,</strong> legal and privacy considerations permitting.* Most forms of capital-, labor-, and/or location-based competitive advantage are being commoditized and/or globalized away. But information remains a unique corporate asset.  Don&#8217;t discard it lightly.</p>
<p style="margin-bottom: 0in;"><em>*Unless there&#8217;s an explicit law mandating data destruction, legal considerations </em>should <em>permit. The idea “Let&#8217;s destroy something of irreplaceable value today, against the possibility we might be brought to judgment tomorrow” is both morally and pragmatically weird. Privacy, however, may be a different matter.</em></p>
<p style="margin-bottom: 0in; font-style: normal;">What that means in practice is that “disk is the new tape.” No-apologies performance can be had on data warehouse systems for <a href="http://www.dbms2.com/2009/07/30/the-netezza-price-point/" >$20,000/terabyte</a> or less – perhaps even <a href="http://www.dbms2.com/2009/10/19/greenplum-free-single-node-edition/" >a lot less</a>. Tolerable performance may cost 3-4X less than that. I think a lot of the growth in data warehouse volumes is of exactly this kind.</p>
<p style="margin-bottom: 0in; font-style: normal;">Ultimately, however, the greatest growth in data warehouse volumes will come from <strong>new kinds of data,</strong> especially data that is partly or wholly <strong>machine-generated.</strong><span> Moore&#8217;s Law applied to sensor chips tells us that data creation will grow just as fast as the data storage capacity. And thus </span><strong>we will be throwing away most machine-generated data forever.</strong><span> But what we keep will grow – well, it probably will grow at Moore&#8217;s/Kryder&#8217;s Law speeds.</span></p>
<p style="margin-bottom: 0in; font-style: normal;"><span>That&#8217;s not to say new kinds of data are all high-volume/machine-generated. Back in 2005, I wrote </span><span><a href="http://www.computerworld.com/s/article/103054/More_Data_Makes_Your_Business_Grow?taxonomyId=9&amp;pageNumber=2" onclick="javascript:pageTracker._trackPageview('/www.computerworld.com');">two</a> <a href="http://blogs.computerworld.com/node/512" onclick="javascript:pageTracker._trackPageview('/blogs.computerworld.com');">pieces</a></span><span> for </span><em><span>Computerworld</span></em><span> advocating aggressive pursuit of new data sources, and the examples I mentioned were:</span></p>
<ul>
<li><span>Loyalty cards, especially 	in gaming</span></li>
<li>Location-based analytics</li>
<li>Extra customer feedback (e.g., 	opinion surveys)</li>
<li>Price/offer testing</li>
<li>Text mining 	in general</li>
<li>Medical 	records</li>
</ul>
<p style="margin-bottom: 0in;">Today I&#8217;d add (among others):</p>
<ul>
<li>RFID</li>
<li>The raw 	output from medical test devices</li>
<li>Sensors up and down the energy supply chain</li>
</ul>
<p style="margin-bottom: 0in; font-style: normal;">But some of those older, low-data-volume ideas still head my list of low-hanging analytic fruit.</p>
<p style="margin-bottom: 0in; font-style: normal;"><span>One more complication – these buckets I&#8217;m outlining are less than precise. For example:</span></p>
<ul>
<li><span>Telecom 	CDRs (Call Detail Records) are machine-generated from a seed of 	human activity. They have long been stored, but now are being kept 	in much more detail. This is why telecommunications is one of the 	top markets for data warehouse technology.</span></li>
<li><span>Stock 	trade data used to be based on human decisions. Now most of it is 	just machines buying and selling from each other. Either way, 	increasingly many investment institutions want to keep 	100-terabyte-scale databases of complete historical trade detail. 	And that is why financial services is another huge market for data 	warehouse technology.</span></li>
<li><span>Not long ago, web and network event logs didn&#8217;t even exist, or were tiny where they did. Now they fill the largest known commercial databases, at firms such as </span><span><a href="http://www.dbms2.com/2009/10/01/yahoos-decapetabyte-data-warehousinghadoop/" >Yahoo</a>, <a href="http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/" >eBay</a>, and <a href="http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/" >Facebook</a>.</span><span> Even so, more is thrown away than kept, especially on the network event side, which is a multiple of the size of the pure clickstream data.</span></li>
<li><span>We 	don&#8217;t know exactly what all data intelligence agencies collect from 	telemetry, from monitoring commercial telecommunication traffic, and 	so on. But they&#8217;re surely throwing the vast majority away, even as 	the small part they keep is </span><span><a href="http://www.dbms2.com/2009/09/30/facts-and-rumors/" >petabyte-scale</a>.</span></li>
</ul>
<p style="margin-bottom: 0in; font-style: normal;">But none of that interferes with my main points, which are:</p>
<ul>
<li><strong>Databases 	will continue to grow very quickly.</strong></li>
<li>One big driver 	is <strong>the increasing detail in which data is kept online.</strong></li>
<li>An even bigger 	driver will be <strong>the unending ability of machines to generate ever 	greater streams of at-least-somewhat interesting data.</strong></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/12/07/data-warehouse-volume-growth/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Technical introduction to Splunk</title>
		<link>http://www.dbms2.com/2009/10/18/technical-introduction-to-splunk/</link>
		<comments>http://www.dbms2.com/2009/10/18/technical-introduction-to-splunk/#comments</comments>
		<pubDate>Sun, 18 Oct 2009 16:01:06 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Structured documents]]></category>
		<category><![CDATA[Text]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1124</guid>
		<description><![CDATA[As noted in my other introductory post, Splunk sells software called Splunk, which is used for log analysis. These can be logs of various kinds, but for the purpose of understanding Splunk technology, it&#8217;s probably OK to assume they&#8217;re clickstream/network event logs. In addition, Splunk seems to have some aspirations of having its software used [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">As noted in <a href="http://www.dbms2.com/2009/10/18/general-introduction-to-splunk/" >my other introductory post</a>, Splunk sells software called Splunk, which is used for log analysis. These can be logs of various kinds, but for the purpose of understanding Splunk technology, it&#8217;s probably OK to assume they&#8217;re clickstream/network event logs. In addition, Splunk seems to have some aspirations of having its software used for general schema-free analytics, but that&#8217;s in early days at best.</p>
<p style="margin-bottom: 0in;">Splunk&#8217;s core technology indexes text and XML files or streams, especially log files. Technical highlights of that part include:<span id="more-1124"></span></p>
<ul>
<li>Splunk software both reads logs 	and indexes them. The same code runs both on the nodes that do the 	indexing and on machines that simply emit logs. However, in the 	latter case indexing is turned off. Thus, Splunk does not portray 	its software as “agentless.” However, it asserts that its 	agent-like software runs without “material” overhead.</li>
<li>The fundamental thing that Splunk 	looks at is an increment to a log – i.e., whatever has been added 	to the log since Splunk last looked at it.</li>
<li>Splunk tries to figure out what 	the individual entries are in a section of log it looks at.  In 	particular:
<ul>
<li>Time stamps are a big clue in this 	“inferencing” process, but they are not the be-all and end-all.</li>
<li>Nor are line boundaries, if logs 	are naturally broken up into lines. (Splunk threw that latter 	comment in as a shot at SenSage.)</li>
</ul>
</li>
<li>I get the impression that most 	Splunk entity extraction is done at search time, not at indexing 	time. Splunk says that, if a &lt;name, value&gt; pair is clearly 	marked, its software does a good job of recognizing same. Beyond 	that, fields seem to be specified by users when they define 	searches.</li>
<li>Splunk has a simple ILM (Information Lifecycle Management) story based on time. I didn&#8217;t probe for details.</li>
</ul>
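<p style="margin-bottom: 0in;">To make the “increment to a log” and timestamp-based entry detection ideas concrete, here is a minimal sketch. It is not Splunk&#8217;s code: the timestamp pattern, the function names, and the sample log lines are all assumptions made for illustration. It reads only what was appended since the last look, and it splits entries at timestamps rather than at line boundaries.</p>

```python
import io
import re

# Hypothetical timestamp format (e.g. "2009-10-18 16:01:06"); real logs vary.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}", re.MULTILINE)


def read_increment(log, offset):
    """Return (new_text, new_offset) for whatever was appended after offset."""
    log.seek(offset)
    new_text = log.read()
    return new_text, log.tell()


def split_entries(text):
    """Split text into entries that each start at a timestamp.

    Lines without a leading timestamp (such as stack-trace lines) stay
    attached to the entry before them. Text before the first timestamp
    is kept as its own leading entry.
    """
    starts = [m.start() for m in TIMESTAMP.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    bounds = starts + [len(text)]
    entries = [text[a:b].rstrip("\n") for a, b in zip(bounds, bounds[1:])]
    return [e for e in entries if e.strip()]


if __name__ == "__main__":
    log = io.StringIO()
    log.write("2009-10-18 16:01:06 login ok user=alice\n")
    text, offset = read_increment(log, 0)
    print(split_entries(text))

    # Later, more data arrives; only the increment is read.
    log.write("2009-10-18 16:01:09 error in handler\n  at line 42\n")
    text, offset = read_increment(log, offset)
    print(split_entries(text))
```

<p style="margin-bottom: 0in;">The second increment yields a single two-line entry, which is the point of not treating line boundaries as entry boundaries.</p>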
<p style="margin-bottom: 0in;">Given its text search engine, Splunk does – well, it does text searches. And it stores searches, so they can be used for alerting or reporting. Indeed, Splunk persists and presumably updates results to stored searches, in a rough analog to materialized views.</p>
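<p style="margin-bottom: 0in;">The stored-search idea can be sketched as follows. This is an illustration under stated assumptions, not a description of Splunk&#8217;s internals: the class name, the substring matching, and the alert threshold are all hypothetical. Results are persisted and extended as new entries arrive, rather than recomputed from scratch, which is the loose analogy to a materialized view.</p>

```python
class StoredSearch:
    """A saved text search whose results are kept and updated incrementally."""

    def __init__(self, term, alert_threshold=None):
        self.term = term.lower()
        self.alert_threshold = alert_threshold  # hypothetical alerting rule
        self.results = []  # persisted matches, loosely like a materialized view

    def update(self, new_entries):
        """Check only the new entries; return True if an alert should fire."""
        hits = [e for e in new_entries if self.term in e.lower()]
        self.results.extend(hits)
        if self.alert_threshold is None:
            return False
        return len(hits) >= self.alert_threshold


if __name__ == "__main__":
    search = StoredSearch("error", alert_threshold=2)
    print(search.update(["login ok", "Error: disk full"]))
    print(search.update(["error: timeout", "ERROR: retry failed"]))
    print(search.results)
```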
<p style="margin-bottom: 0in;">Apparently, Splunk&#8217;s indexing is typically done via MapReduce jobs. I don&#8217;t know whether any actual Splunk searches are also done via MapReduce; surely they aren&#8217;t all, given the discussion of a near-real-time alerting engine and so on. Splunk fondly believes its MapReduce is an order of magnitude faster than SQL (I didn&#8217;t ask which SQL engines Splunk has in mind when they say this), and 5-10X faster than Hadoop. One efficiency trick is to look ahead and do Reduces in place where possible. This seems to be done automatically in the execution plan, à la Aster&#8217;s SQL-MapReduce, rather than having to be hand-coded. Splunk says its software can “easily” index 100-200 gigabytes of data per day on a commodity 8-core server, while maintaining an active search load, and that 300-400 gigabytes are doable.</p>
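<p style="margin-bottom: 0in;">As a rough illustration of doing Reduces in place, the sketch below builds a tiny inverted index in which each map step pre-aggregates its own output before anything reaches the final reduce. This shows the general map-side combining idea only; it makes no claim about how Splunk&#8217;s planner actually works, and the function names and sample events are hypothetical.</p>

```python
from collections import defaultdict


def map_with_local_reduce(doc_id, text):
    """Map step that reduces in place: one (term, doc_id, count) per term,
    rather than one record per word occurrence."""
    counts = defaultdict(int)
    for word in text.lower().split():
        counts[word] += 1
    return [(term, doc_id, n) for term, n in counts.items()]


def reduce_to_index(mapped):
    """Final reduce: merge the partial results into term to {doc_id: count}."""
    index = defaultdict(dict)
    for term, doc_id, n in mapped:
        index[term][doc_id] = index[term].get(doc_id, 0) + n
    return dict(index)


if __name__ == "__main__":
    # Made-up log events keyed by a made-up event id.
    events = {1: "GET /home 200 GET", 2: "GET /cart 500"}
    mapped = []
    for doc_id, text in events.items():
        mapped.extend(map_with_local_reduce(doc_id, text))
    index = reduce_to_index(mapped)
    print(index["get"])
```

<p style="margin-bottom: 0in;">The benefit of the local reduce is that fewer intermediate records have to be moved between steps; here event 1 emits one record for “get” instead of two.</p>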
<p style="margin-bottom: 0in;">Splunk&#8217;s capabilities right now in tabular-style analytics seem to be limited to a command-line report builder, plus a GUI wizard that generates the command line. A few users have asked for support of third-party business intelligence tools, but Splunk hasn&#8217;t provided that yet. Nor can I find much evidence of ODBC/JDBC drivers for Splunk. But then, I have trouble understanding how Splunk could provide flexible and robust reporting unless it tokenized and indexed specific fields more aggressively than I think it now does.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/18/technical-introduction-to-splunk/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>General introduction to Splunk</title>
		<link>http://www.dbms2.com/2009/10/18/general-introduction-to-splunk/</link>
		<comments>http://www.dbms2.com/2009/10/18/general-introduction-to-splunk/#comments</comments>
		<pubDate>Sun, 18 Oct 2009 15:59:56 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Fox and MySpace]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Text]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1119</guid>
		<description><![CDATA[I dropped by log analysis software vendor Splunk a few weeks ago for a chat with Marketing VP Steve Sommer (whom some of you may know from Cognos and/or Informix), Product Management VP Christina Noren, and above all co-founder/CTO Erik Swan. Splunk turns out to be a pretty interesting company, from both business and technical standpoints. [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">I dropped by log analysis software vendor Splunk a few weeks ago for a chat with Marketing VP Steve Sommer (whom some of you may know from Cognos and/or Informix), Product Management VP Christina Noren, and above all co-founder/CTO Erik Swan. Splunk turns out to be a pretty interesting company, from both business and technical standpoints. For one thing, Splunk seems highly regarded by most people I mention it to.</p>
<p style="margin-bottom: 0in;">Splunk&#8217;s technical stories include:</p>
<ul>
<li>Text search over log files.</li>
<li>Business intelligence over text 	search. (That part sounds a lot like <a href="http://www.texttechnologies.com/2007/12/12/attivio-tries-to-do-it-all/" onclick="javascript:pageTracker._trackPageview('/www.texttechnologies.com');">Attivio</a>.)</li>
<li>MapReduce with schema flexibility 	and smart multi-stage execution plans. (That part sounds a lot like 	Aster Data.)</li>
</ul>
<p style="margin-bottom: 0in;">More on those in <a href="http://www.dbms2.com/2009/10/18/technical-introduction-to-splunk/" >a separate post</a>.</p>
<p style="margin-bottom: 0in;">Less technical Splunk highlights include:<span id="more-1119"></span></p>
<ul>
<li>Splunk has ~1200 paying customers, 	and is adding a couple hundred more per quarter.</li>
<li>Splunk has ~160 people.</li>
<li>~80% of Splunk sales are in North 	America.</li>
<li>Typical Splunk sales prices are in 	the $10-50K range, with an average around $25K, or maybe that 	average is a bit over $30K. Some Splunk deals are six- or even 	seven-figure.</li>
<li>Splunk is “quite profitable.”</li>
<li>Splunk&#8217;s eponymous product is 	priced according to how much data is indexed per day. If you index 	half a gigabyte of logs per day or less, Splunk is completely free. 	So, while Splunk is closed-source, there&#8217;s something of an 	open-source-like Splunk adoption model.</li>
<li>Splunk has been selling product 	for a couple of years. I gather Splunk 4 was recently released.</li>
<li>Splunk&#8217;s biggest industry segments 	are, not too surprisingly,
<ul>
<li>Telco</li>
<li>Financial services</li>
<li>Government</li>
<li>“Online”</li>
</ul>
</li>
<li>Splunk&#8217;s paying customers seem to 	use it mainly for:
<ul>
<li>Web logs and associated network 	event logs (this seems to be the biggest area)</li>
<li>Security and perhaps other general 	IT log analysis</li>
<li>Physical security logs (mainly in 	the government)</li>
<li>Anti-fraud (I&#8217;m not sure how that 	works)</li>
</ul>
</li>
<li>One would think Splunk would be 	used to manage a lot of intelligence telemetry, but that wasn&#8217;t 	particularly hinted at.</li>
<li>In general, the core problem 	Splunk is used for is log analysis for trouble-shooting purposes.</li>
<li>Splunk&#8217;s nonpaying users are more 	diverse; examples mentioned included windmill operations and protein 	research.</li>
<li>Splunk&#8217;s customers include Aster 	Data flagship accounts MySpace and LinkedIn. I bet many other top 	web companies are Splunk customers as well.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/18/general-introduction-to-splunk/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>How 30+ enterprises are using Hadoop</title>
		<link>http://www.dbms2.com/2009/10/10/enterprises-using-hadoo/</link>
		<comments>http://www.dbms2.com/2009/10/10/enterprises-using-hadoo/#comments</comments>
		<pubDate>Sat, 10 Oct 2009 10:19:29 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Application areas]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Data types]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Structured documents]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Text]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1073</guid>
		<description><![CDATA[MapReduce is definitely gaining traction, especially but by no means only in the form of Hadoop. In the aftermath of Hadoop World, Jeff Hammerbacher of Cloudera walked me quickly through 25 customers he pulled from Cloudera&#8217;s files. Facts and metrics ranged widely, of course:

Some are in heavy production with 	Hadoop, and closely engaged with Cloudera. [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">MapReduce is definitely gaining traction, especially but by no means only in the form of Hadoop. In the aftermath of <a href="http://www.dbms2.com/2009/10/01/mapreduce-tidbits/" >Hadoop World</a>, Jeff Hammerbacher of Cloudera walked me quickly through 25 customers he pulled from Cloudera&#8217;s files. Facts and metrics ranged widely, of course:</p>
<ul>
<li>Some are in heavy production with 	Hadoop, and closely engaged with Cloudera. Others are active Hadoop 	users but are very secretive. Yet others signed up for initial 	Hadoop training last week.</li>
<li>Some have Hadoop clusters in the 	thousands of nodes. Many have Hadoop clusters in the 50-100 node 	range. Others are just prototyping Hadoop use. And one seems to be 	&#8220;OEMing&#8221; a small Hadoop cluster in each piece of equipment 	sold.</li>
<li>Many export data from Hadoop to a 	relational DBMS; many others just leave it in HDFS (Hadoop 	Distributed File System), e.g. with <a href="http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/" >Hive</a> as the query 	language, or in exactly one case Jaql.</li>
<li>Some are household names, in web 	businesses or otherwise. Others seem to be pretty obscure.</li>
<li>Industries include financial 	services, telecom (Asia only, and quite new), bioinformatics (and 	other research), intelligence, and lots of web and/or 	advertising/media.</li>
<li>Application areas mentioned &#8212; and 	these overlap in some cases &#8212; include:
<ul>
<li>Log and/or clickstream analysis of 	various kinds</li>
<li>Marketing analytics</li>
<li>Machine learning and/or 	sophisticated data mining</li>
<li>Image processing</li>
<li>Processing of XML messages</li>
<li>Web crawling and/or text 	processing</li>
<li>General archiving, including of 	relational/tabular data, e.g. for compliance</li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;"><span id="more-1073"></span>We went over this list so quickly that we didn&#8217;t go into much detail on any one user. But one example that stood out was of an ad serving firm that had an &#8220;aggregation pipeline&#8221; consisting of 70-80 MapReduce jobs.</p>
<p style="margin-bottom: 0in;">I also talked again yesterday with Omer Trajman of Vertica, who surprised me by indicating a high single-digit number of Vertica&#8217;s customers were in production with Hadoop &#8212; i.e., over 10% of Vertica&#8217;s production customers. (Vertica recently made its 100th sale, and of course not all those buyers are in production yet.) <a href="http://www.dbms2.com/2009/08/04/verticas-version-of-mapreduce-integration/" >Vertica/Hadoop</a> usage seems to have started in Vertica&#8217;s financial services stronghold &#8212; specifically in financial trading &#8212; with web analytics and the like coming on afterwards. Based on current prototyping efforts, Omer expects bioinformatics to be the third production market for Vertica/Hadoop, with telecommunications coming in fourth.</p>
<p style="margin-bottom: 0in;">Unsurprisingly, the general Vertica/Hadoop usage model seems to be:</p>
<ul>
<li>Do something to the data in Hadoop</li>
<li>Dump it into Vertica to be queried</li>
</ul>
<p style="margin-bottom: 0in;">What I did find surprising is that the data often isn&#8217;t reduced by this analysis, but rather exploded in size.  E.g., a complete store of mortgage trading data might be a few terabytes in size, but Hadoop-based post processing can increase that by 1 or 2 orders of magnitude. (Analogies to the importance and magnitude of <em>&#8220;cooked&#8221; data</em> in scientific data processing come to mind.)</p>
<p style="margin-bottom: 0in;">And finally, I talked to Aster a few days ago about the usage of its nCluster/Hadoop connector. Aster characterized Aster/Hadoop users&#8217; Hadoop usage as being of the batch/ETL variety, which is the classic use case one concedes to Hadoop even if one believes that MapReduce should commonly be done right in the DBMS.</p>
<p style="margin-bottom: 0in;"><em><strong>Related link</strong></em></p>
<ul>
<li><a href="../2008/08/26/known-applications-of-mapreduce/">An 	August, 2008 round-up of MapReduce applications</a>.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/10/enterprises-using-hadoo/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>IBM&#8217;s Oracle emulation strategy reconsidered</title>
		<link>http://www.dbms2.com/2009/04/24/ibms-oracle-emulation-strategy-reconsidered/</link>
		<comments>http://www.dbms2.com/2009/04/24/ibms-oracle-emulation-strategy-reconsidered/#comments</comments>
		<pubDate>Sat, 25 Apr 2009 02:10:58 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data types]]></category>
		<category><![CDATA[Emulation, transparency, portability]]></category>
		<category><![CDATA[EnterpriseDB and Postgres Plus]]></category>
		<category><![CDATA[GIS and geospatial]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[Market share]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Pricing]]></category>
		<category><![CDATA[Structured documents]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=764</guid>
		<description><![CDATA[I&#8217;ve now had a chance to talk with IBM about its recently-announced Oracle emulation strategy for DB2. (This is for DB2 9.7, which I gather has been quasi-announced in April, will be re-announced in May, and will be re-re-announced as being in general availability in June.)
Key points include:

This really is more like Oracle 	emulation than [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">I&#8217;ve now had a chance to talk with IBM about its recently-announced Oracle emulation strategy for DB2. (This is for DB2 9.7, which I gather has been quasi-announced in April, will be re-announced in May, and will be re-re-announced as being in general availability in June.)</p>
<p style="margin-bottom: 0in;">Key points include:</p>
<ul>
<li>This really is more like Oracle 	<em><strong>emulation</strong></em> than it is <em>transparency,</em> a term I 	<a href="../2009/04/22/dbms-transparency-layers-never-seem-to-sell-well/">carelessly 	used</a> before.</li>
<li>IBM&#8217;s Oracle emulation effort is 	focused on two technological goals:
<ul>
<li>Making it easy for <strong>an Oracle 	application to be ported</strong> to DB2.</li>
<li>Making it easy for <strong>an Oracle 	developer to develop</strong> for DB2.</li>
</ul>
</li>
<li>The initial target market for 	DB2&#8217;s Oracle emulation is <strong>ISVs</strong> (Independent Software Vendors) 	much more than it is enterprises. IBM suggested there were a couple 	hundred early adopters, and those are primarily in the ISV area.</li>
</ul>
<p style="margin-bottom: 0in;">Because of Oracle&#8217;s market share, many ISVs focus on Oracle as the underlying database management system for their applications, whether or not they actually resell it along with their own software.  IBM proposed three reasons why such ISVs might want to support DB2:<span id="more-764"></span></p>
<ul>
<li><strong>Oracle is expensive.</strong> In 	particular, IBM suggested it is more flexible on licensing terms for 	resale than Oracle is.  I find that easy to believe.</li>
<li>Hey, there&#8217;s a <strong>DB2 market or 	installed base</strong> out there of some size &#8212; why not address it?</li>
<li>Acquisition-fueled expansion in 	applications<strong> makes Oracle a much bigger competitor to many ISVs </strong>(all around the world) than it used to be before.  That one makes 	all kinds of sense.</li>
</ul>
<p style="margin-bottom: 0in;">And by the way &#8212; if I wanted an Oracle-emulating DBMS, I&#8217;d feel a lot happier about doing business with IBM than I would with EnterpriseDB.</p>
<p style="margin-bottom: 0in;">IBM feels that DB2&#8217;s Oracle compatibility is a strict superset of <a href="../2008/07/07/enterprisedbf-oracle-compatibility/">EnterpriseDB&#8217;s</a>, which it presumably has carried over more or less in its entirety.  I didn&#8217;t press too hard for examples of what Oracle emulation DB2 offers and EnterpriseDB doesn&#8217;t, but IBM did say something about support for more programming languages.  IBM was clear on one broad area where DB2 does not offer Oracle emulation, which is the specifics of various kinds of datatype support or other specialized data access methods.  For example, IBM has its own syntax for querying text, geospatial, or XML data, and has not added support for Oracle&#8217;s alternative approaches.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/04/24/ibms-oracle-emulation-strategy-reconsidered/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>MarkLogic architecture deep dive</title>
		<link>http://www.dbms2.com/2008/10/05/marklogic-architecture-deep-dive/</link>
		<comments>http://www.dbms2.com/2008/10/05/marklogic-architecture-deep-dive/#comments</comments>
		<pubDate>Sun, 05 Oct 2008 11:19:50 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Mark Logic]]></category>
		<category><![CDATA[Structured documents]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=591</guid>
		<description><![CDATA[While I previously posted in great detail about how MarkLogic Server is an ACID-compliant XML-oriented DBMS with integrated text search that indexes everything in real time and executes range queries fairly quickly, I didn&#8217;t have a good feel for how all those apparently contradictory characteristics fit into a single product. But I finally had a [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">While I previously posted in great detail about how MarkLogic Server is <a href="../2008/04/29/the-mark-logic-story-in-xml-database-management/">an ACID-compliant XML-oriented DBMS with integrated text search that indexes everything in real time and executes range queries fairly quickly</a>, I didn&#8217;t have a good feel for how all those apparently contradictory characteristics fit into a single product. But I finally had a call with Mark Logic Director of Engineering Ron Avnur, and think I have a better grasp of the MarkLogic architecture and story.</p>
<p style="margin-bottom: 0in;">Ron described MarkLogic Server as <strong>a DBMS for trees.</strong> <span id="more-591"></span>That is, MarkLogic is designed for an XML data model that&#8217;s all about nodes and relationships, but not necessarily for XML data <em>per se. </em><span style="font-style: normal;">The fundamental paradigm is thus searches for nodes and/or trees, for example:</span></p>
<ul>
<li>Find all the trees or nodes 	anywhere in the system that have certain text in them.</li>
<li>Find all the nodes anywhere in the 	system that have certain text in their children or immediate 	descendants.</li>
<li>Find all trees that have a pair of 	nodes, of a certain kind, in a certain relationship to each other.</li>
</ul>
<p style="margin-bottom: 0in;">Also important are aggregates and perhaps also &#8212; although Mark Logic rarely mentions them unless prompted &#8212; range queries.</p>
<p style="margin-bottom: 0in;">So let&#8217;s start with some basics about Mark Logic Server&#8217;s indexing and storage strategy:</p>
<ul>
<li>As in a 	conventional text indexing system, <strong>most MarkLogic indexes are 	just term lists.</strong> A term is a token, which usually is just a 	word.</li>
<li>MarkLogic can 	actually have <strong>two term lists for each token – one for the XML 	tags </strong><span>(i.e., attribute values), 	and </span><strong>one for the plain text content.</strong> (Most tokens, of 	course, never actually occur in a tag.)</li>
<li><strong>MarkLogic also indexes node names</strong> – i.e., attribute name, or type of node – <strong>and node relationships.</strong></li>
<li> “Document” has its usual meaning in MarkLogic Server.  (This is worth saying because XQuery books sometimes assume that everything in an XML database is concatenated into a single cosmic document.)</li>
<li> Documents can be broken up into fragments, with metadata as to how 	they fit together.  The smallest unit MarkLogic retrieves is such a 	fragment (or of course an entire document).</li>
<li> <span style="font-style: normal;">Largely because of how tag 	information is tokenized and stored, </span><span style="font-style: normal;"><strong>MarkLogic 	doesn&#8217;t store documents as serialized XML.</strong></span><span style="font-style: normal;"> Mark Logic says this eliminates a great deal of XML&#8217;s traditional 	bloat.</span></li>
<li> <span style="font-style: normal;">If I understood correctly, Mark 	Logic said that a single document – perhaps one that&#8217;s 100K long 	as serialized XML – could be referenced by 1000s or 10s of 1000s 	of term lists.  Frankly, that looks a little high to me, so it&#8217;s 	possible Mark Logic was talking about using that number of term 	lists for a medium- or large-sized collection.  (Mark Logic says 	that MarkLogic has been used for databases up to the 100s of 	terabytes, and for civilian databases in the 10s of terabytes 	range.)</span></li>
<li> MarkLogic&#8217;s term list indexes, I&#8217;m pretty sure, store not only where a term occurs but also at what position in a document (or section) it appears.</li>
<li> MarkLogic indexes also record the position of each node in a tree.</li>
<li><span style="font-style: normal;">Term 	lists are all ordered “the same way.”  Therefore, Mark Logic 	claims, </span><span style="font-style: normal;"><strong>search time 	scales linearly with the number of documents.</strong></span></li>
</ul>
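<p style="margin-bottom: 0in;">The term-list idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Mark Logic&#8217;s code: the names <code>build_term_lists</code> and <code>docs_with_all</code> are hypothetical, tokens are just whitespace-separated words, and node and tag indexing is ignored. What it shows is why keeping every list in the same order matters: a multi-term search becomes a single merge-style pass over each list.</p>

```python
from collections import defaultdict

def build_term_lists(docs):
    """Map each token to a list of (doc_id, position) postings.

    docs is a dict of doc_id -> text. Doc ids are visited in sorted order,
    so every term list ends up ordered the same way.
    """
    term_lists = defaultdict(list)
    for doc_id in sorted(docs):
        for position, token in enumerate(docs[doc_id].lower().split()):
            term_lists[token].append((doc_id, position))
    return term_lists

def docs_with_all(term_lists, *tokens):
    """Return the sorted doc ids containing every token, via a merge-style walk."""
    # Collapse each term list to its ordered, de-duplicated doc ids.
    id_lists = []
    for token in tokens:
        ids = []
        for doc_id, _ in term_lists.get(token, []):
            if not ids or ids[-1] != doc_id:
                ids.append(doc_id)
        id_lists.append(ids)
    if not id_lists:
        return []
    result = id_lists[0]
    for other in id_lists[1:]:
        merged, i, j = [], 0, 0
        # Both lists are sorted, so one pass over each is enough.
        while i < len(result) and j < len(other):
            if result[i] == other[j]:
                merged.append(result[i])
                i += 1
                j += 1
            elif result[i] < other[j]:
                i += 1
            else:
                j += 1
        result = merged
    return result
```

<p style="margin-bottom: 0in;">Because the postings also carry positions, the same structure could in principle be extended to phrase or proximity matching, which this sketch leaves out.</p>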
<p style="margin-bottom: 0in;">In addition to those indexes, which comprise what Mark Logic calls the “Universal Index,” there are <strong>scalar indexes.</strong> These are <strong>columnar, </strong><span>where a column can cover either a single element name (i.e., attribute name) or a set of (presumably related) element names.  Two</span> copies of each column are kept, one sorted by TreeID and one by value.  Mark Logic believes these suffice to give good performance on aggregations and range lookups.</p>
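<p style="margin-bottom: 0in;">A rough Python sketch of why two sort orders help follows. It is an assumption-laden toy, not the MarkLogic implementation: the class <code>ScalarIndex</code> and its methods are hypothetical, and the column holds plain numbers. The value-sorted copy answers range lookups by binary search, while the copy sorted by tree ID supports aggregating over a chosen set of trees.</p>

```python
import bisect

class ScalarIndex:
    """Hypothetical column index kept in two sort orders."""

    def __init__(self, pairs):
        # pairs: iterable of (tree_id, value)
        pairs = list(pairs)
        self.by_tree = sorted(pairs)                      # ordered by tree id
        self.by_value = sorted((v, t) for t, v in pairs)  # ordered by value
        self._values = [v for v, _ in self.by_value]

    def range_lookup(self, low, high):
        """Tree ids whose value lies in [low, high], found by binary search."""
        start = bisect.bisect_left(self._values, low)
        end = bisect.bisect_right(self._values, high)
        return sorted(t for _, t in self.by_value[start:end])

    def total_for_trees(self, tree_ids):
        """Sum the values for a set of tree ids, scanning in tree-id order."""
        wanted = set(tree_ids)
        return sum(v for t, v in self.by_tree if t in wanted)
```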
<p style="margin-bottom: 0in;"><span>With all those columns and term lists, the question naturally arises:  What about</span> the MarkLogic update strategy? Highlights include:</p>
<ul>
<li>MarkLogic 	uses <strong>MVCC</strong> (MultiVersion Concurrency Control). Updates are 	never done in place; rather, a new version of a record is appended.</li>
<li><strong>Documents 	are divided into subsets, each with its own set of indexes.</strong> Each subset that is being actively updated is small. This helps with 	update performance.  It&#8217;s actually more important for the scalar 	columns than for the term lists.  You update a term list by adding 	values to the end, so length hardly matters.  But in a column index 	you insert new values into the middle, so length is indeed relevant.</li>
<li>When document 	subsets get to be a certain size, they&#8217;re merged into yet larger 	subsets, which no longer have new documents added to them.  This is 	when MVCC cruft is cleaned up as well. However, you can specify a 	minimum time that old versions are guaranteed to survive.</li>
<li>This 	architecture gives an obvious form of <strong>partitioning,</strong> letting 	you parallelize in an obvious way. This partitioning is random by 	default, although there&#8217;s an API that lets you programmatically make 	it nonrandom.</li>
</ul>
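<p style="margin-bottom: 0in;">The append-then-merge pattern described above can be illustrated with a small Python toy. Everything here is an assumption made for illustration: <code>VersionedStore</code>, its logical clock, and the <code>keep_after</code> parameter (standing in for the guaranteed survival time of old versions) are hypothetical and say nothing about MarkLogic&#8217;s on-disk format.</p>

```python
class VersionedStore:
    """Toy append-only store: updates add versions, merges drop stale ones."""

    def __init__(self):
        self.clock = 0
        self.versions = []  # (timestamp, doc_id, content), in append order

    def write(self, doc_id, content):
        """Append a new version; nothing is ever updated in place."""
        self.clock += 1
        self.versions.append((self.clock, doc_id, content))
        return self.clock

    def read(self, doc_id, as_of=None):
        """Newest content for doc_id at or before timestamp as_of."""
        as_of = self.clock if as_of is None else as_of
        for ts, d, content in reversed(self.versions):
            if d == doc_id and ts <= as_of:
                return content
        return None

    def merge(self, keep_after=None):
        """Drop superseded versions, keeping any newer than keep_after."""
        keep_after = self.clock if keep_after is None else keep_after
        latest = {}
        for ts, d, _ in self.versions:
            latest[d] = ts
        self.versions = [
            v for v in self.versions
            if v[0] == latest[v[1]] or v[0] > keep_after
        ]
```

<p style="margin-bottom: 0in;">Readers asking for an older timestamp still see a consistent snapshot until a merge discards the version they depend on, which is why a minimum survival time is worth configuring.</p>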
<p style="margin-bottom: 0in;"><span>Why not some kind of intelligent horizontal partitioning, whether range-based or otherwise?</span><strong> </strong><span>Well, we&#8217;ve finally gotten to a MarkLogic weakness. </span><strong> Join performance</strong> is not a MarkLogic long suit or Mark Logic priority.  Indeed, Mark Logic insists that XML data is inherently denormalized, with joins (complex or otherwise) therefore rarely arising.</p>
<p style="margin-bottom: 0in;">For many of today&#8217;s use cases, that&#8217;s probably true.  For example, when I heard this I quickly started thinking “What if a book publisher changes name? That information is in a <em>lot </em>of individual book records.” But the fact is, people want author/publisher information that&#8217;s accurate as of the time of release, not updated for subsequent publishing company mergers and the like.</p>
<p style="margin-bottom: 0in;">For other use cases, however, joins may be more important.  For example:</p>
<ul>
<li>Derivatives 	contracts need to survive changes in brokerage firm corporate 	control.  (IBM reports that <em>derivatives are an important XML use 	case today</em>.)</li>
<li>Medical 	records need to be connected with a patient&#8217;s most current contact 	information.  (IBM reports that XML is making inroads into health 	care applications.)</li>
<li>Consumer 	profiling done right would benefit greatly from XML&#8217;s <em>schema 	flexibility</em>, but would also require significant use of joins.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2008/10/05/marklogic-architecture-deep-dive/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Known applications of MapReduce</title>
		<link>http://www.dbms2.com/2008/08/26/known-applications-of-mapreduce/</link>
		<comments>http://www.dbms2.com/2008/08/26/known-applications-of-mapreduce/#comments</comments>
		<pubDate>Tue, 26 Aug 2008 04:54:17 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=500</guid>
		<description><![CDATA[Most of the actual MapReduce applications I&#8217;ve heard of fall into a few areas:

Text tokenization, indexing, and 	search
Creation of other kinds of data 	structures (e.g., graphs)
Data mining and machine learning

That covers all MapReduce apps I recall hearing about via commercial companies and users, and also includes most of what&#8217;s in the two big sources I [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Most of the actual MapReduce applications I&#8217;ve heard of fall into a few areas:</p>
<ul>
<li><strong>Text tokenization, indexing, and 	search</strong></li>
<li><strong>Creation of other kinds of data 	structures (e.g., graphs)</strong></li>
<li><strong>Data mining and machine learning</strong></li>
</ul>
<p style="margin-bottom: 0in;">That covers all MapReduce apps I recall hearing about via commercial companies and users, and also includes most of what&#8217;s in the two big sources I found online.  <span id="more-500"></span>To wit:</p>
<p style="margin-bottom: 0in;">1.  In a <a href="http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0005.html" onclick="javascript:pageTracker._trackPageview('/labs.google.com');">slide presentation</a>, Google offers the following applications of MapReduce:</p>
<ul>
<li>distributed grep</li>
<li>distributed sort</li>
<li>web link-graph reversal</li>
<li>term-vector per host</li>
<li>web access log stats</li>
<li>inverted index construction</li>
<li>document clustering</li>
<li>machine learning</li>
<li>statistical machine translation</li>
</ul>
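<p style="margin-bottom: 0in;">To illustrate one item on that list, inverted index construction, here is a single-process Python sketch of the map, shuffle, and reduce steps. It is only a model of the programming pattern: the function names are hypothetical, there is no distribution or fault tolerance, and a real framework would perform the shuffle itself.</p>

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Emit one (word, doc_id) pair per distinct word in a document."""
    for word in set(text.lower().split()):
        yield word, doc_id

def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, doc_ids):
    """Turn the grouped doc ids into a sorted posting list."""
    return word, sorted(doc_ids)

def inverted_index(docs):
    """Run all three steps over a dict of doc_id -> text."""
    pairs = (
        pair
        for doc_id, text in docs.items()
        for pair in map_phase(doc_id, text)
    )
    return dict(reduce_phase(w, ids) for w, ids in shuffle(pairs).items())
```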
<p style="margin-bottom: 0in;">2.  The <a href="http://wiki.apache.org/hadoop/PoweredBy" onclick="javascript:pageTracker._trackPageview('/wiki.apache.org');">Hadoop applications page</a> offers a rich trove of applications.  Excerpts include:</p>
<ul>
<li>Aggregate, store, and analyze data related to in-stream 	viewing behavior of Internet video audiences.</li>
<li>Analytics</li>
<li>Analyze and index textual information</li>
<li>Analyzing similarities of user&#8217;s behavior.</li>
<li>Build scalable machine learning algorithms like canopy 	clustering, k-means and many more to come (naive bayes classifiers, 	others)</li>
<li>Charts calculation and web log analysis</li>
<li>Crawl Blog posts and later process them.</li>
<li>Crawling, processing, serving and log analysis</li>
<li>Data mining and blog crawling</li>
<li>Facial similarity and recognition across large datasets.</li>
<li>Filter and index our listings, removing exact duplicates and 	grouping similar ones.</li>
<li>Filtering and indexing listing, processing log analysis, and 	for recommendation data.</li>
<li>Flexible web search engine software</li>
<li>Gathering world wide DNS data in order to discover content 	distribution networks and configuration issues</li>
<li>Generating web graphs</li>
<li>Image based video copyright protection.</li>
<li>Image content based advertising and auto-tagging for social 	media.</li>
<li>Image processing environment for image-based product 	recommendation system</li>
<li>Image retrieval engine</li>
<li>Large scale image conversions</li>
<li>Latent Semantic Analysis, Collaborative Filtering</li>
<li>Log analysis, data mining and machine learning</li>
<li>Natural Language Search</li>
<li>Open source social search tools.</li>
<li>Parses and indexes mail logs for search</li>
<li>Plot the entire internet</li>
<li>Process apache log, analyzing user&#8217;s action and click flow 	and the links click with any specified page in site and more.</li>
<li>Process clickstream and demographic data in order to create 	web analytic reports.</li>
<li>Process data relating to people on the web</li>
<li>Process documents from a continuous web crawl and distributed 	training of support vector machines</li>
<li>Process whole price data user input with map/reduce.</li>
<li>Produce statistics.</li>
<li>Product search indices</li>
<li>Recommender system for behavioral targeting, plus other 	clickstream analytics</li>
<li>Reduce usage data for internal metrics, for search indexing 	and for recommendation data.</li>
<li>Research for Ad Systems and Web Search</li>
<li>Retrieving and Analyzing Biomedical Knowledge</li>
<li>Run Naive Bayes classifiers in parallel over crawl data to 	discover event information</li>
<li>Search engine for chiropractic information, local 	chiropractors, products and schools</li>
<li>Serve large Lucene indexes</li>
<li>Session analysis and report generation</li>
<li>Source code search engine</li>
<li>Statistical analysis and modeling at scale.</li>
<li>Storage, log analysis, and pattern discovery/analysis.</li>
<li>Store copies of internal log and dimension data sources and 	use it as a source for reporting/analytics and machine learning.</li>
<li>Teaching and general research activities on natural language 	processing and machine learning.</li>
<li>Vertical search engine for trustworthy wine information</li>
</ul>
<p>There also were some research apps and some general processing speed-up apps I found harder to excerpt.</p>
<p><strong><em>Some of our recent links about MapReduce</em></strong></p>
<ul>
<li><a href="http://www.dbms2.com/2008/08/26/why-mapreduce-matters-to-sql-data-warehousing/" >The integration of MapReduce with SQL data warehousing</a></li>
<li><a href="http://www.dbms2.com/2008/08/26/known-applications-of-mapreduce/" >Three major applications of MapReduce</a></li>
<li><a href="http://www.dbms2.com/2008/08/26/three-approaches-to-parallelizing-data-transformation/" >Another application of MapReduce</a></li>
<li><a href="http://www.dbms2.com/2008/08/25/mapreduce-sound-bites/" >Sound bites about MapReduce</a></li>
<li><a href="http://www.dbms2.com/2008/08/25/mapreduce-links/" >Other links about MapReduce</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2008/08/26/known-applications-of-mapreduce/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>The 4 main approaches to datatype extensibility</title>
		<link>http://www.dbms2.com/2008/01/27/the-4-main-approaches-to-datatype-extensibility/</link>
		<comments>http://www.dbms2.com/2008/01/27/the-4-main-approaches-to-datatype-extensibility/#comments</comments>
		<pubDate>Sun, 27 Jan 2008 05:26:46 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data types]]></category>
		<category><![CDATA[GIS and geospatial]]></category>
		<category><![CDATA[Object]]></category>
		<category><![CDATA[Structured documents]]></category>
		<category><![CDATA[Text]]></category>
		<category><![CDATA[data type]]></category>
		<category><![CDATA[Database]]></category>
		<category><![CDATA[enterprise search]]></category>
		<category><![CDATA[geospatial]]></category>
		<category><![CDATA[OODBMS]]></category>
		<category><![CDATA[XML]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/2008/01/27/the-4-main-approaches-to-datatype-extensibility/</guid>
		<description><![CDATA[Based on a variety of conversations – including some of the flames about my recent confession that mid-range DBMS aren&#8217;t suitable for everything &#8212; it seems as if a quick primer may be in order on the subject of datatype support. So here goes.
“Database management” usually deals with numeric or alphabetical data – i.e., the [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in">Based on a variety of conversations – including some of the flames about my recent confession that <a href="http://www.dbms2.com/2008/01/24/mysql-database/" >mid-range DBMS aren&#8217;t suitable for everything</a> &#8212; it seems as if a quick primer may be in order on the subject of datatype support. So here goes.</p>
<p style="margin-bottom: 0in">“Database management” usually deals with numeric or alphabetical data – i.e., the kind of stuff that goes nicely into tables.  It commonly has a natural one-dimensional sort order, which is very useful for sort/merge joins, b-tree indexes, and the like.  This kind of tabular data is what relational database management systems were invented for.</p>
<p style="margin-bottom: 0in">But ever more, there are important datatypes beyond character strings, numbers and dates.  Leaving out generic BLOBs and CLOBs (Binary/Character Large OBjects), the big four surely are:</p>
<ul>
<li><strong>Text.</strong> Text search is a huge business on the web, and a separate big business in <a href="http://www.texttechnologies.com/2008/01/14/enterprise-search-versus-web-search/" onclick="javascript:pageTracker._trackPageview('/www.texttechnologies.com');">enterprises</a>.  And <a href="http://www.dbms2.com/2005/12/09/relational-dbms-versus-text-data/" >text doesn&#8217;t fit well into the relational paradigm</a> at all.</li>
<li><strong>Geospatial. </strong><span>Information about locations on the earth&#8217;s surface is essentially two-dimensional.  Some geospatial apps use three dimensions.</span></li>
<li><strong>Object. </strong><span> There are two main reasons for using object datatypes.  First, the data can have complex internal structures. Second, it can comprise a variety of simpler types.  Object structures are well-suited for engineering and medical applications.</span></li>
<li><strong>XML.</strong><span> A great deal of XML is, at its heart, either relational/tabular data or text documents.  Still, there are a variety of applications for which the most natural datatype truly is XML.</span></li>
</ul>
<p style="margin-bottom: 0in">Numerous other datatypes are important as well, with the top runners-up probably being images, sound, video, and time series (which, even though they&#8217;re numeric, benefit from special handling).</p>
<p style="margin-bottom: 0in">Four major ways have evolved to manage data of non-tabular datatypes, either on their own or within an essentially relational data management environment. <span id="more-338"></span></p>
<ul>
<li><strong>Utterly standalone servers.</strong><span> There are lots of search engines, geospatial engines, object-oriented database management systems, and so on.  Some may have ODBC/JDBC SQL interfaces, to handle metadata (which is commonly tabular in nature) if nothing else.  But even so, there&#8217;s little relational about them.</span></li>
<li><strong>True RDBMS extensibility. </strong><span> In the 1990s, awkwardly named </span><em><span>object-relational database management systems</span></em><span> were introduced, boasting the awkwardly named feature </span><em><span>abstract datatypes.</span></em><span> Oracle, DB2, Informix, and PostgreSQL are now of this kind.  They let one write data access methods for data that&#8217;s right in the basic relational table structure, and get at it through extensions to SQL. </span></li>
<li><strong>Tightly coupled servers. </strong><span> A close relative of RDBMS extension via new access methods is to create new servers for new datatypes, well-integrated with your RDBMS.  Your parser and optimizer are in charge of federating them; the user just writes extended-SQL statements.</span></li>
<li><strong>User-defined functions. </strong><span> User-defined functions are like datatype extensions, but vastly easier to write, in that they don&#8217;t have any special access methods.  When their performance is good enough, UDFs are often the best way to handle extended-datatype needs.</span></li>
</ul>
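<p style="margin-bottom: 0in">The user-defined function approach is the easiest of the four to show in miniature. The Python sketch below registers a great-circle distance function with SQLite and calls it from a WHERE clause. SQLite is swapped in purely because it ships with Python; it is not one of the products discussed here, and the table, city coordinates, and the name <code>distance_km</code> are all made up for illustration.</p>

```python
import math
import sqlite3

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

conn = sqlite3.connect(":memory:")
# Register the Python function as a 4-argument SQL function.
conn.create_function("distance_km", 4, haversine_km)
conn.execute("CREATE TABLE stores (name TEXT, lat REAL, lon REAL)")
conn.executemany(
    "INSERT INTO stores VALUES (?, ?, ?)",
    [("Boston", 42.36, -71.06), ("New York", 40.71, -74.01), ("Chicago", 41.88, -87.63)],
)
# No spatial index is involved: the UDF is evaluated for every row scanned.
nearby = conn.execute(
    "SELECT name FROM stores "
    "WHERE distance_km(lat, lon, 42.36, -71.06) < 400 ORDER BY name"
).fetchall()
```

<p style="margin-bottom: 0in">The comment about the missing index is the whole tradeoff: with no special access method, every candidate row gets the function applied to it, which is tolerable for scan-heavy analytic queries and painful for pinpoint lookups.</p>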
<p style="margin-bottom: 0in">So how does this all play out in real-world examples?   It&#8217;s all over the place.</p>
<ul>
<li><strong>Enterprise text search</strong> is divided among three modes – integrated into the RDBMS (Oracle and IBM), tightly coupled server (Microsoft, pre-FAST acquisition), and standalone (Autonomy, FAST pre-acquisition, Google, and most other vendors).</li>
<li><strong>Geospatial datatypes</strong> are embedded into extensible DBMS – generally via technology from ESRI – for OLTP uses.  But for data warehousing, where you don&#8217;t need pinpoint record retrieval, UDFs are generally believed to suffice. (E.g. Teradata, Netezza.)</li>
<li><strong>Intersystems</strong> seems to stand alone in getting nontrivial revenue from the standalone OODBMS market.</li>
<li>The <strong>XML</strong> situation is really confused:  Oracle has been late getting its native XML strategy together; the tightly-coupled DB2 Viper engine has been a performance disappointment; Microsoft&#8217;s integrated native XML isn&#8217;t heard from much either; and text/XML integrated engine MarkLogic is getting some non-text business almost by default.  In addition, every serious relational vendor has a capability to “shred” XML into relational tables, and can of course also just bulk-handle XML via BLOBs/CLOBs.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2008/01/27/the-4-main-approaches-to-datatype-extensibility/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Relational DBMS versus text data</title>
		<link>http://www.dbms2.com/2005/12/09/relational-dbms-versus-text-data/</link>
		<comments>http://www.dbms2.com/2005/12/09/relational-dbms-versus-text-data/#comments</comments>
		<pubDate>Fri, 09 Dec 2005 16:29:56 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data types]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[Memory-centric data management]]></category>
		<category><![CDATA[Text]]></category>
		<category><![CDATA[Theory and architecture]]></category>
		<category><![CDATA[Relational database management systems]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/2005/12/09/relational-dbms-versus-text-data/</guid>
		<description><![CDATA[There seems to be tremendous confusion about &#8220;search,&#8221; &#8220;meaning,&#8221; &#8220;semantics,&#8221; the suitability of relational DBMS to manage text data, and similar subjects.  Here are some observations that may help sort some of that out.
1.  Relational database theorists like to talk about the &#8220;meaning&#8221; or &#8220;semantics&#8221; of data as being in the database (specifically [...]]]></description>
			<content:encoded><![CDATA[<p>There seems to be tremendous confusion about &#8220;search,&#8221; &#8220;meaning,&#8221; &#8220;semantics,&#8221; the suitability of relational DBMS to manage text data, and similar subjects.  Here are some observations that may help sort some of that out.</p>
<p>1.  Relational database theorists like to talk about the &#8220;meaning&#8221; or &#8220;semantics&#8221; of data as being in the database (specifically its metadata, and more specifically its constraints).  This is at best a very limited use of the words &#8220;meaning&#8221; or &#8220;semantics,&#8221; and has little to do with understanding the meaning of plain English (or other language) phrases, sentences, paragraphs, etc. that may be stored in the database. <a href="http://www.dbdebunk.com/page/page/2735122.htm" onclick="javascript:pageTracker._trackPageview('/www.dbdebunk.com');"> Hugh Darwen is right and his fellow relational theorists are confused</a>.</p>
<p>2.  The standard way to manage text is via a full-text index, designed like this:  For each of hundreds of thousands of words, the index maintains a list of which documents the word appears in, and at what positions in each document it appears.  This is a columnar, memory-centric approach that doesn&#8217;t work well with the architecture of mainstream relational products.  Oracle pulled off a decent single-server integration nonetheless, although performance concerns linger to this day.  Others, like Sybase, which attempted a Verity integration, couldn&#8217;t make it work reasonably at all.  Microsoft, which started from the Sybase architecture, didn&#8217;t even try, or if they tried it wasn&#8217;t for long; Microsoft&#8217;s text search strategy has been multi-server more or less from the get-go.</p>
<p>3.  Notwithstanding point #2, Oracle, IBM, Microsoft, and others have extended their SQL DBMS to handle text via the SQL3 (or SQL/MM) standard.  (Truth be told, I get the names and sequencing of the SQL standard versions mixed up.)  From this standpoint, the full text of a document is in a single column, and one can write WHERE clauses on that column using a rich set of text search operators.</p>
<p>But while such SQL statements formally fit into the relational predicate logic model, the fit is pretty awkward.  Text search functions aren&#8217;t two-valued binary yes/no types of things; rather, they give scores, e.g. with 101 possible values (the integers from 0 &#8211; 100).   Compounding them into a two-valued function typically throws away information, especially since that compounding isn&#8217;t well understood (which is why it&#8217;s so hard to usefully federate text searches across different corpuses).</p>
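<p>A tiny Python illustration of that information loss follows. The scores, document names, and the threshold of 50 are all invented for the example, and no particular product computes relevance this way.</p>

```python
def matches(score, threshold=50):
    """Collapse a 0-100 relevance score into a yes/no predicate."""
    return score >= threshold

# Hypothetical scores from two separately indexed corpora.
corpus_a = {"doc1": 95, "doc2": 51}
corpus_b = {"doc3": 49, "doc4": 80}

# Once the predicate is applied, only membership survives, not the score.
hits = sorted(
    doc
    for corpus in (corpus_a, corpus_b)
    for doc, score in corpus.items()
    if matches(score)
)
```

<p>After the predicate is applied, a score of 95 and a score of 51 look identical, while 49 disappears altogether. And since the two corpora may well have been scored on different scales, even the surviving yes/no answers are not strictly comparable.</p>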
<p>4.   Something even trickier is going on.  Text search can be carried out against many different kinds of things.  One increasingly useful target is the tables of a relational database.   Where a standard SQL query might have trouble finding all the references in a whole database to a particular customer organization or product line or whatever, a text search can do a better job.   This kind of use is becoming increasingly frequent.  And while it works OK against relational products, it doesn&#8217;t fit into the formal relational model at all (at least not without a tremendous amount of contortion).</p>
<p>5.  Relational DBMS typically manage the data they index.  Text search systems often don&#8217;t.  But that difference is almost a small one compared with some of the others mentioned above, especially since it&#8217;s a checkmark item for leading RDBMS to have some sort of formal federation capability.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2005/12/09/relational-dbms-versus-text-data/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
	</channel>
</rss>
