<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; NoSQL</title>
	<atom:link href="http://www.dbms2.com/category/database-theory-practice/nosql/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Thu, 09 Feb 2012 09:21:51 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>WibiData, derived data, and analytic schema flexibility</title>
		<link>http://www.dbms2.com/2012/02/06/wibidata-derived-data-and-analytic-schema-flexibility/</link>
		<comments>http://www.dbms2.com/2012/02/06/wibidata-derived-data-and-analytic-schema-flexibility/#comments</comments>
		<pubDate>Tue, 07 Feb 2012 03:18:25 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Odiago and WibiData]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5907</guid>
		<description><![CDATA[My clients at Odiago, vendors of WibiData, have changed their company name simply to WibiData. Even better, they blogged with more detail as to how WibiData works, in what is essentially a follow-on to my original WibiData post last October. Among other virtues, WibiData turns out to be a poster child for my views on [...]]]></description>
			<content:encoded><![CDATA[<p>My clients at Odiago, vendors of WibiData, have changed their company name simply to WibiData. Even better, they blogged with more detail as to <a href="http://www.wibidata.com/2012/02/07/how-wibidata-works/">how WibiData works</a>, in what is essentially a follow-on to <a href="../../../../../2011/11/02/5576/">my original WibiData post</a> last October. Among other virtues, WibiData turns out to be a poster child for my views on <a href="../../../../../2011/09/06/derived-data-progressive-enhancement-and-schema-evolution/">derived data and the corresponding schema evolution</a>.</p>
<p>Interesting quotes include:</p>
<blockquote><p>WibiData is designed to store &#8230; transactional data side-by-side with profile and other derived data attributes.</p></blockquote>
<blockquote><p>&#8230; the ability to add new ad-hoc columns to a table enables more flexible analysis: output data that is the result of one analytic pipeline is stored adjacent to its input data, meaning that you can easily use this as input to second- or third-order derived data as well.</p></blockquote>
<blockquote><p>schemas can vary over time; you can easily add a field to a record, or delete a field. &#8230; But even though you start collecting that new data, your existing analysis pipelines can treat records like they always did; programs that don’t yet know about the new cookie are still compatible with both the old records already collected, and the new records with the additional field. New programs fill in default values for old data recorded before a field was added, applying the new schema at read time.</p></blockquote>
<blockquote><p>schemas for every column are stored in a data dictionary that matches column names with their schemas, as well as human-readable descriptions of the data.</p></blockquote>
<p>Interesting aspects of the post that don&#8217;t lend themselves as well to being excerpted include:</p>
<ul>
<li>How the Produce-Gather &#8220;analysis calculus&#8221; &#8212; i.e. framework &#8212; works.</li>
<li>How this all ties into Apache projects (and sub-projects) such as Hadoop, HBase, and Avro.</li>
</ul>
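<p>The read-time schema behavior quoted above can be illustrated with a minimal sketch. It uses plain Python dictionaries rather than Avro or any WibiData API, and the names <code>NEW_SCHEMA</code>, <code>OLD_SCHEMA</code> and <code>read_with_schema</code> are assumptions made up for illustration, not anything from the WibiData post.</p>
<pre>
```python
# Hypothetical reader schemas: field name -> default applied when a record lacks it.
OLD_SCHEMA = {"user_id": None, "page": ""}
NEW_SCHEMA = {"user_id": None, "page": "", "cookie": "unknown"}  # "cookie" added later


def read_with_schema(record: dict, schema: dict) -> dict:
    """Project a stored record onto a reader schema at read time.

    Fields missing from the record get the schema default;
    fields the schema does not know about are ignored.
    """
    return {field: record.get(field, default) for field, default in schema.items()}


old_record = {"user_id": 1, "page": "/home"}                      # written before "cookie"
new_record = {"user_id": 2, "page": "/cart", "cookie": "abc123"}  # written after

# A new program fills in the default for data recorded before the field existed.
print(read_with_schema(old_record, NEW_SCHEMA))
# An old program keeps working on new records; it never sees the extra field.
print(read_with_schema(new_record, OLD_SCHEMA))
```
</pre>
<p>Nothing stored is rewritten in this sketch; each program applies its own schema when it reads, which is the property the quote describes.</p>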
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/02/06/wibidata-derived-data-and-analytic-schema-flexibility/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Couchbase update</title>
		<link>http://www.dbms2.com/2012/02/01/couchbase-update/</link>
		<comments>http://www.dbms2.com/2012/02/01/couchbase-update/#comments</comments>
		<pubDate>Thu, 02 Feb 2012 04:00:24 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Basho and Riak]]></category>
		<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[CouchDB]]></category>
		<category><![CDATA[Couchbase]]></category>
		<category><![CDATA[DataStax]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[MongoDB and 10gen]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[Zynga]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5877</guid>
		<description><![CDATA[I checked in with James Phillips for a Couchbase update, and I understand better what&#8217;s going on. In particular: Give or take minor tweaks, what I wrote in my August, 2011 Couchbase updates still applies. Couchbase now and for the foreseeable future has one product line, called Couchbase. Couchbase 2.0, the first version of Couchbase [...]]]></description>
			<content:encoded><![CDATA[<p>I checked in with James Phillips for a Couchbase update, and I understand better what&#8217;s going on. In particular:</p>
<ul>
<li>Give or take minor tweaks, what I wrote in my <a href="../../../../../2011/08/13/couchbase-business-update/">August, 2011 Couchbase updates</a> still applies.</li>
<li>Couchbase now and for the foreseeable future has one product line, called Couchbase.</li>
<li>Couchbase 2.0, the first version of Couchbase (the product) to use CouchDB for persistence, has slipped &#8230;</li>
<li>&#8230; because more parts of CouchDB had to be rewritten for performance than Couchbase (the company) had hoped.</li>
<li>Think mid-year or so for the release of Couchbase 2.0, hopefully sooner.</li>
<li>In connection with the need to rewrite parts of CouchDB, Couchbase has:
<ul>
<li><a href="../../../../../2012/01/18/notes-from-the-couch-blogs/">Gotten out of the single-server CouchDB business</a>.</li>
<li>Donated its proprietary single-server CouchDB intellectual property to the Apache Foundation.</li>
</ul>
</li>
<li>The 150ish new customers in 2011 Couchbase brags about are real, subscription customers.</li>
<li>Couchbase has 60ish people, headed to &gt;100 over the next few months.</li>
</ul>
<p><span id="more-5877"></span><em>If you previously heard the brand names Couchbase Single or Couchbase Mobile, pay no further attention to them. Couchbase Single was CouchDB; Couchbase Mobile is part of Couchbase&#8217;s feature set.</em></p>
<p>The current product is Couchbase 1.8, which is a whole lot like what previously was called Membase. New features in Couchbase 1.8 (versus prior versions of Membase) were concentrated in client libraries/SDK (Software Development Kit). Not coincidentally, Couchbase has hired developer evangelists who are in charge of making Couchbase play nicely with various specific languages (e.g. C/C++).</p>
<p>Drilling down further into the CouchDB part of the story:</p>
<ul>
<li>Couchbase 2.0 will replace Couchbase 1.8/Membase&#8217;s SQLite back-end with CouchDB.</li>
<li>Parts of CouchDB that do things like read, write, or compact data have been rewritten from Erlang to C.</li>
<li>Couchbase still uses other Erlang parts of Apache CouchDB, and would be delighted if the community were to usefully enhance them.</li>
<li>Couchbase&#8217;s heavy contributions to development of open source CouchDB will, for the most part, continue.</li>
<li>CouchDB stuff donated to the Apache Foundation includes:
<ul>
<li>Documentation</li>
<li>Packaging</li>
<li>Performance enhancements</li>
</ul>
</li>
</ul>
<p>There&#8217;s at least one Couchbase user with &gt;1000 nodes (at a guess, <a href="../../../../../2011/09/05/zynga-linkedin-data-warehous/">Zynga</a>). More typical might be 20 nodes or fewer. This led me to wonder how much data one puts on a Couchbase node anyway. The answer turns out to vary widely, in that you want your working set to be in RAM, and whether that&#8217;s your entire database or just a slice of it depends on the nature of the application.</p>
<p>James echoed a trend I&#8217;ve heard elsewhere as well, in which products one thinks of as being internet-specific are also sold in a few cases to conventional enterprises for &#8212; you guessed it! &#8212; their internet operations. I also asked him about competition, and he asserted:</p>
<ul>
<li>MongoDB is the big competition. He believes Couchbase has an excellent win rate vs. 10gen for actual paying accounts.</li>
<li>DataStax/Cassandra wins over Couchbase only when multi-data-center capability is important. Naturally, multi-data-center capability is planned for Couchbase. (Indeed, that&#8217;s one of the benefits of swapping in CouchDB at the back end.)</li>
<li>Redis has &#8220;dropped off the radar&#8221;, presumably because there&#8217;s no particular persistence strategy for it.</li>
<li>Riak doesn&#8217;t show up much.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/02/01/couchbase-update/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Microsoft SQL Server 2012 and enterprise database choices in general</title>
		<link>http://www.dbms2.com/2012/01/24/microsoft-sql-server-2012/</link>
		<comments>http://www.dbms2.com/2012/01/24/microsoft-sql-server-2012/#comments</comments>
		<pubDate>Tue, 24 Jan 2012 14:42:34 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[Microsoft and SQL*Server]]></category>
		<category><![CDATA[Mid-range]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Oracle]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5859</guid>
		<description><![CDATA[Microsoft is launching SQL Server 2012 on March 7. An IM chat with a reporter resulted, and went something like this. Reporter: [Care to comment]? CAM: SQL Server is an adequate product if you don&#8217;t mind being locked into the Microsoft stack. For example, the ColumnStore feature is very partial, given that it can&#8217;t be [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.sqlserverlaunch.com/ww/Home">Microsoft is launching SQL Server 2012 on March 7</a>. An IM chat with a reporter resulted, and went something like this.</p>
<p><strong>Reporter: [Care to comment]?</strong><br />
<strong>CAM:</strong> SQL Server is an adequate product if you don&#8217;t mind being locked into the Microsoft stack. For example, the ColumnStore feature is very partial, given that <a href="http://msdn.microsoft.com/en-us/library/gg492088%28v=sql.110%29.aspx#Update">it can&#8217;t be updated</a>; but Oracle doesn&#8217;t have columnar storage at all.</p>
<p><strong>Reporter: Is the lock-in overall worse than IBM DB2, Oracle?</strong><br />
<strong>CAM:</strong> Microsoft locks you into an operating system, so yes.</p>
<p><strong>Reporter: Is this release something larger Oracle or IBM shops could consider as a lower-cost alternative in a co-habitation scenario, in the event they&#8217;re mulling whether to buy more Oracle or IBM licenses?</strong><br />
<strong>CAM:</strong> If they have a strong Microsoft-stack investment already, sure. Otherwise, why?</p>
<p><strong>Reporter: [How about] just cost?</strong><br />
<strong>CAM:</strong> DB2 works just as well to keep Oracle honest as SQL Server does, and without a major operating system commitment. For analytic databases you want an analytic DBMS or appliance anyway.</p>
<p>Best is to have one major vendor of OLTP/general-purpose DBMS, a web DBMS, a DBMS for disposable projects (that may be the same as one of the first two), plus however many different analytic data stores you need to get the job done.</p>
<p>By &#8220;web DBMS&#8221; I mean MySQL, NewSQL, or NoSQL. Actually, you might need more than one product in that area.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/01/24/microsoft-sql-server-2012/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Big data terminology and positioning</title>
		<link>http://www.dbms2.com/2012/01/08/big-data-terminology-and-positioning/</link>
		<comments>http://www.dbms2.com/2012/01/08/big-data-terminology-and-positioning/#comments</comments>
		<pubDate>Mon, 09 Jan 2012 01:35:57 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5768</guid>
		<description><![CDATA[Recently, I observed that Big Data terminology is seriously broken. It is reasonable to reduce the subject to two quasi-dimensions: Bigness &#8212; Volume, Velocity, size Structure &#8212; Variety, Variability, Complexity given that High-velocity &#8220;big data&#8221; problems are usually high-volume as well.* Variety, variability, and complexity all relate to the simply-structured/poly-structured distinction. But the conflation should [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, I observed that <a href="../../../../../2011/09/11/big-data-has-jumped-the-shark/">Big Data terminology is seriously broken</a>. It is reasonable to reduce the subject to two quasi-dimensions:</p>
<ul>
<li><strong>Bigness</strong> &#8212; Volume, Velocity, Size</li>
<li><strong>Structure</strong> &#8212; Variety, Variability, Complexity</li>
</ul>
<p>given that</p>
<ul>
<li>High-velocity &#8220;big data&#8221; problems are usually high-volume as well.*</li>
<li>Variety, variability, and complexity all relate to the <a href="../../../../../2011/05/17/poly-structured-database/">simply-structured/poly-structured</a> distinction.</li>
</ul>
<p>But the conflation should stop there.</p>
<p><em>*Low-volume/high-velocity problems are commonly referred to as <a href="../2011/08/25/renaming-cep-or-not/">&#8220;event processing&#8221; and/or &#8220;streaming&#8221;</a>.</em></p>
<p>When people claim that bigness and structure are the same issue, they oversimplify into mush. So I think we need four pieces of terminology, reflective of a 2&#215;2 matrix of possibilities. For want of better alternatives, my suggestions are:</p>
<ul>
<li><strong>Relational big data</strong> is data of high volume that fits well into a relational DBMS.</li>
<li><strong>Multi-structured big data</strong> is data of high volume that doesn&#8217;t fit well into a relational DBMS. <em>Alternative: Poly-structured big data.</em></li>
<li><strong>Conventional relational data</strong> is data of not-so-high volume that fits well into a relational DBMS. <em>Alternatives: Ordinary/normal/smaller relational data.</em></li>
<li><strong>Smaller poly-structured data</strong> is data for which <a href="../../../../../2011/07/31/dynamic-fixed-schema-databases/">dynamic schema</a> capabilities are important, but which doesn&#8217;t rise to &#8220;big data&#8221; volume.</li>
</ul>
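<p>As a minimal sketch, the 2&#215;2 matrix above can be written as a lookup on the two quasi-dimensions. The function name <code>classify</code> and its boolean parameters are assumptions made up for illustration; where to draw the line for high volume is left to the caller.</p>
<pre>
```python
def classify(high_volume: bool, fits_relational: bool) -> str:
    """Map (bigness, structure) to one of the four suggested terms."""
    labels = {
        (True, True): "relational big data",
        (True, False): "multi-structured big data",
        (False, True): "conventional relational data",
        (False, False): "smaller poly-structured data",
    }
    return labels[(high_volume, fits_relational)]


# High-volume data that does not fit well into a relational DBMS.
print(classify(high_volume=True, fits_relational=False))
```
</pre>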
<p><span id="more-5768"></span>Notes on all this include:</p>
<ul>
<li>&#8220;Relational big data&#8221; is commonly what you need a scalable analytic relational DBMS for. But there are non-analytic use cases as well.</li>
<li>The paradigmatic example of &#8220;multi-structured big data&#8221; is log files. Thus, multi-structured big data is commonly what you need a <a href="../../../../../2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a> for.</li>
<li>One might want to equate non-analytic relational big data technology to &#8220;NewSQL&#8221;. However, I&#8217;m struggling to think of a database size range in which the entire NewSQL industry can match Oracle&#8217;s market share alone.</li>
<li>One might want to equate non-analytic multi-structured big data technology to &#8220;NoSQL&#8221;. However:
<ul>
<li>&#8220;NoSQL&#8221; is also used to encompass not-so-big-data use cases, such as prototyping in MongoDB.</li>
<li><a href="../../../../../2011/10/02/defining-nosql/">&#8220;NoSQL&#8221; has non-ACID/low(er)-data-integrity connotations</a> that aren&#8217;t appropriate for all non-relational systems.</li>
</ul>
</li>
<li>Up to a point, you can analyze relational big data in a conventional relational DBMS, but an analytic RDBMS will usually win on TCO (Total Cost of Ownership). In particular, reasonable thresholds for moving an analytic database off Oracle might be:
<ul>
<li>1-2 terabytes if you&#8217;ve never bought anything past Oracle Standard Edition.</li>
<li>5-10 terabytes if you&#8217;re already paying for Oracle Enterprise Edition.</li>
<li>A lot higher than that if you actually find Oracle Exadata to be cost-effective.</li>
</ul>
</li>
<li>Depending on how big one acknowledges as &#8220;big&#8221;, the market share leader in &#8220;big bit bucket&#8221; use cases is either Splunk or Hadoop.</li>
<li>If we look at multi-structured big data management overall, MarkLogic joins the list of market share contenders, as do various NoSQL alternatives.</li>
<li>It is wrong to say that the large web companies invented &#8220;big data&#8221; technology. But it is more reasonable to say they invented much of &#8220;multi-structured big data&#8221; management. In particular (and this is just a partial list), Google, Amazon, Yahoo, Facebook, et al. can reasonably be credited with Hadoop, Cassandra, HBase and various predecessors to same.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/01/08/big-data-terminology-and-positioning/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Some big-vendor execution questions, and why they matter</title>
		<link>http://www.dbms2.com/2011/11/21/big-vendor-execution-analytics/</link>
		<comments>http://www.dbms2.com/2011/11/21/big-vendor-execution-analytics/#comments</comments>
		<pubDate>Mon, 21 Nov 2011 11:01:20 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Cognos]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[HP and Neoview]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[In-memory DBMS]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Memory-centric data management]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[SAP AG]]></category>
		<category><![CDATA[Vertica Systems]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5704</guid>
		<description><![CDATA[When I drafted a list of key analytics-sector issues in honor of look-ahead season, the first item was &#8220;execution of various big vendors&#8217; ambitious initiatives&#8221;.  By &#8220;execute&#8221; I mean mainly: &#8220;Deliver products that really meet customers&#8217; desires and needs.&#8221; &#8220;Successfully convince them that you&#8217;re doing so &#8230;&#8221; &#8220;&#8230; at an attractive overall cost.&#8221; Vendors mentioned [...]]]></description>
			<content:encoded><![CDATA[<p>When I drafted a list of key analytics-sector issues in honor of <a href="http://www.dbms2.com/2011/11/21/analytic-trends-in-2012-qa/">look-ahead season</a>, the first item was &#8220;execution of various big vendors&#8217; ambitious initiatives&#8221;.  By &#8220;execute&#8221; I mean mainly:</p>
<ul>
<li>&#8220;Deliver products that really meet customers&#8217; desires and needs.&#8221;</li>
<li>&#8220;Successfully convince them that you&#8217;re doing so &#8230;&#8221;</li>
<li>&#8220;&#8230; at an attractive overall cost.&#8221;</li>
</ul>
<p>Vendors mentioned here are Oracle, SAP, HP, and IBM. Anybody smaller got left out due to the length of this post. Among the bigger omissions were:</p>
<ul>
<li>salesforce.com (multiple subjects).</li>
<li><a href="../../../../../2011/04/21/sas-hpa-does-make-sense-after-all/">SAS HPA</a>.</li>
<li><a href="../../../../../2011/08/21/hadoop-evolution/">The evolution of Hadoop</a>.</li>
</ul>
<p><span id="more-5704"></span><strong>A (lingering) issue for SAP and Oracle alike</strong></p>
<p>As I noted in January of this year, <a href="../../../../../2011/01/03/the-six-useful-things-you-can-do-with-analytic-technology/">integration of business intelligence into operational apps is making very slow progress</a>. Even so, it&#8217;s a huge part of the apparent strategy at SAP and Oracle alike, as well it should be. Much of the benefit from automating routine desk work has already happened. The areas ripest for exploitation are the ones where analytics are part of the equation.</p>
<p>Given the lack of tangible progress, why do I think this is a genuine area of Oracle and SAP emphasis? Three reasons of many are:</p>
<ul>
<li>Why else did SAP buy Business Objects?</li>
<li>If they&#8217;re not trying to <a href="../../../../../2011/03/30/short-request-and-analytic-processing/">integrate operational apps and analytics</a>, why else does SAP&#8217;s emphasis on HANA make sense?</li>
<li>Without business intelligence in the picture, how does Oracle&#8217;s integrated-stack story promise any direct user benefits?*</li>
</ul>
<p><em>*As opposed to IT concerns &#8212; integration, administration, TCO (Total Cost of Ownership), etc.</em></p>
<p>After so many years of disappointment, I&#8217;m not going to forecast 2012 as a pivotal year for <strong>the integration of business intelligence into operational applications.</strong> But if one of SAP or Oracle ever does get a significant BI/operational app integration advantage over the other, it could be a major competitive advantage in those application market segments that are still up for grabs. It also is an opportunity for both vendors to gain BI market share in their respective application customer bases.</p>
<p><strong>A more urgent issue for SAP</strong></p>
<p>SAP has put huge amounts of credibility on the line for HANA, the integration of two different and not particularly mature in-memory database technologies. So far, it is difficult to find evidence that HANA is robust enough for widespread adoption. Whether or not SAP can fix that is a huge open question, which could have significant impact on the course of several technology areas: applications, business intelligence, in-memory DBMS, and maybe even hardware.</p>
<p>Based on current information, which is admittedly partial, I&#8217;m a short-term pessimist on HANA. Longer-term, I&#8217;m on record as saying that <a href="../../../../../2011/05/23/databases-ram/">traditional databases will eventually wind up in RAM</a>. SAP will surely get that technology right some day, whether or not the way it does so has anything to do with present-day HANA code.</p>
<p><strong>Four more issues for Oracle</strong></p>
<p>Oracle&#8217;s ambitions are near-endless, and so also therefore is its list of execution challenges. Four in the analytics area that I find particularly interesting are:</p>
<ul>
<li><strong>True hybrid columnar DBMS.</strong> <a href="../../../../../2011/09/22/teradata-columnar-compression/">I was guessing that Oracle, like Teradata, would announce true hybrid columnar the week of Oracle OpenWorld</a>. I was wrong. But if Oracle can&#8217;t bring out true hybrid columnar DBMS functionality relatively soon, Exadata will lose credibility as a competitor to more specialized analytic DBMS.</li>
<li><strong>Oracle Exalytics.</strong> With Exalytics in the mix, Oracle&#8217;s technology stack has HANA-like potential. But will Exalytics even ship in 2012? (I think so.) Will it be good for much in the first release? (I&#8217;m skeptical.)</li>
<li><strong>Oracle&#8217;s Big Data Appliance</strong>. I&#8217;m skeptical both about <a href="../../../../../2011/10/20/more-notes-on-oracle-nosql/">Oracle&#8217;s NoSQL product</a> &#8212; <a href="http://www.infoworld.com/d/data-explosion/first-look-oracle-nosql-database-179107">a favorable InfoWorld review</a> notwithstanding &#8212; and <a href="../../../../../2011/09/23/hadoop-appliances/">Hadoop appliances</a>. But if I&#8217;m wrong, and Oracle can successfully embrace/extend the new non-relational paradigms, then it really might regain control over the evolution of data management.</li>
<li><strong><a href="../../../../../2011/10/18/oracle-is-buying-endeca/">Oracle&#8217;s Endeca acquisition</a></strong> &#8212; will Oracle prove me wrong and integrate Endeca effectively into its overall analytic product line? If it does, we might finally see effective text (and eventually speech) navigation of enterprise software. (But as with all Oracle issues cited here, this is something that probably won&#8217;t amount to much in 2012 even if it does later go well.)</li>
</ul>
<p><strong>Three issues for IBM</strong></p>
<p>Like Oracle, IBM is a huge company with many ambitions and hence many execution challenges. The biggest of those is surely: <strong>How effective can IBM be at selling outside its existing customer base?</strong> I don&#8217;t hear as much competitively about IBM DataStage, IBM SPSS or now IBM Netezza as I did when their vendors were independent companies. Even Cognos may not be much of an exception to the rule, although it has its own large customer base outside of IBM&#8217;s traditional one. (To lesser extents, the same is of course true of Netezza and numerous other IBM acquisitions.)</p>
<p>Another general issue for IBM is <strong>substantively integrating its various product lines,</strong> at least to the extent that makes sense. DB2/Netezza integration sounds good, but even that is a matter more of product marketing (the admirable part of that discipline) than of actual technology. Other integrations (e.g. Cognos/DB2 in various bundles) have tended toward the dubious side.*</p>
<p><em>*I&#8217;m still waiting for IBM to get back to me with examples of how Cognos/DB2 joint tuning amounts to anything. It&#8217;s been more than a year, so I&#8217;m glad I didn&#8217;t hold my breath.</em></p>
<p>In a somewhat narrower vein, I wonder: <strong><a href="../../../../../2011/11/10/cep-streaming-catchup/">Will IBM be able to gain traction for InfoSphere Streams</a>?</strong> And if so, when and where will the traction be?</p>
<p><strong>Will HP screw up Vertica?</strong></p>
<p>Vertica has a very attractive product offering. It&#8217;s perhaps <a href="../../../../../2011/06/20/columnar-dbms-vendor-customer-metrics/">the most scalable analytic DBMS outside of Teradata</a>, running on the hardware of your reasonable choice.  It&#8217;s also the one I recommend most often to clients in the 1-50 terabyte range.</p>
<p>So far HP doesn&#8217;t seem to have done much to leadfoot Vertica. (About all I&#8217;ve heard from competitors is that Vertica seems to have faded somewhat in the financial services market, and there could be multiple explanations if that is indeed true.) But if HP Vertica does somehow manage to botch things, opportunities will open up for a range of columnar analytic DBMS competitors.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/21/big-vendor-execution-analytics/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>The cool aspects of Odiago WibiData</title>
		<link>http://www.dbms2.com/2011/11/02/5576/</link>
		<comments>http://www.dbms2.com/2011/11/02/5576/#comments</comments>
		<pubDate>Wed, 02 Nov 2011 15:05:01 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Odiago and WibiData]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5576</guid>
		<description><![CDATA[Christophe Bisciglia and Aaron Kimball have a new company. It&#8217;s called Odiago, and is one of my gratifyingly more numerous tiny clients. Odiago&#8217;s product line is called WibiData, after the justly popular We Be Sushi restaurants. We&#8217;ve agreed on a split exclusive de-stealthing launch. You can read about the company/founder/investor stuff on TechCrunch. But this [...]]]></description>
			<content:encoded><![CDATA[<p>Christophe Bisciglia and Aaron Kimball have a new company.</p>
<ul>
<li>It&#8217;s called Odiago, and is one of my gratifyingly more numerous tiny clients.</li>
<li>Odiago&#8217;s product line is called <a href="http://www.wibidata.com/">WibiData</a>, after the justly popular We Be Sushi restaurants.</li>
<li>We&#8217;ve agreed on a split exclusive de-stealthing launch. You can read about the company/founder/investor stuff on <a href="http://techcrunch.com/2011/11/02/cloudera-founder-debuts-big-data-management-and-analysis-platform-wibidata-with-backing-from-eric-schmidt/">TechCrunch</a>. But this is the place for &#8212; well, for the tech crunch.</li>
</ul>
<p><strong>WibiData is designed for management of, <a href="../../../../../2011/03/03/investigative-analytics/">investigative analytics</a> on, and operational analytics on consumer internet data,</strong> the main examples of which are web site traffic and personalization and their analogues for games and/or mobile devices. The core WibiData technology, built on HBase and Hadoop,* is <strong>a data management and analytic execution layer.</strong> That&#8217;s where the secret sauce resides. Also included are:</p>
<ul>
<li>REST APIs for interactive access.</li>
<li>Import/export tools, including JDBC access.</li>
<li>Management tools.</li>
<li>Analytic libraries &#8212; data mining, predictive analytics, machine learning, and so on.</li>
</ul>
<p>The whole thing is in beta, with about three (paying) beta customers.</p>
<p><em>*And Avro and so on.</em></p>
<p>The core ideas of WibiData include:</p>
<ul>
<li><strong>ALL data pertaining to a single user </strong>(or mobile device) <strong>is kept in      a single, </strong>possibly very long,<strong> HBase row.</strong><strong> </strong></li>
<li>There are two primary operators in WibiData, <strong>Produce </strong>and <strong>Gather.</strong>
<ul>
<li>Produce operates on single rows. It can operate on one row at HBase speed (milliseconds) if you need to inform an interactive user response. Or it can operate on the whole database in batch via Hadoop MapReduce.</li>
<li>It is reasonable to think of Produce as mainly doing two things. One is the aforementioned serving of data out of WibiData into interactive applications. The other is scoring, classifying, recommending, etc. on individual users (i.e. rows), in line with an analytic model.</li>
<li>Gather typically operates on all your rows at once, and emits suitable input for a MapReduce Reduce step. It is reasonable to think of Gather as being a key cog in the training of analytic models.</li>
</ul>
</li>
<li>HBase <strong>schema management is done at the WibiData system level,</strong> not directly in applications. There&#8217;s a WibiData HBase data dictionary, powered by a set of system tables, that specifies cell data types/record types and, in effect, primitive schemas.</li>
</ul>
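<p>The Produce/Gather split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory <code>rows</code> dictionary stands in for one wide HBase row per user, and the names <code>produce</code>, <code>gather</code>, and <code>reduce_counts</code> are hypothetical, not WibiData&#8217;s actual API.</p>
<pre><code>from collections import defaultdict

# Hypothetical stand-in for one wide row per user; not WibiData's API.
rows = {
    "user1": {"page_views": ["/home", "/pricing", "/home"]},
    "user2": {"page_views": ["/home"]},
}

def produce(row):
    """Per-row operator: derive a value from a single user's row."""
    return {"visit_count": len(row["page_views"])}

def gather(row):
    """Per-row emitter: yield (key, value) pairs for a later reduce step."""
    for page in row["page_views"]:
        yield page, 1

def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Interactive use: run produce on a single row.
single = produce(rows["user1"])

# Batch use: run produce over every row and write the result back into the row.
for row in rows.values():
    row.update(produce(row))

# Model-training style use: gather over all rows, then reduce.
page_totals = reduce_counts(pair for row in rows.values() for pair in gather(row))
</code></pre>
<p>The same <code>produce</code> function serves both the one-row interactive case and the whole-table batch case, which is the point of defining it per row; in a real deployment the batch loop and the reduce step would presumably run as MapReduce jobs rather than in-process.</p>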
<p><span id="more-5576"></span>WibiData-enhanced HBase differs from relational DBMS in most of the ways you would imagine, both good and bad. In particular:</p>
<ul>
<li>Depending on how you look at it, WibiData-enhanced HBase either has no DML (Data Manipulation Language) at all, or else has one that&#8217;s a lot less rich than SQL.</li>
<li>WibiData-enhanced HBase schemas are much more <a href="../../../../../2011/07/31/dynamic-fixed-schema-databases/">dynamic</a> than SQL schemas.</li>
<li>WibiData-enhanced HBase schemas can have nested or recursive data structures, such as array-valued cells.</li>
</ul>
<p>To expand on each of those points in turn:</p>
<p>WibiData&#8217;s underlying one-giant-table philosophy notwithstanding, there are times you manage multiple tables with it. (For example, you ingest data into WibiData however you can, and then run transformations &#8212; typically batch &#8212; until the data is in the preferred structure.) While WibiData does have ways to simulate joins, foreign keys, and so on, there&#8217;s nothing resembling referential integrity or foreign key constraints.</p>
<p><strong>WibiData takes single-table schema flexibility to an extreme.</strong> Not only can different rows in the same table have different associated columns &#8212; something that relational systems can in effect also do via NULL values &#8212; but schemas can even change over the life of a column. If you have an array-valued cell storing the results of a marketing campaign, and you start recording more data partway through the campaign, then different rows in the table will, in the same column, hold different-sized arrays.</p>
<p>That nesting can also get pretty serious; <strong>where you’d have a single value in a relational table, you might have the equivalent of a whole relational table (or at least selection/view) in WibiData-enhanced HBase. </strong>For example, if a user visits the same web page ten times, and each time 50 attributes are recorded (including a timestamp), all 500 data – to use the word “data” in its original “plural of <em>datum</em>” sense – would likely be stored in the same WibiData cell.</p>
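<p>As a rough illustration of that nesting, here is a Python sketch of one cell holding a small table&#8217;s worth of records. The column name <code>visits:/pricing</code> and the three attribute names are invented for the example (three attributes per visit rather than 50, to keep it short); nothing here reflects WibiData&#8217;s real storage format.</p>
<pre><code># Hypothetical row layout; column and field names are invented for illustration.
row = {
    "user_id": "user1",
    "visits:/pricing": [
        {"timestamp": 1320200000 + i, "referrer": "search", "duration_s": 10 + i}
        for i in range(10)
    ],
}

# One cell holds what a relational design would spread over ten rows.
cell = row["visits:/pricing"]
data_points = sum(len(visit) for visit in cell)

# The relational equivalent: flatten the cell into one tuple per visit.
flat = [
    (row["user_id"], "/pricing", v["timestamp"], v["referrer"], v["duration_s"])
    for v in cell
]
</code></pre>
<p>Ten visits with three attributes each give 30 values in a single cell; with 50 attributes per visit the same structure would hold the 500 mentioned above.</p>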
<p>That’s about all Odiago is disclosing about WibiData right now. Christophe will also be talking at Hadoop World next week, and presumably can be hit up with any burning questions then.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/02/5576/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>NoSQL notes</title>
		<link>http://www.dbms2.com/2011/10/23/nosql-notes/</link>
		<comments>http://www.dbms2.com/2011/10/23/nosql-notes/#comments</comments>
		<pubDate>Mon, 24 Oct 2011 04:20:27 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Basho and Riak]]></category>
		<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Couchbase]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[MongoDB and 10gen]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Parallelization]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5522</guid>
		<description><![CDATA[Last week I visited with James Phillips of Couchbase, Max Schireson and Eliot Horowitz of 10gen, and Todd Lipcon, Eric Sammer, and Omer Trajman of Cloudera. I guess it&#8217;s time for a round-up NoSQL post. Views of the NoSQL market horse race are reasonably consistent, with perhaps some elements of “Where you stand depends upon [...]]]></description>
			<content:encoded><![CDATA[<p>Last week I visited with James Phillips of Couchbase, Max Schireson and Eliot Horowitz of 10gen, and Todd Lipcon, Eric Sammer, and Omer Trajman of Cloudera. I guess it&#8217;s time for a round-up NoSQL post. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>Views of the NoSQL market horse race are reasonably consistent, with perhaps some elements of “Where you stand depends upon where you sit.”</p>
<ul>
<li>As James tells it, NoSQL is simply a three-horse race between Couchbase, MongoDB, and Cassandra.</li>
<li>Max would include HBase on the list.</li>
<li>Further, Max pointed out that metrics such as job listings suggest MongoDB has the most development activity, and Couchbase/Membase/CouchDB perhaps have less.</li>
<li>The Cloudera guys remarked on some serious HBase adopters.*</li>
<li>Everybody I spoke with agreed that Riak had little current market presence, although some Basho guys could surely be found who&#8217;d disagree.</li>
</ul>
<p><span id="more-5522"></span><em>*I hope to do a separate post on HBase adoption soon. In connection with that, any info on HBase adoption by Facebook (said to be very heavy), Twitter, et al. would be much appreciated.</em></p>
<p>The reasons for using NoSQL of course are, in some order, <a href="../../../../../2011/07/31/dynamic-fixed-schema-databases/">dynamic schemas</a>, scale-out, and open source. <a href="http://www.dbms2.com/2011/10/23/transparent-relational-oltp-scale-out/">I find the scale-out argument somewhat bogus</a>,* but the data model one is very real. Depending on whom you talk with, the most important point about dynamic schemas may actually be that they’re changeable, or it may just be that you don’t have to specify a schema at the time of initial application design. MongoDB gets particular praise as a good platform on which to throw something together quickly, although predictions as to how far the application will then scale may differ depending on whether you’re talking with, say, Max or Todd.</p>
<p><em>*It’s fair to say that NoSQL systems are more proven in scale-out than most relational DBMS. Even so, I would cringe at any line of reasoning that concluded one should adopt NoSQL because it is more mature than relational alternatives.</em></p>
<p>Finally, I was perhaps too extreme when <a href="../../../../../2011/10/20/more-notes-on-oracle-nosql/">I suggested there was no good reason for Oracle to have adopted the major key/minor key approach it took in its NoSQL offering</a>. Todd offered a reason why that approach – which he characterized as similar to Project Voldemort’s – could make sense:</p>
<ul>
<li>If you have some kind of global secondary index, it’s hard to maintain that index consistently without what amounts to distributed transactions.</li>
<li>If you want to avoid the overhead of those, one alternative is a column-group system such as HBase or Cassandra. Those have no indexes at all, except in the sense that a column is its own index.</li>
<li>Another alternative is to load as much indexing information as you can into the key of a key-value store.</li>
</ul>
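<p>One reading of that last alternative, sketched in Python: if keys are (major, minor) pairs kept in sorted order, everything sharing a major key can be fetched with a single range scan, with no separate index to keep consistent. The <code>SortedKV</code> class below is a toy stand-in written for this example; it is an assumption about the general technique, not Oracle NoSQL&#8217;s or Project Voldemort&#8217;s API.</p>
<pre><code>import bisect

class SortedKV:
    """Toy ordered key-value store; a stand-in, not any product's API."""

    def __init__(self):
        self._keys = []    # sorted list of (major, minor) tuples
        self._values = {}  # key tuple to stored value

    def put(self, key, value):
        # Insert new keys in sorted position; overwrite existing ones in place.
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan_prefix(self, major):
        """Return (key, value) pairs whose key starts with the major component."""
        # (major,) sorts before every (major, minor), so this finds the first match.
        start = bisect.bisect_left(self._keys, (major,))
        result = []
        for key in self._keys[start:]:
            if key[0] != major:
                break
            result.append((key, self._values[key]))
        return result

store = SortedKV()
store.put(("alice", "email"), "a@example.com")
store.put(("bob", "email"), "b@example.com")
store.put(("alice", "city"), "Boston")

alice = store.scan_prefix("alice")
</code></pre>
<p>Here <code>alice</code> holds both of Alice&#8217;s records, in minor-key order, from one contiguous scan. The trade-off is that lookups by anything other than the key prefix (say, every record with city Boston) still require a full scan or a separately maintained index.</p>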
<p>I’d be interested to learn about the Couchbase and MongoDB answers to that challenge.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/10/23/nosql-notes/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>Transparent relational OLTP scale-out</title>
		<link>http://www.dbms2.com/2011/10/23/transparent-relational-oltp-scale-out/</link>
		<comments>http://www.dbms2.com/2011/10/23/transparent-relational-oltp-scale-out/#comments</comments>
		<pubDate>Mon, 24 Oct 2011 04:19:09 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Clustering]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Schooner Information Technology]]></category>
		<category><![CDATA[dbShards and CodeFutures]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5521</guid>
		<description><![CDATA[There’s a perception that, if you want (relatively) worry-free database scale-out, you need a non-relational/NoSQL strategy. That perception is false. In the analytic case it’s completely ridiculous, as has been demonstrated by Teradata, Vertica, Netezza, and various other MPP (Massively Parallel Processing) analytic DBMS vendors. And now it’s false for short-request/OLTP (OnLine Transaction Processing) use [...]]]></description>
			<content:encoded><![CDATA[<p>There’s a perception that, if you want (relatively) worry-free database scale-out, you need a non-relational/NoSQL strategy. That perception is false. In the analytic case it’s completely ridiculous, as has been demonstrated by <a href="../../../../../2011/09/24/confusion-about-teradatas-big-customers/">Teradata</a>, <a href="../../../../../2011/06/20/columnar-dbms-vendor-customer-metrics/">Vertica</a>, Netezza, and various other MPP (Massively Parallel Processing) analytic DBMS vendors. And now it’s false for <a href="../../../../../2011/03/02/short-request-processing/">short-request</a>/OLTP (OnLine Transaction Processing) use cases as well.</p>
<p>My favorite relational OLTP scale-out choice these days is <a href="http://www.dbms2.com/2011/10/23/schooner-pivots-further/">the SchoonerSQL/dbShards partnership</a>. Schooner Information Technology (SchoonerSQL) and Code Futures (dbShards) are young, small companies, but I’m not too concerned about that, because the APIs they want you to write to are just MySQL’s. The main scenarios in which I can see them failing are ones in which they are competitively leapfrogged, either by other small competitors – e.g. ScaleBase, Akiban, TokuDB, or ScaleDB – or by Oracle/MySQL itself. While that could suck for my clients Schooner and Code Futures, it would still provide users relying on MySQL scale-out with one or more good product alternatives.</p>
<p>Relying on non-MySQL NewSQL startups, by way of contrast, would leave me somewhat more concerned. (However, if their code is open sourced, you have at least some vendor-failure protection.) And big-vendor scale-out offerings, such as Oracle RAC or <a href="../../../../../2011/05/06/db2-oltp-scale-out-purescale/">DB2 pureScale</a>, may be more complex to deploy and administer than the MySQL and NewSQL alternatives.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/10/23/transparent-relational-oltp-scale-out/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>More notes on Oracle NoSQL</title>
		<link>http://www.dbms2.com/2011/10/20/more-notes-on-oracle-nosql/</link>
		<comments>http://www.dbms2.com/2011/10/20/more-notes-on-oracle-nosql/#comments</comments>
		<pubDate>Thu, 20 Oct 2011 15:49:31 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5515</guid>
		<description><![CDATA[A reporter asked me for some thoughts on Oracle&#8217;s new NoSQL product. For the most part, I stand by my previous comments on Oracle NoSQL. Still, NoSQL in general deserves a place in Oracle shops, so it makes sense for Oracle to try to coopt it. Oracle&#8217;s core DBMS is not well suited to track [...]]]></description>
			<content:encoded><![CDATA[<p>A reporter asked me for some thoughts on Oracle&#8217;s new NoSQL product. For the most part, I stand by <a href="http://www.dbms2.com/2011/09/30/oracle-nosql/">my previous comments on Oracle NoSQL</a>. Still, NoSQL in general deserves a place in Oracle shops, so it makes sense for Oracle to try to coopt it.</p>
<p>Oracle&#8217;s core DBMS is not well suited to track interactions (e.g. web clicks), even in cases where it&#8217;s the choice for transactions; it&#8217;s unnecessarily heavyweight. What&#8217;s worse, <a href="http://www.dbms2.com/2010/09/16/chase-authentication-database-outage/">using the same database to store actions and interactions can lead to serious reliability problems.</a> If a better architecture is to dump the clicks into some NoSQL store, massage the information, and eventually put some derived data into a relational DBMS, then Oracle will naturally try to own each step of the data pipeline.</p>
<p><a href="http://www.dbms2.com/2011/07/31/dynamic-fixed-schema-databases/">Dynamic schemas</a> are another area of Oracle weakness, leading in some cases to outright <a href="http://www.dbms2.com/2011/07/27/mongodb-users-and-use-cases/">Oracle replacements</a>. However, pure key-value stores go too far to the opposite extreme; you should at least be able to index and retrieve data one field at a time. Based on what I&#8217;ve seen of Oracle&#8217;s marketing literature, that feature will be missing from the first release of Oracle&#8217;s NoSQL.* Until it&#8217;s in there, and until it works well, I don&#8217;t see why anybody should use Oracle&#8217;s NoSQL product.</p>
<p><em>*Frankly, that choice makes no sense to me on any level. Yet it&#8217;s the way Oracle seems to have elected to go &#8212; or, if it isn&#8217;t, then there&#8217;s somebody writing Oracle marketing collateral who&#8217;s clearly in the wrong line of work.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/10/20/more-notes-on-oracle-nosql/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Text data management, Part 3: Analytic and progressively enhanced</title>
		<link>http://www.dbms2.com/2011/10/10/text-data-management-part-3-analytic-and-progressively-enhanced/</link>
		<comments>http://www.dbms2.com/2011/10/10/text-data-management-part-3-analytic-and-progressively-enhanced/#comments</comments>
		<pubDate>Tue, 11 Oct 2011 01:59:17 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5420</guid>
		<description><![CDATA[This is Part 3 of a three post series. The posts cover: Confusion about text data management. Choices for text data management (general and short-request). Choices for text data management (analytic). I&#8217;ve gone on for two long posts about text data management already, but even so I&#8217;ve glossed over a major point: Using text data [...]]]></description>
			<content:encoded><![CDATA[<p><em>This is Part 3 of a three-post series. The posts cover:</em></p>
<ol>
<li><a href="../2011/10/10/text-data-management-confusion/">Confusion about text data management</a>.</li>
<li><a href="../2011/10/10/text-data-management-part-2-general-and-short-request/">Choices for text data management (general and short-request)</a>.</li>
<li><a href="../2011/10/10/text-data-management-part-3-analytic-and-progressively-enhanced/">Choices for text data management (analytic)</a>.</li>
</ol>
<p>I&#8217;ve gone on for two long posts about text data management already, but even so I&#8217;ve glossed over a major point:</p>
<p><strong>Using text data commonly involves a long series of data enhancement steps.</strong></p>
<p>Even before you do what we&#8217;d normally think of as &#8220;analysis&#8221;, text markup can include steps such as:</p>
<ul>
<li>Figure out where the words break.</li>
<li>Figure out where the clauses and sentences break.</li>
<li>Figure out where the paragraphs, sections, and chapters break.</li>
<li>(Where necessary) map the words to similar ones &#8212; spelling correction, stemming, etc.</li>
<li>Figure out which words are grammatically which parts of speech.</li>
<li>Figure out which pronouns and so on refer to which other words. (Technical term: Anaphora resolution.)</li>
<li>Figure out what was being said, one clause at a time.</li>
<li>Figure out the emotion &#8212; or &#8220;sentiment&#8221; &#8212; associated with it.</li>
</ul>
<p>Those processes can add up to dozens of steps. And maybe, six months down the road, you&#8217;ll think of more steps yet.</p>
<p><span id="more-5420"></span>So when you manage text, it is convenient to assume <a href="../../../../../2011/07/31/dynamic-fixed-schema-databases/">dynamic schemas</a>. That would be an argument for using MarkLogic, NoSQL document stores, and/or Hadoop, rather than strictly relational systems.</p>
<p>That said, text analytics can be done perfectly well in relational databases. Again, I point you to the example of <a href="../../../../../2011/04/14/attensity-update/">Attensity</a>, which will extract for you a large fraction of the information that can be gotten out of the text, put it into a convenient relational schema, and let you get to work. Once the principal extraction has been done, there&#8217;s no reason why your <a href="../../../../../2011/09/06/derived-data-progressive-enhancement-and-schema-evolution/">derived data</a> issues need be any more complex than others you deal with relationally, especially on the analytic side of the house.</p>
<p>But what if you want to do your own text enhancement, rather than using a third party tool? The first thing to ask yourself is &#8212; why? With all due respect to the 10-20 internet-centric companies that are having fun reinventing large portions of the data processing wheel &#8212; if you&#8217;re not at one of those companies, you should probably be trying to use as much third-party software as you possibly can.</p>
<p>I can think of a couple of cases where rolling your own technology makes sense, namely:</p>
<ul>
<li>The hard part of what you&#8217;re doing is extracting snippets of text from some data format proprietary to you.</li>
<li>You&#8217;re trying to do very simple things across a variety of languages much broader than the 10-20 that the text analytics vendors currently do a halfway decent job of handling.</li>
</ul>
<p>I can&#8217;t think of many others.</p>
<p>One thing I&#8217;d definitely be wary of is using Hadoop as a <a href="../../../../../2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a> for individual documents in a variety of formats. I don&#8217;t know what you&#8217;d do with them once they&#8217;re there. Yes, Google invented MapReduce in part to do things like document indexing &#8212; but you&#8217;d probably prefer not to reinvent the Google stack. That&#8217;s quite apart from questions as to whether your document count exceeds Hadoop&#8217;s comfortable <a href="../../../../../2011/08/21/hadoop-evolution/">file-count limit</a>. Solr is a different matter; but while Solr and Hadoop are both open source projects that can be traced back to Doug Cutting, otherwise they&#8217;re rather different things.</p>
<p>A useful way of looking at your choices may be to ask:</p>
<p><strong>After text has run through the main pipeline of manipulation and information extraction:</strong></p>
<ul>
<li><strong>What will the output look like?</strong></li>
<li><strong>Where do I want that output to end up?</strong></li>
</ul>
<p>If the output has to be something that fits into a structured/relational analytic system, then it should probably go into a relational DBMS. If you&#8217;re going to do social network analysis of the sort you&#8217;d ideally like to do in a graph database &#8212; well, unless you&#8217;re an intelligence agency with blank-check resources, you&#8217;ll probably still end up opting for a relational DBMS. If the output consists of simple, homogeneous text files, plus a few fields of metadata, and you&#8217;re not going to do much analysis of it, it can pretty much go anywhere; either SQL or NoSQL might suit your purposes. If you want maximum power and flexibility, MarkLogic may be the ideal destination.</p>
<p>From there, the next question is:</p>
<ul>
<li><strong>What pipeline should the text run through to get to its final destination?</strong></li>
</ul>
<p>Often, as I&#8217;ve argued, the right answer is a third-party text analytic system. Those can generally consume text in almost any kind of file format. Other times &#8212; less often than you may think &#8212; it&#8217;s Hadoop. OK, then pass it through Hadoop. Other possibilities could come up as well (text search engines aren&#8217;t really as useless as I may have seemed to be suggesting).</p>
<p>Anyhow, when you&#8217;ve established where text starts out (that&#8217;s usually a given), what it passes through (please see above), and where its best parts need to end up (ditto), you&#8217;ve done the hardest parts. Figuring out the rest of your text management architecture should be relatively easy by comparison.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/10/10/text-data-management-part-3-analytic-and-progressively-enhanced/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>

