<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; MarkLogic</title>
	<atom:link href="http://www.dbms2.com/category/products-and-vendors/marklogic/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Thu, 09 Feb 2012 09:21:51 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>Big data terminology and positioning</title>
		<link>http://www.dbms2.com/2012/01/08/big-data-terminology-and-positioning/</link>
		<comments>http://www.dbms2.com/2012/01/08/big-data-terminology-and-positioning/#comments</comments>
		<pubDate>Mon, 09 Jan 2012 01:35:57 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5768</guid>
		<description><![CDATA[Recently, I observed that Big Data terminology is seriously broken. It is reasonable to reduce the subject to two quasi-dimensions: Bigness &#8212; Volume, Velocity, size Structure &#8212; Variety, Variability, Complexity given that High-velocity &#8220;big data&#8221; problems are usually high-volume as well.* Variety, variability, and complexity all relate to the simply-structured/poly-structured distinction. But the conflation should [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, I observed that <a href="../../../../../2011/09/11/big-data-has-jumped-the-shark/">Big Data terminology is seriously broken</a>. It is reasonable to reduce the subject to two quasi-dimensions:</p>
<ul>
<li><strong>Bigness</strong> &#8212; Volume, Velocity, size</li>
<li><strong>Structure</strong> &#8212; Variety, Variability, Complexity</li>
</ul>
<p>given that</p>
<ul>
<li>High-velocity &#8220;big data&#8221; problems are usually high-volume as well.*</li>
<li>Variety, variability, and complexity all relate to the <a href="../../../../../2011/05/17/poly-structured-database/">simply-structured/poly-structured</a> distinction.</li>
</ul>
<p>But the conflation should stop there.</p>
<p><em>*Low-volume/high-velocity problems are commonly referred to as <a href="../2011/08/25/renaming-cep-or-not/">&#8220;event processing&#8221; and/or &#8220;streaming&#8221;</a>.</em></p>
<p>When people claim that bigness and structure are the same issue, they oversimplify into mush. So I think we need four pieces of terminology, reflective of a 2&#215;2 matrix of possibilities. For want of better alternatives, my suggestions are:</p>
<ul>
<li><strong>Relational big data</strong> is data of high volume that fits well into a relational DBMS.</li>
<li><strong>Multi-structured big data</strong> is data of high volume that doesn&#8217;t fit well into a relational DBMS. <em>Alternative: Poly-structured big data.</em></li>
<li><strong>Conventional relational data</strong> is data of not-so-high volume that fits well into a relational DBMS. <em>Alternatives: Ordinary/normal/smaller relational data.</em></li>
<li><strong>Smaller poly-structured data</strong> is data for which <a href="../../../../../2011/07/31/dynamic-fixed-schema-databases/">dynamic schema</a> capabilities are important, but which doesn&#8217;t rise to &#8220;big data&#8221; volume.</li>
</ul>
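<p>The 2&#215;2 matrix behind those four terms can be written out as a lookup. This is only an illustrative Python sketch: the names <code>DataCategory</code> and <code>categorize</code> are hypothetical, and treating &#8220;high volume&#8221; and &#8220;fits well into a relational DBMS&#8221; as simple yes/no flags is an assumption, since in practice both are judgment calls.</p>

```python
from enum import Enum


class DataCategory(Enum):
    """The four suggested terms, one per cell of the 2x2 matrix."""
    RELATIONAL_BIG_DATA = "Relational big data"
    MULTI_STRUCTURED_BIG_DATA = "Multi-structured big data"
    CONVENTIONAL_RELATIONAL_DATA = "Conventional relational data"
    SMALLER_POLY_STRUCTURED_DATA = "Smaller poly-structured data"


# Keys are (high_volume, fits_relational); both flags are simplifications.
_MATRIX = {
    (True, True): DataCategory.RELATIONAL_BIG_DATA,
    (True, False): DataCategory.MULTI_STRUCTURED_BIG_DATA,
    (False, True): DataCategory.CONVENTIONAL_RELATIONAL_DATA,
    (False, False): DataCategory.SMALLER_POLY_STRUCTURED_DATA,
}


def categorize(high_volume, fits_relational):
    """Return the term for one cell of the bigness/structure matrix."""
    return _MATRIX[(bool(high_volume), bool(fits_relational))]
```

<p>For instance, <code>categorize(True, False)</code> returns the multi-structured big data cell, which is where log files would land.</p>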
<p><span id="more-5768"></span>Notes on all this include:</p>
<ul>
<li>&#8220;Relational big data&#8221; is commonly what you need a scalable analytic relational DBMS for. But there are non-analytic use cases as well.</li>
<li>The paradigmatic example of &#8220;multi-structured big data&#8221; is log files. Thus, multi-structured big data is commonly what you need a <a href="../../../../../2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a> for.</li>
<li>One might want to equate non-analytic relational big data technology to &#8220;NewSQL&#8221;. However, I&#8217;m struggling to think of a database size range in which the entire NewSQL industry can match Oracle&#8217;s market share alone.</li>
<li>One might want to equate non-analytic multi-structured big data technology to &#8220;NoSQL&#8221;. However:
<ul>
<li>&#8220;NoSQL&#8221; is also used to encompass not-so-big-data use cases, such as prototyping in MongoDB.</li>
<li><a href="../../../../../2011/10/02/defining-nosql/">&#8220;NoSQL&#8221; has non-ACID/low(er)-data-integrity connotations</a> that aren&#8217;t appropriate for all non-relational systems.</li>
</ul>
</li>
<li>Up to a point, you can analyze relational big data in a conventional relational DBMS, but an analytic RDBMS will usually win on TCO (Total Cost of Ownership). In particular, reasonable thresholds for moving an analytic database off Oracle might be:
<ul>
<li>1-2 terabytes if you&#8217;ve never bought anything past Oracle Standard Edition.</li>
<li>5-10 terabytes if you&#8217;re already paying for Oracle Enterprise Edition.</li>
<li>A lot higher than that if you actually find Oracle Exadata to be cost-effective.</li>
</ul>
</li>
<li>Depending on how big one acknowledges as &#8220;big&#8221;, the market share leader in &#8220;big bit bucket&#8221; use cases is either Splunk or Hadoop.</li>
<li>If we look at multi-structured big data management overall, MarkLogic joins the list of market share contenders, as do various NoSQL alternatives.</li>
<li>It is wrong to say that the large web companies invented &#8220;big data&#8221; technology. But it is more reasonable to say they invented much of &#8220;multi-structured big data&#8221; management. In particular (and this is just a partial list), Google, Amazon, Yahoo, Facebook, et al. can reasonably be credited with Hadoop, Cassandra, HBase and various predecessors to same.</li>
</ul>
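<p>The Oracle thresholds listed in the notes above can be captured in a small helper. This is a rough Python sketch, not a sizing tool: the function name, the edition labels, and the three result strings are all hypothetical, and the Exadata case deliberately returns no verdict because the notes give no number for it.</p>

```python
# Threshold ranges in terabytes, taken from the notes above.
# Exadata is omitted on purpose: the notes only say "a lot higher".
ORACLE_THRESHOLDS_TB = {
    "standard": (1, 2),     # never bought anything past Standard Edition
    "enterprise": (5, 10),  # already paying for Enterprise Edition
}


def migration_signal(edition, size_tb):
    """Compare an analytic database's size with the suggested range."""
    if edition not in ORACLE_THRESHOLDS_TB:
        return "no figure given"
    low, high = ORACLE_THRESHOLDS_TB[edition]
    if size_tb >= high:
        return "past the threshold range"
    if size_tb >= low:
        return "within the threshold range"
    return "below the threshold range"
```

<p>The ranges are reasonable rules of thumb rather than hard cutoffs, so a result of &#8220;within the threshold range&#8221; means only that an analytic RDBMS is worth pricing out.</p>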
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/01/08/big-data-terminology-and-positioning/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>MarkLogic&#8217;s Hadoop connector</title>
		<link>http://www.dbms2.com/2011/11/03/marklogic-hadoop-connector/</link>
		<comments>http://www.dbms2.com/2011/11/03/marklogic-hadoop-connector/#comments</comments>
		<pubDate>Fri, 04 Nov 2011 00:58:06 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Clustering]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Workload management]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5585</guid>
		<description><![CDATA[It&#8217;s time to circle back to a subject I skipped when I otherwise wrote about MarkLogic 5: MarkLogic&#8217;s new Hadoop connector. Most of what&#8217;s confusing about the MarkLogic Hadoop Connector lies in two pairs of options it presents you: Hadoop can talk XQuery to MarkLogic. But alternatively, Hadoop can use a long-established simple(r) Java API [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s time to circle back to a subject I skipped when I otherwise wrote about <a href="http://www.dbms2.com/2011/11/01/marklogic-version-5/">MarkLogic 5</a>: MarkLogic&#8217;s new Hadoop connector.</p>
<p>Most of what&#8217;s confusing about the MarkLogic Hadoop Connector lies in two pairs of options it presents you:</p>
<ul>
<li>Hadoop can talk XQuery to MarkLogic. But alternatively, Hadoop can use a long-established simple(r) Java API for streaming documents into or out of a MarkLogic database.</li>
<li>Hadoop can make requests to MarkLogic in MarkLogic&#8217;s normal mode of operation, namely to address any node in the MarkLogic cluster, which then serves as a &#8220;head&#8221; node for the duration of that particular request. But alternatively, Hadoop can use a long-standing MarkLogic option to circumvent the whole DBMS cluster and only talk to one specific MarkLogic node.</li>
</ul>
<p>Otherwise, the whole thing is just what you would think:</p>
<ul>
<li>Hadoop can read from and write to MarkLogic, in parallel at both ends.</li>
<li>If Hadoop is just writing to MarkLogic, there&#8217;s a good chance the process is properly called &#8220;ETL.&#8221;</li>
<li>If Hadoop is reading a lot from MarkLogic, there&#8217;s a good chance the process is properly called &#8220;batch analytics.&#8221;</li>
</ul>
<p>MarkLogic said that it wrote this Hadoop connector itself.</p>
<p><span id="more-5585"></span>When I realized MarkLogic was claiming the ability to seamlessly integrate short-request and batch analytic processing, I asked about workload management. I gathered that:</p>
<ul>
<li>MarkLogic believes that MarkLogic 5 does a great job of granular workload monitoring.</li>
<li>However, MarkLogic doesn&#8217;t have a strong workload management administrative interface. Rather, you may have to do workload management programmatically.</li>
</ul>
<p>Overall, I think the MarkLogic Hadoop connector could prove pretty useful. The first question I ask somebody who wants to process relational data in Hadoop is &#8220;Why not just an analytic RDBMS?&#8221; But the natural use cases for MarkLogic are often ones in which you might as well do your analytics in Hadoop. One is an insurance-industry example I recently encountered, with 4 billion Word, PDF, and image documents, for which <a href="../../../../../2011/10/10/text-data-management-part-2-general-and-short-request/">I favor MarkLogic over both MongoDB and straight Hadoop</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/03/marklogic-hadoop-connector/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>MarkLogic 5, and why you might care</title>
		<link>http://www.dbms2.com/2011/11/01/marklogic-version-5/</link>
		<comments>http://www.dbms2.com/2011/11/01/marklogic-version-5/#comments</comments>
		<pubDate>Tue, 01 Nov 2011 04:03:59 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Structured documents]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5560</guid>
		<description><![CDATA[MarkLogic is releasing MarkLogic 5. Key elements of the announcement are: More-of-the-same in line with MarkLogic’s core positioning. A new bi-directional Hadoop connector. A free MarkLogic Express edition, limited in license terms more than in actual features, as per Slide 27 of the deck MarkLogic graciously supplied for me to post. Also, MarkLogic is early [...]]]></description>
			<content:encoded><![CDATA[<p>MarkLogic is releasing MarkLogic 5. Key elements of the announcement are:</p>
<ul>
<li>More-of-the-same in line with MarkLogic’s core positioning.</li>
<li>A new bi-directional Hadoop connector.</li>
<li>A free MarkLogic Express edition, limited in license terms more than in actual features, as per Slide 27 of <a href="http://www.monash.com/uploads/MarkLogic-5-Deck.pptx">the deck MarkLogic graciously supplied for me to post</a>.</li>
</ul>
<p>Also, MarkLogic is early with a feature that most serious DBMS vendors will soon have – support for tiered storage, with writes going first to solid-state storage, then being flushed to disk via a caching-style algorithm.* And as befits a sometime search-engine-substitute, MarkLogic has finally licensed a large set of document filters, from an Australian company called <a href="http://www.isys-search.com/index.html">Isys</a>. Apparently, the special virtue of the Isys filters is that they’re good at extracting not only text, but metadata as well.</p>
<p><em>*If there’s a caching algorithm that doesn’t contain a major element of LRU (Least Recently Used), I don’t recall ever hearing about it.</em></p>
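<p>To make the tiered-storage idea concrete, here is a toy Python sketch of writes landing in a small fast tier and the least recently used entries being flushed to a slow tier. It is an assumption-laden illustration, not a description of MarkLogic’s implementation: the class name <code>TieredStore</code>, the in-memory dictionaries standing in for solid-state storage and disk, and the pure-LRU policy are all hypothetical.</p>

```python
from collections import OrderedDict


class TieredStore:
    """Toy two-tier store: writes hit a fast tier, LRU entries flush to a slow tier."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # stands in for solid-state storage
        self.slow = {}             # stands in for disk

    def write(self, key, value):
        # Every write goes to the fast tier first.
        self.fast[key] = value
        self.fast.move_to_end(key)  # mark as most recently used
        # Flush least recently used entries once the fast tier overflows.
        while len(self.fast) > self.fast_capacity:
            old_key, old_value = self.fast.popitem(last=False)
            self.slow[old_key] = old_value

    def read(self, key):
        # The fast tier always holds the newest copy, so check it first.
        if key in self.fast:
            self.fast.move_to_end(key)  # a read also counts as a use
            return self.fast[key]
        return self.slow[key]
```

<p>With a capacity of two, writing <code>a</code> and <code>b</code>, reading <code>a</code>, and then writing <code>c</code> flushes <code>b</code> to the slow tier, because <code>b</code> is the entry that has gone longest without being touched.</p>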
<p>MarkLogic seems to have settled on a positioning that, although distressingly buzzword-heavy, is at least partly based upon reality. The real part includes:</p>
<ul>
<li>MarkLogic is a serious, enterprise-class DBMS (see for example Slide 12 of <a href="http://www.monash.com/uploads/MarkLogic-5-Deck.pptx">the MarkLogic deck</a>) …</li>
<li>… which has been optimized from the get-go for <a href="../../../../../2011/05/17/poly-structured-database/">poly-structured data</a>.</li>
<li>MarkLogic can and does scale out to handle large amounts of data.</li>
<li>MarkLogic is a general-purpose DBMS, suitable for <a href="../../../../../2011/03/30/short-request-and-analytic-processing/">both short-request and analytic tasks</a>.</li>
<li>MarkLogic is particularly well suited for analyses with long chains of “progressive enhancement” (MarkLogic’s favorite term when talking about <a href="../../../../../2011/05/30/another-category-of-derived-data/">derived data</a>).</li>
<li><a href="http://blogs.avalonconsult.com/blog/search/is-marklogic-a-search-engine/">MarkLogic often plays the role of a content assembler and/or search engine</a>, and the people who use MarkLogic in those ways are commonly doing things that can be described as research and analysis.</li>
</ul>
<p>Based on that reality, MarkLogic talks a lot about Volume, Velocity, Variety, Big Data, unstructured data, semi-structured data, and big data analytics.</p>
<p><span id="more-5560"></span><em>My <a href="../../../../../2010/11/29/marklogic-and-its-document-dbms/">November, 2010 overview of MarkLogic technology</a> remains pretty relevant. One correction, however: Node heterogeneity configurations, in which “data” and “evaluation” nodes reside on separate servers, are the exception rather than the rule.</em></p>
<p>Like <a href="../../../../../2011/10/18/vertica-community-edition/">Vertica</a>, MarkLogic has laudably said that true academic researchers can get MarkLogic for free without the severe license restrictions. Free MarkLogic should be of particular interest to researchers who:</p>
<ul>
<li>Are studying natural networks or graphs, such as social networks or biological pathways. (This might be a fit in the social or biological sciences.)</li>
<li>Are managing metadata for, say, a variety of disparate kinds of experimental files. (This might be a fit anywhere in the natural sciences.)</li>
<li>Are managing actual documents, images, videos, etc., or data about such things. (This might be a fit in the humanities or social sciences.)</li>
</ul>
<p>MarkLogic provided some disclosable financial substance by email, which I shall quote verbatim:</p>
<ul>
<li><em>MarkLogic has 45% revenue growth and 55-60% license growth year over year.</em></li>
<li><em>We expect to finish this year with over $85 million in revenue, up from $55 million last year.</em></li>
</ul>
<p>Arithmetical purists might note that 85/55 is more than 145%, but I’m just going to settle for the information I got and move on.</p>
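<p>For reference, the arithmetic behind that remark can be checked with a short Python sketch. The function name is illustrative, and the inputs are simply the figures quoted above ($55 million, $85 million, and 45% growth).</p>

```python
def growth_percent(previous, current):
    """Year-over-year growth, expressed as a percentage."""
    return (current / previous - 1) * 100


# Growth implied by going from $55 million to $85 million.
implied = growth_percent(55, 85)   # roughly 54.5

# Revenue that 45% growth on $55 million would actually produce.
at_45_percent = 55 * 1.45          # roughly 79.75 (in $ millions)

print(f"implied growth: {implied:.1f}%")
print(f"revenue at 45% growth: ${at_45_percent:.2f} million")
```

<p>So the revenue figures imply growth of about 54.5%, which sits closer to the quoted 55-60% license growth than to the quoted 45% revenue growth.</p>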
<p><em>Edit: I posted separately about the <a href="http://www.dbms2.com/2011/11/03/marklogic-hadoop-connector/">MarkLogic Hadoop connector.</a></em> <span style="text-decoration: line-through;">As for that Hadoop connector – stay tuned for a short follow-up post, as writing about it now would not be convenient. (My backup discipline isn’t what it should be, and the only copy of my notes about that product is on a heavy tower computer in a house that doesn’t have working power.)</span></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/01/marklogic-version-5/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Text data management, Part 2: General and short-request</title>
		<link>http://www.dbms2.com/2011/10/10/text-data-management-part-2-general-and-short-request/</link>
		<comments>http://www.dbms2.com/2011/10/10/text-data-management-part-2-general-and-short-request/#comments</comments>
		<pubDate>Tue, 11 Oct 2011 01:58:49 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5419</guid>
		<description><![CDATA[This is Part 2 of a three-post series. The posts cover: Confusion about text data management. Choices for text data management (general and short-request). Choices for text data management (analytic). I&#8217;ve recently given widely varied advice about managing text (and similar files &#8212; images and so on), ranging from Sure, just keep going with [...]]]></description>
			<content:encoded><![CDATA[<p><em>This is Part 2 of a three-post series. The posts cover:</em></p>
<ol>
<li> <em><a href="../2011/10/10/text-data-management-confusion/">Confusion about text data management</a>.</em></li>
<li><em><a href="../2011/10/10/text-data-management-part-2-general-and-short-request/">Choices for text data management (general and short-request)</a>.</em></li>
<li><em><a href="../2011/10/10/text-data-management-part-3-analytic-and-progressively-enhanced/">Choices for text data management (analytic)</a>.</em></li>
</ol>
<p>I&#8217;ve recently given widely varied advice about managing text (and similar files &#8212; images and so on), ranging from</p>
<blockquote><p>Sure, just keep going with your old strategy of keeping .PDFs in the file system and pointing to them from the relational database. That&#8217;s an easy performance optimization vs. having the RDBMS manage them as BLOBs.</p></blockquote>
<p>to</p>
<blockquote><p>I suspect MongoDB isn&#8217;t heavyweight enough for your document management needs, let alone just dumping everything into Hadoop. Why don&#8217;t you take a look at MarkLogic?</p></blockquote>
<p>Here are some reasons why.</p>
<p>There are three basic kinds of text management use case:</p>
<ul>
<li><strong>Text as payload. </strong></li>
<li><strong>Text as search parameter.</strong></li>
<li><strong>Text as analytic input.</strong></li>
</ul>
<p><span id="more-5419"></span>The simplest way to manage text electronically is to:</p>
<ul>
<li>Store it as whole documents &#8212; scanned images, .PDFs, word processing files, whatever.</li>
<li>Find it only via fielded metadata, perhaps manually created &#8212; title, author, date, and so on.</li>
</ul>
<p>For example, an application for college admission is accompanied by recommendation letters, transcripts, and so on; those are then moved around as dumb payloads, until such time as an admissions officer reads them. Most relational database management systems can manage BLOBs (Binary Large OBjects), but performance may be better if you leave the big objects outside the relational system. For text-as-payload, that way of managing documents can often suffice.</p>
<p>In other cases, the text may be so short that it naturally fits into a character field in a relational database. This is particularly likely when the text is typed in at the time of record creation, e.g. by a call center operator, a doctor, or a customer entering a support ticket. In such cases, leaving it under the management of an RDBMS makes perfect sense.</p>
<p>In many situations you actually want to search based on the content of the text. Unless you&#8217;re doing simple search on short text snippets in relational character fields, that generally calls for some kind of <strong>text index.</strong> Text indexes are generally found in text search engines. That said, however:</p>
<ul>
<li>Anything that can be done in a standalone text search engine can in principle also be integrated into relational DBMS and other data stores. (How well that works in practice is another matter.)</li>
<li>Text search engines commonly index data <em>in situ</em> that they don&#8217;t actually manage.</li>
</ul>
<p>In theory, then:</p>
<ul>
<li>You can store your text in your chosen DBMS or outside it, as pleases you.</li>
<li>Your chosen DBMS can have a text search capability.</li>
<li>This text search capability can be integrated with the query method used to get at the rest of the data managed by that DBMS.</li>
</ul>
<p>That theory is commonly reflected in actual products, such as Oracle and DB2.</p>
<p>As so often, then, the choice of how to manage text comes down to issues such as performance, programming ease, and other components of total cost of ownership (or of some other general &#8220;goodness&#8221; metric, such as time-to-value). As a general rule, it seems:</p>
<ul>
<li>Text indexing inside relational DBMS has poorer performance than in, say, text search engines, often drastically so.</li>
<li>BLOB management inside relational databases has poorer performance than leaving the files outside the DBMS&#8217; purview.</li>
<li>Relational DBMS do just fine at managing text strings up to, say, 2048 characters long.</li>
<li>Tight integration between text search and SQL is valuable in a few applications, but irrelevant to many others.</li>
</ul>
<p>And so, to a first approximation:</p>
<ul>
<li><strong>If you just have short text snippets</strong>, it can make sense to <strong>leave them in your relational database.</strong></li>
<li><strong>If performance is not an issue,</strong> you can just leave your BLOBs in your relational database too.</li>
<li><strong>If performance is an issue,</strong> you probably want to have <strong>your larger text files outside your RDBMS&#8217; control.</strong></li>
</ul>
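<p>That first approximation can be written down as a tiny decision function. This is a Python sketch under stated assumptions: the function name and return strings are hypothetical, and the 2048-character cutoff is just the &#8220;say, 2048 characters&#8221; figure from the list above, not a hard limit of any particular RDBMS.</p>

```python
# Rough cutoff for "short text snippets"; an illustrative figure, not a hard limit.
SHORT_TEXT_MAX_CHARS = 2048


def suggest_text_storage(text_length_chars, performance_matters):
    """Apply the three first-approximation rules for a relational default."""
    if SHORT_TEXT_MAX_CHARS >= text_length_chars:
        # Short snippets can stay in an ordinary character field.
        return "relational character field"
    if not performance_matters:
        # Larger documents can stay in the database as BLOBs.
        return "BLOB in the relational database"
    # Otherwise keep the files outside and point to them from the database.
    return "file outside the RDBMS, referenced from it"
```

<p>A 500-character support ticket comes back as a relational character field, while a multi-megabyte PDF in a performance-sensitive application comes back as a file outside the RDBMS.</p>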
<p>However, that phrasing assumes the default option is a relational DBMS, which may not be the case at all. Other choices include:</p>
<ul>
<li><strong>Standalone text search engines.</strong> If you want the best available text search, get a search engine. But attempts to start with a search engine and wind up with an application platform have generally run into difficulty.</li>
<li><strong>Document-oriented (or other) NoSQL </strong>systems. The story here is surprisingly like that for relational DBMS. I&#8217;ve previously noted that <a href="../../../../../2011/02/07/notes-on-document-oriented-nosql/">document-oriented NoSQL systems manage objects, not &#8220;documents&#8221; in the ordinary sense of the word</a>. Even so, conceptually they&#8217;re no less suited for management of true documents than relational DBMS are. I&#8217;d guess that the correlation between use cases involving true documents and use cases where document-oriented NoSQL is suitable is positive, but not very strong.</li>
<li><strong>MarkLogic.</strong> From one standpoint, MarkLogic is just a heavier-weight version of document-oriented NoSQL. But MarkLogic&#8217;s XML (and XQuery) orientation, tuned-for-years indexing, and built-in search engine put it on a different level for document management than the upstarts have reached.</li>
</ul>
<p>I&#8217;ll cover the Hadoop option in the next, more analytically-focused post.</p>
<p>I hope I&#8217;ve demonstrated that there are appropriate use cases for each of:</p>
<ul>
<li>Letting documents be managed by the file system (and pointing to them from your preferred DBMS).</li>
<li>Sticking documents straight into your preferred DBMS (SQL or non-SQL as the case may be).</li>
<li>Using a specialty system such as MarkLogic (or of course, in some cases, an enterprise search engine).</li>
</ul>
<p>And that&#8217;s even before we move on to<em> analytically-oriented text data management.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/10/10/text-data-management-part-2-general-and-short-request/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Text data management, Part 1: Confusion</title>
		<link>http://www.dbms2.com/2011/10/10/text-data-management-confusion/</link>
		<comments>http://www.dbms2.com/2011/10/10/text-data-management-confusion/#comments</comments>
		<pubDate>Tue, 11 Oct 2011 01:58:03 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Archiving and information preservation]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5421</guid>
		<description><![CDATA[This is Part 1 of a three-post series. The posts cover: Confusion about text data management. Choices for text data management (general and short-request). Choices for text data management (analytic). There&#8217;s much confusion about the management of text data, among technology users, vendors, and investors alike. Reasons seem to include: The terminology around text [...]]]></description>
			<content:encoded><![CDATA[<p><em>This is Part 1 of a three-post series. The posts cover:</em></p>
<ol>
<li> <em><a href="http://www.dbms2.com/2011/10/10/text-data-management-confusion/">Confusion about text data management</a>.</em></li>
<li><em><a href="http://www.dbms2.com/2011/10/10/text-data-management-part-2-general-and-short-request/">Choices for text data management (general and short-request)</a>.</em></li>
<li><em><a href="http://www.dbms2.com/2011/10/10/text-data-management-part-3-analytic-and-progressively-enhanced/">Choices for text data management (analytic)</a>.</em></li>
</ol>
<p>There&#8217;s much confusion about the management of text data, among technology users, vendors, and investors alike. Reasons seem to include:</p>
<ul>
<li>The terminology around text data is inaccurate.</li>
<li>Data volume estimates for text are misleading.</li>
<li>Multiple different technologies are in the mix, including:
<ul>
<li>Enterprise text search.</li>
<li>Text analytics &#8212; <a href="http://www.texttechnologies.com/category/software-as-a-service-saas/category/text-mining/">text mining</a>, sentiment analysis, etc.</li>
<li>Document stores &#8212; e.g. document-oriented NoSQL, or MarkLogic.</li>
<li>Log management and parsing &#8212; e.g. Splunk.</li>
<li>Text archiving &#8212; e.g., various specialty email archiving products I couldn&#8217;t even name.</li>
<li>Public web search &#8212; Google et al.</li>
</ul>
</li>
<li>Text search vendors have disappointed, especially technically.</li>
<li>Text analytics vendors have disappointed, especially financially.</li>
<li>Other analytic technology vendors ignore <a href="http://www.texttechnologies.com/2010/12/01/state-of-the-art-text-analytics-mining-applications/">what the text analytic vendors actually have accomplished</a>, and reinvent inferior wheels rather than OEM the state of the art.</li>
</ul>
<p>Above all: <a href="http://www.dbms2.com/2011/10/10/text-data-management-part-2-general-and-short-request/">The use cases for text data vary greatly</a>, just as the use cases for simply-structured databases do.</p>
<p>There are probably fewer people now than there were six years ago who need to be told that <a href="http://www.dbms2.com/2005/12/09/relational-dbms-versus-text-data/">text and relational database management are very different things</a>. Other misconceptions, however, appear to be on the rise. Specific points that are commonly overlooked include: <span id="more-5421"></span></p>
<ul>
<li><strong> The terms &#8220;unstructured&#8221; or &#8220;semi-structured&#8221; data are inherently misleading</strong>. That&#8217;s why <a href="../../../../../2011/05/17/poly-structured-database/">I favor &#8220;multi-structured&#8221; or &#8220;poly-structured&#8221; instead</a>. (&#8220;Multi-structured&#8221; seems to be winning; e.g., it&#8217;s been adopted by Teradata and Teradata/Aster.)</li>
<li>The &#8220;social media&#8221; text data any one enterprise brings in house isn&#8217;t all that much. For example, <a href="../../../../../2011/04/14/attensity-update/">Attensity serves many different enterprises&#8217; social media needs from a single 20-terabyte data store</a>, and reports that no single enterprise has required as much as 1 terabyte of text yet. <strong>Text data may consume a lot of storage </strong>on spinning disks somewhere,<strong> but it&#8217;s not that big a factor in future DBMS industry growth.</strong> (That 20 terabyte figure does seem low.)</li>
<li><strong>Structured databases are typically worth a lot more per bit than other kinds.</strong> The most valuable electronic data, per-bit, is probably records of significant economic transactions &#8212; purchases, sales, money transfers, etc. The least valuable may be sensor log files, whose contents consist mainly of &#8220;Nothing going on here; ping you again in a minute.&#8221; Email logs, web interaction data and many other kinds fall somewhere in between. Highly valuable documents &#8212; such as signed contracts &#8212; generally persist in paper as well as electronic forms. <strong>Investors commonly overlook this point.</strong></li>
<li><strong>The enterprise text search industry is screwed up.</strong>
<ul>
<li>FAST was a goofy company before it was acquired for far too much money by Microsoft.</li>
<li>Autonomy was a goofy company before it was acquired for far too much money by HP.</li>
<li>Google&#8217;s enterprise efforts are quiet.</li>
<li>The integration of text search and relational DBMS &#8212; e.g. at Oracle &#8212; has languished, with poor performance and evident lack of management attention.</li>
<li>Smaller text search vendors don&#8217;t seem to be getting a lot of traction &#8212; e.g., <a href="http://www.texttechnologies.com/category/vendors/coveo/">Coveo</a> has a decent reputation, but when&#8217;s the last time you heard much about them? What has Attivio actually accomplished?</li>
</ul>
</li>
<li><strong>Text analytics is a small business</strong>. Add up the revenue for Attensity, Clarabridge, Lexalytics, Temis, and all the others, and you might poke above $100 million, especially now that Attensity has been through a three-way merger. Then again, you might not.</li>
<li>Even so, <strong>the text analytics vendors have developed sophisticated technology.</strong> In particular, you can use it to get a pretty good idea as to what people are writing about you, individually or as groups.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/10/10/text-data-management-confusion/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Defining NoSQL</title>
		<link>http://www.dbms2.com/2011/10/02/defining-nosql/</link>
		<comments>http://www.dbms2.com/2011/10/02/defining-nosql/#comments</comments>
		<pubDate>Mon, 03 Oct 2011 00:32:02 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Object]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Schooner Information Technology]]></category>
		<category><![CDATA[dbShards and CodeFutures]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5394</guid>
		<description><![CDATA[A reporter tweeted:  &#8221;Is there a simple plain English definition for NoSQL?&#8221; After reminding him of my cynical yet accurate Third Law of Commercial Semantics, I gave it a serious try, and came up with the following. More precisely, I tweeted the bolded parts of what&#8217;s below; the rest is commentary added for this post. NoSQL [...]]]></description>
			<content:encoded><![CDATA[<p>A reporter tweeted: &#8220;Is there a simple plain English definition for NoSQL?&#8221; After reminding him of my cynical yet accurate <a href="http://www.strategicmessaging.com/no-market-categorization-is-ever-precise/2011/03/01/">Third Law of Commercial Semantics</a>, I gave it a serious try, and came up with the following. More precisely, I tweeted the bolded parts of what&#8217;s below; the rest is commentary added for this post.</p>
<p><strong>NoSQL is most easily defined by what it excludes: SQL, joins, strong analytic alternatives to those, and some forms of database integrity. If you leave all four out, and you have a strong scale-out story, you&#8217;re in the NoSQL mainstream.</strong>   <span id="more-5394"></span></p>
<ul>
<li>Thus, I&#8217;d say Cassandra, HBase, MongoDB, and Couchbase are prime examples, in no particular order. Riak as well.</li>
<li>I might have phrased that better if I&#8217;d used a different word than simply &#8220;strong&#8221; &#8212; but hey, there was a 140-character limit, and he was on deadline.</li>
</ul>
<p><strong>Using NoSQL can make sense when at least one of two things is paramount: low-cost scale-out or dynamic schemas.</strong></p>
<ul>
<li>There are some seriously sensible use cases for <a href="../../../../../2011/07/31/dynamic-fixed-schema-databases/">dynamic schemas</a>.</li>
<li>&#8220;Low-cost&#8221; generally boils down to:
<ul>
<li>Performance.</li>
<li>Open source free-like-beer.</li>
<li>Not a lot of database administration.</li>
</ul>
</li>
</ul>
<p>I&#8217;ve generally given object-oriented DBMS vendors and also MarkLogic hard times whenever they consider saying they&#8217;re &#8220;NoSQL&#8221;. Reasons include:</p>
<ul>
<li>Closed source.</li>
<li>Database administration overhead (even if you get good stuff for incurring that overhead, like MarkLogic&#8217;s comprehensive indexing).</li>
</ul>
<p>Also, NoSQL started out being ACID-unfriendly.</p>
<p><strong>What you give up are the query flexibility and the easily automatic data integrity of SQL-based systems.</strong> I should have added something about a mature ecosystem.</p>
<p>In the most recent live example, I influenced a <a href="../../../../../2011/09/19/oltp-disk-solid-state/">client</a> away from Cassandra and toward scale-out MySQL (dbShards and/or Schooner flavors, most likely). Part of the reason was the ability to do joins, which are useful in their application. Another part is that their development practices obviated any significant benefit from dynamic schemas. But perhaps the most important &#8212; or at least resonant &#8212; reason of all was that they really, really cared about .NET support.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/10/02/defining-nosql/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Derived data, progressive enhancement, and schema evolution</title>
		<link>http://www.dbms2.com/2011/09/06/derived-data-progressive-enhancement-and-schema-evolution/</link>
		<comments>http://www.dbms2.com/2011/09/06/derived-data-progressive-enhancement-and-schema-evolution/#comments</comments>
		<pubDate>Tue, 06 Sep 2011 08:10:23 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5177</guid>
		<description><![CDATA[The emphasis I&#8217;m putting on derived data is leading to a variety of questions, especially about how to tease apart several related concepts: Derived data. Many-step processes to produce derived data. Schema evolution. Temporary data constructs. So let&#8217;s dive in.  When I started my discussion of derived data, I focused on five kinds: Aggregates, when [...]]]></description>
			<content:encoded><![CDATA[<p>The emphasis I&#8217;m putting on derived data is leading to a variety of questions, especially about how to tease apart several related concepts:</p>
<ul>
<li>Derived data.</li>
<li>Many-step processes to produce derived data.</li>
<li>Schema evolution.</li>
<li>Temporary data constructs.</li>
</ul>
<p>So let&#8217;s dive in.  <span id="more-5177"></span></p>
<p>When I started <a href="../../../../../2010/11/29/data-that-is-derived-augmented-enhanced-adjusted-or-cooked/">my discussion of derived data</a>, I focused on five kinds:</p>
<blockquote>
<ul>
<li>Aggregates, when they are maintained, generally for reasons of performance or response time.</li>
<li>Calculated scores, commonly based on data mining/predictive analytics.</li>
<li>Text analytics.</li>
<li>The kinds of ETL (Extract/Transform/Load) Hadoop and other forms of MapReduce are commonly used for.</li>
<li>Adjusted data, especially in scientific contexts.</li>
</ul>
</blockquote>
<p>Later I added a sixth kind:</p>
<ul>
<li><a href="../2011/05/30/another-category-of-derived-data/">Derived metadata</a>, commonly for polystructured data sets (logs, text, images, video, whatever).</li>
</ul>
<p>More kinds may yet follow.</p>
<p>In all cases, I was (and am) talking about data that is actually persisted into the database. Temporary tables &#8212; for example the kind frequently created by MicroStrategy &#8212; are also important in data processing, as is <a href="../../../../../2010/08/16/vertica-flash-temp-space/">temp space managed solely for the convenience of the DBMS</a>. But neither is what I mean when I talk about &#8220;derived data.&#8221;</p>
<p>As I noted back in June, <a href="../../../../../2011/06/19/investigative-analytics-derived-data/">derived data naturally leads to schema evolution</a>. You load data into an analytic database. You do some analysis and get some interesting results &#8212; interesting enough for you to want to keep them persistently. So you extend the schema to include them. You do more research; you discover something else interesting; you extend the schema again. Repeat as needed.</p>
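<p>The load, analyze, extend-the-schema loop described above can be sketched in a few lines. This is only an illustrative sketch using Python&#8217;s built-in <code>sqlite3</code> module; the <code>purchases</code> and <code>customers</code> tables, the column names, and the 100-unit threshold are all hypothetical, and a real analytic DBMS would handle the schema change in its own way.</p>

```python
import sqlite3


def evolve_schema_with_derived_score():
    """Load raw facts, derive results, and extend the schema to persist them."""
    conn = sqlite3.connect(":memory:")

    # Step 1: load "just the raw facts" into a hypothetical purchases table.
    conn.execute("CREATE TABLE purchases (customer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO purchases VALUES (?, ?)",
        [("alice", 120.0), ("alice", 80.0), ("bob", 15.0)],
    )

    # Step 2: analysis yields something worth keeping (total spend per customer).
    totals = conn.execute(
        "SELECT customer, SUM(amount) FROM purchases GROUP BY customer"
    ).fetchall()

    # Step 3: extend the schema so the derived result is persisted.
    conn.execute("CREATE TABLE customers (customer TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO customers VALUES (?)", [(c,) for c, _ in totals])
    conn.execute("ALTER TABLE customers ADD COLUMN total_spend REAL")
    conn.executemany(
        "UPDATE customers SET total_spend = ? WHERE customer = ?",
        [(t, c) for c, t in totals],
    )

    # Step 4: more research, another interesting result, another schema change.
    conn.execute("ALTER TABLE customers ADD COLUMN is_big_spender INTEGER")
    conn.execute("UPDATE customers SET is_big_spender = (total_spend >= 100)")

    # Report the evolved schema and the persisted derived data.
    columns = [row[1] for row in conn.execute("PRAGMA table_info(customers)")]
    rows = conn.execute(
        "SELECT customer, total_spend, is_big_spender FROM customers ORDER BY customer"
    ).fetchall()
    conn.close()
    return columns, rows
```

<p>Each <code>ALTER TABLE</code> here stands in for one round of &#8220;discover something interesting, then keep it&#8221;; the point is only that persisted derived data and schema changes tend to arrive together.</p>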
<p>However, in no way is derived data the only source of analytic schema evolution. Duh. Sometimes you just have new kinds of information coming in. Of course, once it&#8217;s there, you may want to derive something from it. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  In <a href="../../../../../2010/06/08/profile-of-revealed-preferences/">marketing</a> contexts, both parts of that might be true in spades.</p>
<p>When I mentioned all this to my clients at MarkLogic &#8212; which was my inspiration for the polystructured/metadata example &#8212; they perked up and said &#8220;Oh! Progressive enhancement.&#8221; Indeed, it&#8217;s long been the case that a simple text processing pipeline could have &gt;15 steps of extraction; indeed, <a href="http://www.texttechnologies.com/2005/10/19/linkage-among-different-text-technologies/">I learned about the &#8220;tokenization chain&#8221; in 1997</a>. If all the &#8220;progression&#8221; in  the data enhancement occurs in a single processing run, that wouldn&#8217;t necessarily spawn much schema evolution. On the other hand, if you think of additional steps to add every now and then &#8212; in that case your schema might indeed evolve over time.</p>
<p>Somewhat similarly, <a href="../../../../../2009/10/10/enterprises-using-hadoo/">Hadoop can be used to run &#8220;aggregation pipelines&#8221; of many 10s of steps</a>. The output of the whole thing might be a relatively small number of fields. But again, if the number or nature of the fields changes over time, schemas will need to evolve accordingly.</p>
<p>So to sum up:</p>
<ul>
<li>Derived data &#8212; of multiple kinds &#8212; is very important.</li>
<li>If you want to increase the value you get from derived data, you might need to evolve your schema accordingly.</li>
<li>Data derivation sometimes happens to have long processing pipelines; those pipelines might happen to offer clues as to how to do yet better at data derivation in the future; those improvements might happen to lead to schema evolution over time.</li>
</ul>
<p>&#8220;Just the raw facts&#8221; analytic databases are, for the most part, obsolete.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/09/06/derived-data-progressive-enhancement-and-schema-evolution/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Another category of derived data</title>
		<link>http://www.dbms2.com/2011/05/30/another-category-of-derived-data/</link>
		<comments>http://www.dbms2.com/2011/05/30/another-category-of-derived-data/#comments</comments>
		<pubDate>Tue, 31 May 2011 03:53:10 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MarkLogic]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4580</guid>
		<description><![CDATA[Six months ago, I argued the importance of derived analytic data, saying &#8230; there’s no escaping the importance of derived/augmented/enhanced/cooked/adjusted data for analytic data processing. The five areas I have in mind are, loosely speaking: Aggregates, when they are maintained, generally for reasons of performance or response time. Calculated scores, commonly based on data mining/predictive [...]]]></description>
			<content:encoded><![CDATA[<p>Six months ago, I argued the importance of <a href="http://www.dbms2.com/2010/11/29/data-that-is-derived-augmented-enhanced-adjusted-or-cooked/">derived analytic data</a>, saying</p>
<blockquote><p>&#8230; there’s no escaping the importance of derived/augmented/enhanced/cooked/adjusted data for analytic data processing. The five areas I have in mind are, loosely speaking:</p>
<ul>
<li>Aggregates, when they are maintained, generally for reasons of performance or response time.</li>
<li>Calculated scores, commonly based on data mining/predictive analytics.</li>
<li>Text analytics.</li>
<li>The kinds of ETL (Extract/Transform/Load) Hadoop and other forms of MapReduce are commonly used for.</li>
<li>Adjusted data, especially in scientific contexts.</li>
</ul>
<p>Probably there are yet more examples that I am at the moment overlooking.</p></blockquote>
<p>Well, I did overlook at least one category. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>A surprisingly important kind of derived data is metadata, especially for large, <a href="http://www.dbms2.com/2011/05/17/poly-structured-database/">poly-structured</a> data sets. For example, CERN has vast quantities of experiment sensor data, stored as files; just the metadata alone fills <a href="../../../../../2009/10/03/issues-in-scientific-data-management/">over 10 terabytes in an Oracle database</a>. <a href="../../../../../2010/11/29/marklogic-and-its-document-dbms/">MarkLogic</a> is big on storing derived metadata, both on the publishing/media and intelligence sides of the business.</p>
<p><span id="more-4580"></span>Actually, what made me think of writing this post was a few conversations at MarkLogic&#8217;s April user conference. For example, MarkLogic likes to break lunch up into subject-specific tables, hosted either by a partner company, or by one of the analysts who is attending anyway. So they asked me to hold a table about having Hadoop and MarkLogic work together. When I showed up, I discovered that most of the users at the table worked for a single organization; what&#8217;s more, they were skeptical about the table&#8217;s discussion subject, and wanted to see if I could persuade them otherwise. I gently pointed out that I hadn&#8217;t actually picked the subject, and asked them what their use cases might be like. Those turned out to be classified &#8230;</p>
<p>&#8230; but have no fear! Your hero thought quickly, and soon was holding forth about various ways one might combine the two technologies for various intelligence tasks. The one that finally struck a chord was &#8212; you guessed it! &#8212; metadata management. It seems they had colleagues with a lot of machine-generated data maintained in Hadoop and, upon reflection, they thought MarkLogic might be a good way to manage the metadata for same.</p>
<p>So should metadata management be handled relationally? Looking at my <a href="../../../../../2011/05/29/when-to-use-relational-database-management-system/">first three tests for when going relational is a slam-dunk choice</a>:</p>
<ul>
<li>I don&#8217;t think the application suites exploiting derived metadata are complex enough to support a strong pro-relational bias.</li>
<li>I don&#8217;t think the benefits of normalization are intense enough to mandate relational storage. (Also, since provenance matters, some of the traditional benefits of normalization are obviated &#8212; you may actually want out-of-date information in some cases.)</li>
<li>There certainly are some cases where you can set up a fixed schema, have one row of metadata per object, and be happy. In those cases, a relational database likely suffices, and is probably the right choice, but &#8230;</li>
</ul>
<p>&#8230; I&#8217;m not sure how numerous the cases are where a simple, fixed database design isn&#8217;t a good fit. Thoughts?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/05/30/another-category-of-derived-data/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>What to do about &#8220;unstructured data&#8221;</title>
		<link>http://www.dbms2.com/2011/05/15/what-to-do-about-unstructured-data/</link>
		<comments>http://www.dbms2.com/2011/05/15/what-to-do-about-unstructured-data/#comments</comments>
		<pubDate>Sun, 15 May 2011 21:54:30 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Couchbase]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[MongoDB and 10gen]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4449</guid>
		<description><![CDATA[We hear much these days about unstructured or semi-structured (as opposed to) structured data. Those are misnomers, however, for at least two reasons. First, it&#8217;s not really the data that people think is un-, semi-, or fully structured; it&#8217;s databases.* Relational databases are highly structured, but the data within them is unstructured &#8212; just lists [...]]]></description>
			<content:encoded><![CDATA[<p>We hear much these days about <em>unstructured</em> or <em>semi-structured</em> (as opposed to <em>structured</em>) <em>data.</em> Those are misnomers, however, for at least two reasons. First<strong>, it&#8217;s not really the data that people think is un-, semi-, or fully structured; it&#8217;s databases.</strong>* Relational databases are highly structured, but the data within them is unstructured &#8212; just lists of numbers or character strings, whose only significance derives from the structure that the database imposes.</p>
<p><em>*Here I&#8217;m using the term &#8220;database&#8221; literally, rather than as a concise synonym for &#8220;database management system&#8221;. But see below.</em></p>
<p>Second, a more accurate distinction is<strong> not whether a database has one structure or none </strong>&#8211; it&#8217;s<strong> whether a database has one structure or many.</strong> The easiest way to see this is for databases that have clearly-defined schemas. A relational database has one schema (even if it is just the union of various unrelated sub-schemas); an XML database, however, can have as many schemas as it contains documents.</p>
<p>One small terminological problem is easily handled, namely that people don&#8217;t talk about true databases very often, at least when they&#8217;re discussing generalities; rather, they talk about data and DBMS.* So let&#8217;s talk of DBMS being &#8220;structured&#8221; singly or multiply or whatever, just as the databases they&#8217;re designed to manage are.</p>
<p><em>*And they refer to the DBMS as &#8220;databases,&#8221; because they don&#8217;t have much other use for the word. </em></p>
<p>All that said &#8212; I think that <strong>single vs. multiple database structures isn&#8217;t a bright-line binary distinction</strong>; rather, it&#8217;s a <strong>spectrum.</strong> For example:  <span id="more-4449"></span></p>
<ul>
<li>IMS is the most structured DBMS of all. The data in an IMS database is in a hierarchy, and that&#8217;s that.</li>
<li>CODASYL and other kinds of what used to be called <em>network</em> DBMS (before the word got so overloaded) &#8212; e.g. RDB, IDMS, or TOTAL &#8212; are/were almost as structured as IMS.</li>
<li>Relational databases were invented because their structure was more flexible than that of linked-list databases. The whole point of relational DBMS is that you can view the data in a multitude of ways. Still, I see classical relational databases as being toward the single-structure end of the spectrum. (I say &#8220;classical&#8221; because Oracle and DB2 actually can manage combinations of XML, text, and traditional relational tables, if you choose.)</li>
<li>A multivalue DBMS is a little more multi-structured than a relational one, because of how a field can be filled one or multiple times.</li>
<li><a href="../../../../../2010/10/06/ebay-followup-greenplum-out-teradata-10-petabytes-hadoop-has-some-value-and-more/">eBay Singularity</a> (as implemented on Teradata gear) has, in essence, two structures (that I know of). One structure is just the relational schema. The other is the structure you would get if each kind of name-value pair truly had its own column.</li>
<li>A Splunk collection of log data can reasonably be said to have a different structure for every type or source of log. It further can be said to have multiple structures in somewhat the same way that eBay Singularity does.</li>
<li>So-called <a href="../../../../../2011/02/07/notes-on-document-oriented-nosql/">document stores</a> can be very multi-structured. MongoDB, Couchbase, et al. let you have a different structure for every document, if you choose. The same goes for XML-based MarkLogic.</li>
<li>HBase and Cassandra are also very multi-structured. Theoretically, each record gets to decide which column sets it does or doesn&#8217;t fit into.</li>
</ul>
<p>As a general rule &#8212; the more structures a database can have at once, the easier it is to change those structures, even on the fly (e.g., by inserting yet another bit of self-describing data). Thus, I sometimes use the term <strong>polystructured </strong>instead of<strong> multi-structured </strong>or <strong>multistructured.</strong> Thoughts as to which term I should choose going forward would be much appreciated.</p>
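<p>To make the one-structure-versus-many distinction concrete, here is a minimal sketch in plain Python. It treats a record&#8217;s &#8220;structure&#8221; as simply its set of field names, which is a simplifying assumption for illustration; the helper names and sample records are hypothetical and do not reflect how any particular DBMS represents data.</p>

```python
def structure_of(record):
    """Return a record's structure as a sorted tuple of its field names."""
    return tuple(sorted(record))


def distinct_structures(records):
    """Collect the set of distinct structures present in a collection."""
    return {structure_of(r) for r in records}


# Single-structure collection: every record has the same fields,
# so missing information has to be stored as an explicit None.
relational_style = [
    {"id": 1, "name": "alice", "email": None},
    {"id": 2, "name": "bob", "email": "bob@example.com"},
]

# Multi-structured collection: each self-describing record brings its own fields.
document_style = [
    {"id": 1, "name": "alice"},
    {"id": 2, "name": "bob", "email": "bob@example.com"},
]

# Changing structure on the fly is just inserting another self-describing record.
document_style.append({"id": 3, "geo": [42.36, -71.06], "tags": ["log"]})

print(len(distinct_structures(relational_style)))
print(len(distinct_structures(document_style)))
```

<p>The first collection has one structure no matter how many rows it holds; the second picks up a new structure whenever a record with a new combination of fields is inserted, with no separate schema change required.</p>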
<p>As for an actual definition &#8212; well, here&#8217;s something I drafted 3 1/2 years ago but never published:</p>
<blockquote><p>These problems with the relational paradigm are big enough to be worth coining a word for – polystructured. Polystructured data is data with structure that:</p>
<ul>
<li>Can be exploited to provide most of the benefits of a highly structured database (e.g., a tabular/relational one) &#8230;</li>
<li>&#8230; but cannot be described in the concise, consistent form such highly structured systems require.</li>
</ul>
<p>Specifically, we’ll call a database “polystructured” if it is characterized by at least two of the following:</p>
<ol>
<li>Data suitable for being queried by simple predicate-based matching (e.g., equality to certain values, falling within ranges, etc.)</li>
<li>(Other) data suitable for being queried by more complex matching (e.g., text search relevancy rankings)</li>
<li>Subsets that are more neatly structured than the whole.</li>
</ol>
<p>Equivalently, we’ll just say that <strong>polystructured data is data that has considerable structure, but whose structure is in some important way unpredictable.</strong></p></blockquote>
<p>NoSQL document or &#8220;column&#8221; stores would satisfy #1 and #3, as would Splunk. MarkLogic would satisfy all three criteria. #1 + #2 is sort of like what happens when text queries are allowed to go against (groups of) relational columns &#8230; and the vagueness with which I&#8217;m saying that makes me suspect that at least the unbolded/first definition doesn&#8217;t really fly.</p>
<p>Finally, here&#8217;s what led up to those definitions (the whole thing is from the introduction to a never-completed white paper). Please forgive any  anachronisms in it. A number of the points in it have also been addressed in posts  here; e.g.,</p>
<ul>
<li>In December, 2005 I expounded on <a href="http://www.dbms2.com/2005/12/09/relational-dbms-versus-text-data/">the  mismatch between text data and the relational model</a>.</li>
<li>In June, 2010 I elucidated <a href="http://www.dbms2.com/2010/06/08/profile-of-revealed-preferences/">the  variety of data that could go into an individual&#8217;s marketing-oriented  profile</a>.</li>
<li>In February, 2008 I predicted that <a href="../2008/02/15/non-relational-database-management/">flexible-schema   DBMS would gain share</a>.</li>
</ul>
<blockquote><p><strong>The case for polystructured data</strong></p>
<p>Traditional computer databases amount to sets of records. There usually are a limited number of record formats, with each instance of a particular format containing parallel kinds of information. Business transactions, web page visits, instrument readings &#8211; whatever the nature of the information, application designers stick it into the simplest structure they think makes sense.</p>
<p>These records are arranged into a variety of data structures.</p>
<ul>
<li>Log files are widely used, especially to track web site visits, in other networking uses, and for other kinds of instrument readings.</li>
<li>Computer user administration is commonly in LDAP (Lightweight Directory Access Protocol) format.</li>
<li>There are still a lot of installations of legacy “linked-list” DBMS (DataBase Management Systems) such as IBM&#8217;s IMS.</li>
<li>Some decision support applications use data in multidimensional arrays.</li>
</ul>
<p>Even so, most new business applications are written over relational DBMS, in the well-known rows-and-tables paradigm.</p>
<p>There are good reasons for the dominance of the relational model and of rows and tables.  (Strictly speaking, “relational” equates neither to “rows and tables” nor to “SQL”, but in practice the three concepts are closely linked.) In particular:</p>
<ul>
<li>Data integrity is (fairly) easy to ensure.</li>
<li>From some standpoints, relational databases are flexible; you can construct almost any kind of query, without having to do any kind of database reorganization (except perhaps for performance).</li>
<li>SQL programmers are easy to find.</li>
<li>There&#8217;s simply been much more engineering effort invested in making good relational DBMS than in any other kind.</li>
</ul>
<p>But the relational database paradigm also has some major drawbacks.  Three of the big ones are:</p>
<ul>
<li>Queries must have strictly match/fail answers; there&#8217;s no natural way for a relational DBMS to handle “somewhat relevant” hits.</li>
<li>Relational databases can get cumbersome when large fractions of the potential data happen to be missing. (Hence the decades-long debates about the problems with NULL values.)</li>
<li>While you have good flexibility in querying against any particular data structure, you do have to predefine your structure before you start accepting input.</li>
</ul>
<p>The last point is why you wind up with all those NULL values in the first place; if a kind of information can be in any record in a set, the database is set up to assume that it&#8217;s present in all of them.  Or if you normalize your database so highly as to avert missing values, then you wind up with a huge number of tables, making queries (and updates) complicated from both the programmer&#8217;s and the machine&#8217;s standpoint.</p>
<p>Text apps suffer from RDBMS&#8217; inelegant handling of relevancy. What&#8217;s more, documents can have almost unlimited internal structures, in three senses:</p>
<ol>
<li>They can have chapters, sections, subsections, sidebars, footnotes, and so on, in any combination.</li>
<li>Semantic references can link words, phrases, sentences, and paragraphs in a near-infinite number of ways.</li>
<li>Documents can explicitly contain fielded data, such as numbers, addresses, dates, or geo-encodings.</li>
</ol>
<p>Another group of apps that suffer from RDBMS&#8217; limitations are in the area of personalization and similar fine-grained marketing analysis. Analysis of web clicks throws away most kinds of path information.  Analysis of written or verbal communication isn&#8217;t well-integrated with that of fielded data.  Different customers and prospects give different kinds of contact information, and are “touched” by different marketing initiatives; current systems do a poor job of integrating all that scattered information.</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/05/15/what-to-do-about-unstructured-data/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
		</item>
		<item>
		<title>Whither MarkLogic?</title>
		<link>http://www.dbms2.com/2011/04/05/whither-marklogic/</link>
		<comments>http://www.dbms2.com/2011/04/05/whither-marklogic/#comments</comments>
		<pubDate>Wed, 06 Apr 2011 02:06:51 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[Object]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Structured documents]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4168</guid>
		<description><![CDATA[My clients at MarkLogic have a new CEO, Ken Bado, even though former CEO Dave Kellogg was quite successful. If you cut through all the happy talk and side issues, the reason for the change is surely that the board wants to see MarkLogic grow faster, and specifically to move beyond its traditional niches of [...]]]></description>
			<content:encoded><![CDATA[<p>My clients at MarkLogic have a new CEO, Ken Bado, even though former CEO Dave Kellogg was quite successful. If you cut through all the happy talk and side issues, the reason for the change is surely that the board wants to see MarkLogic grow faster, and specifically to move beyond its traditional niches of publishing (especially technical publishing) and national intelligence.</p>
<p>So what other markets could MarkLogic pursue? Before Ken even started work, I sent over some thoughts. They included (but were not limited to):  <span id="more-4168"></span></p>
<ul>
<li>Everybody now knows that not all problems require a relational DBMS.  The NoSQL movement has seen to that.</li>
<li>Not everybody agrees that &#8220;heavyweight&#8221; enterprise DBMS are  needed for everything. The NoSQL movement has seen to that too.</li>
<li><a href="http://www.dbms2.com/2011/02/07/notes-on-document-oriented-nosql/">The &#8220;document&#8221;/&#8220;object&#8221; DBMS distinction has long been blurry</a>. XML is full of documents, but they&#8217;re really objects. The same goes for the JSON/quasi-JSON objects of CouchDB/Couchbase and MongoDB. Object-oriented DBMS vendors have dabbled in XML on and off over the years because of technical similarity. Etc.</li>
<li>MarkLogic has always focused on markets  where the database truly was about documents in the conventional sense &#8212; especially long text documents &#8212;  aka &#8220;content&#8221;. I always thought that focus was over-narrow.</li>
<li>There are various cases where law, regulation, compliance etc.  mandate the production of long text documents. I&#8217;m not sure MarkLogic has  penetrated those as well as it could have.</li>
<li>Graph DBMS  technology is going nowhere fast, largely because nobody has solved the  data distribution problem in cases big enough to need scale-out, and the  technology isn&#8217;t obviously needed in single-server cases. (But see my post on <a href="http://www.dbms2.com/2010/06/19/objectivity-infinite-graph/">Objectivity&#8217;s Infinite Graph</a>.) Even so,  graph-oriented apps are exploding, and MarkLogic should think about playing in the graph area, even if by acquisition.</li>
<li> I think what I  described in <a href="../../../../../2010/06/08/profile-of-revealed-preferences/" target="_blank">http://www.dbms2.com/2010/06/08/profile-of-revealed-preferences/</a> is non-relational and a very big market. Playing there with a  &#8220;heavyweight&#8221; DBMS is of course a challenge.</li>
<li>Coming over from Autodesk, Ken Bado hopefully knows more about the product  data management business than I do.</li>
</ul>
<p>It will be interesting to see what MarkLogic actually does.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/04/05/whither-marklogic/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>

