<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; Text</title>
	<atom:link href="http://www.dbms2.com/category/datatype/text-search/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Tue, 07 Feb 2012 06:49:30 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>Sumo Logic and UIs for text-oriented data</title>
		<link>http://www.dbms2.com/2012/02/06/sumo-logic-and-uis-for-text-oriented-data/</link>
		<comments>http://www.dbms2.com/2012/02/06/sumo-logic-and-uis-for-text-oriented-data/#comments</comments>
		<pubDate>Mon, 06 Feb 2012 13:27:06 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5897</guid>
		<description><![CDATA[I talked with the Sumo Logic folks for an hour Thursday. Highlights included: Sumo Logic does SaaS (Software as a Service) log management. Sumo Logic is text indexing/Lucene-based. Thus, it is reasonable to think of Sumo Logic as &#8220;Splunk-like&#8221;. (However, Sumo Logic seems to have a stricter security/trouble-shooting orientation than Splunk, which is trying to [...]]]></description>
			<content:encoded><![CDATA[<p>I talked with the Sumo Logic folks for an hour Thursday. Highlights included:</p>
<ul>
<li>Sumo Logic does SaaS (Software as a Service) log management.</li>
<li>Sumo Logic is text indexing/Lucene-based. Thus, it is reasonable to think of Sumo Logic as &#8220;Splunk-like&#8221;. (However, Sumo Logic seems to have a stricter security/trouble-shooting orientation than Splunk, which is trying to <a href="../../../../../2012/01/10/splunk-update/">branch out</a>.)</li>
<li>Sumo Logic has hacked Lucene for faster indexing, and says 10-30 second latencies are typical.</li>
<li>Sumo Logic&#8217;s main differentiation is <strong>automated classification of events. </strong></li>
<li>There&#8217;s some kind of streaming engine in the mix, to update counters and drive alerts.</li>
<li>Sumo Logic has around 30 &#8220;customers,&#8221; free (mainly) or paying (around 5) as the case may be.</li>
<li>A truly typical Sumo Logic customer has single to low double digits of gigabytes of log data per day. However, Sumo Logic seems highly confident in its ability to handle a terabyte per customer per day, give or take a factor of 2.</li>
<li>When I asked about the implications of shipping that much data to a remote data center, Sumo Logic observed that log data compresses really well.</li>
<li>Sumo Logic recently raised a bunch of venture capital.</li>
<li>Sumo Logic&#8217;s founders are out of ArcSight, a log management company HP paid a bunch of money for.</li>
<li>Sumo Logic coined a marketing term &#8220;LogReduce&#8221;, but it has nothing to do with &#8220;MapReduce&#8221;. Sumo Logic seems to find this amusing.</li>
</ul>
<p>What interests me about Sumo Logic is that automated classification story. I thought I heard Sumo Logic say:<span id="more-5897"></span></p>
<ul>
<li>It&#8217;s largely unsupervised machine learning.</li>
<li>It&#8217;s specific to a particular user/data set.</li>
<li>It can be up and running and classifying things effectively almost instantly (i.e., on seconds&#8217; or minutes&#8217; worth of data).</li>
<li>It&#8217;s informed by what different users tag as false positives. (Or maybe that is planned for future versions.)</li>
</ul>
<p><em>I have a little trouble seeing how all those points fit exactly together, so perhaps I got some details wrong.</em></p>
<p>The payoff is that <strong>machine learning directly informs the Sumo Logic user interface</strong>. In particular, large numbers of events are bundled into a small number of categories, hopefully making it much easier for network operations types to scan the UI and pick out what&#8217;s important.</p>
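<p>To picture that kind of bundling, here is a minimal sketch in Python. It is emphatically not Sumo Logic&#8217;s algorithm, which I haven&#8217;t seen; it only masks digit runs so that log lines differing in nothing but numbers share one signature. The names <code>signature</code> and <code>bundle</code> and the sample log lines are made up for illustration.</p>

```python
import re
from collections import Counter


def signature(line):
    # Replace every run of digits with a placeholder, so that lines
    # differing only in numbers collapse to the same signature.
    return re.sub(r"\d+", "#", line)


def bundle(lines):
    # Count how many raw lines fall under each signature.
    return Counter(signature(line) for line in lines)


# Hypothetical log lines, invented for this example.
sample = [
    "login failed for user 17",
    "login failed for user 42",
    "disk 3 at 91 percent",
]
for sig, count in bundle(sample).most_common():
    print(count, sig)
```

<p>Three raw lines become two categories here. A real system would presumably need something much smarter than digit masking, but the effect on the UI is the same in kind: fewer rows to scan.</p>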
<p>In general, the idea of machine-learning informing analytic UIs via some sort of classification is common in text-oriented technologies, notably in:</p>
<ul>
<li>Good ol&#8217; text search.</li>
<li>Text mining vendors&#8217; approaches to clustering hits on words or phrases that say substantially the same thing.</li>
</ul>
<p>But otherwise it seems kind of rare, if we stipulate that ad-serving/general internet personalization isn&#8217;t really an analytic UI &#8212; but I&#8217;d love to hear of any interesting examples I&#8217;ve overlooked.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/02/06/sumo-logic-and-uis-for-text-oriented-data/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Lessons from T-Mobile&#8217;s epic fail</title>
		<link>http://www.dbms2.com/2011/11/04/lessons-from-t-mobiles-epic-fail/</link>
		<comments>http://www.dbms2.com/2011/11/04/lessons-from-t-mobiles-epic-fail/#comments</comments>
		<pubDate>Fri, 04 Nov 2011 06:01:25 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5593</guid>
		<description><![CDATA[When my electric power came back on but my Verizon FiOS internet connection didn&#8217;t, it was time for a mobile hotspot/prepaid wireless internet service. T-Mobile&#8217;s 4G Mobile Hotspot/Prepaid Mobile Broadband offering seemed like a good choice. But the experience of setting it up was a nightmare, and a possibly instructive nightmare at that. T-Mobile&#8217;s instructions [...]]]></description>
			<content:encoded><![CDATA[<p>When my electric power came back on but my Verizon FiOS internet connection didn&#8217;t, it was time for a mobile hotspot/prepaid wireless internet service. T-Mobile&#8217;s 4G Mobile Hotspot/Prepaid Mobile Broadband offering seemed like a good choice. But the experience of setting it up was a nightmare, and a possibly instructive nightmare at that.</p>
<p>T-Mobile&#8217;s instructions tell you that you need to know the factory defaults for network name and password. That makes sense. They don&#8217;t tell you that you also need to know your SIM card number (included), IMEI number (included), and authorization number (not included).</p>
<p>That&#8217;s right &#8212; you need a number that T-Mobile doesn&#8217;t tell you you need. But the story gets a lot worse from there, because it&#8217;s almost impossible to get the number from them. I eventually talked with approximately 8 T-Mobile call center associates over the course of the evening before getting successfully connected.</p>
<p><span id="more-5593"></span><em>One of the few redeeming features in this story is that T-Mobile call center folks pick up the phone quickly. One of the many non-redeeming ones is that they efficiently give you inaccurate information after they do. The one who finally got me the right answer was a young woman who put me on hold to call an internal resource approximately four times before finally handling the situation correctly.</em></p>
<p>At one point I found somewhat helpful information by searching on T-Mobile&#8217;s website for <em>mobile hotspot activation.</em> However, the same information did not surface on my earlier searches on strings like <em>activate mobile hotspot.</em> Stemming has been a basic feature of search engines since the mid 1990s, but evidently T-Mobile&#8217;s technological choices aren&#8217;t as current as that. Other inexcusable T-Mobile mistakes include:</p>
<ul>
<li>Not providing information about the need for an &#8220;activation number&#8221; in the product&#8217;s paper documentation.</li>
<li>Taking the buyer to a sign-on screen that doesn&#8217;t lead to the call center reps responsible for the product being signed on for.</li>
<li>Not providing call center operators with the tools they need to get callers to the right place.</li>
</ul>
<p>Perhaps Elbonian* contractors were involved.</p>
<p><em>*Elbonia is a fictitious country of outsourcers in Scott Adams&#8217; </em>Dilbert<em> comic strip. <a href="http://dilbert.com/strips/comic/2006-03-25/">Elbonian work is not noted for its high quality</a>. </em></p>
<p>In case you haven&#8217;t guessed yet, the missing T-Mobile &#8220;activation number&#8221; was tantamount to a telephone number, complete with area code. <strong>T-Mobile was insisting on assigning a telephone number for a service that had nothing to do with making or receiving telephone calls.</strong> While I can believe there was some legitimate database/application design reason for having such inflexibility under the covers, it&#8217;s hard to see why T-Mobile didn&#8217;t get a composite application tool and hack a front-end that automatically generates the number without call-center intervention.</p>
<p>I did eventually get connected, and in my limited experience with T-Mobile’s Prepaid/Mobile Hotspot 4G “Broadband” offering, I get the impression it has good speed but conventionally flaky WiFi reliability. It may well be a T-Mobile service that is worthy of great success. <em>Edit: That&#8217;s not true.*</em> But it&#8217;s not going to experience such success as long as T-Mobile idiotically infuriates its users at the start of the relationship.</p>
<p><em>*Edit: It turns out that the T-Mobile Mobile Hotspot device has terrible range. We can use it to get online in my office, Linda&#8217;s office, or the living room/dining room area, but no 2 of the 3 at once.</em></p>
<p>My takeaways from this story include:</p>
<ul>
<li>Use competent documentation writers.</li>
<li>Run usability testing on your entire customer-experience process.</li>
<li>Test your site search engine for usefulness.</li>
</ul>
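<p>As for testing site search: the stemming complaint earlier can be turned into a check you could run against your own engine. Below is a minimal sketch in Python, assuming a deliberately crude suffix-stripping stemmer of my own invention. It is not the Porter algorithm or any particular product&#8217;s behavior, and the suffix list and function names are hypothetical.</p>

```python
# Toy suffix list, ordered so that longer suffixes are tried first.
# A real search engine would use a proper stemmer instead.
SUFFIXES = ("ation", "ating", "ated", "ates", "ate", "ing", "s")


def stem(word):
    # Strip the first matching suffix, provided at least
    # three characters of the word remain afterwards.
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stems(query):
    # A query is reduced to the set of its word stems, ignoring order.
    return {stem(word) for word in query.split()}


# The two searches from the story reduce to the same stems.
print(stems("activate mobile hotspot") == stems("mobile hotspot activation"))
```

<p>If two phrasings of the same question reduce to the same stems but your site search returns different results for them, that is worth a bug report.</p>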
<p>Beyond that, there&#8217;s not a single part of this story for which there isn&#8217;t a straightforward fix, most of them alluded to above.</p>
<p>If you see anything of your organization in this story, it&#8217;s probably time to raise your standards. This is obviously a different kind of failure than, say, the one last year at <a href="http://www.dbms2.com/category/users/jpmorgan-chase/">Chase</a>. Even so, &#8220;as awful as T-Mobile&#8221; would be a sad state to endure.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/04/lessons-from-t-mobiles-epic-fail/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MarkLogic 5, and why you might care</title>
		<link>http://www.dbms2.com/2011/11/01/marklogic-version-5/</link>
		<comments>http://www.dbms2.com/2011/11/01/marklogic-version-5/#comments</comments>
		<pubDate>Tue, 01 Nov 2011 04:03:59 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Structured documents]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5560</guid>
		<description><![CDATA[MarkLogic is releasing MarkLogic 5. Key elements of the announcement are: More-of-the-same in line with MarkLogic’s core positioning. A new bi-directional Hadoop connector. A free MarkLogic Express edition, limited in license terms more than in actual features, as per Slide 27 of the deck MarkLogic graciously supplied for me to post. Also, MarkLogic is early [...]]]></description>
			<content:encoded><![CDATA[<p>MarkLogic is releasing MarkLogic 5. Key elements of the announcement are:</p>
<ul>
<li>More-of-the-same in line with MarkLogic’s core positioning.</li>
<li>A new bi-directional Hadoop connector.</li>
<li>A free MarkLogic Express edition, limited in license terms more than in actual features, as per Slide 27 of <a href="http://www.monash.com/uploads/MarkLogic-5-Deck.pptx">the deck MarkLogic graciously supplied for me to post</a>.</li>
</ul>
<p>Also, MarkLogic is early with a feature that most serious DBMS vendors will soon have – support for tiered storage, with writes going first to solid-state storage, then being flushed to disk via a caching-style algorithm.* And as befits a sometime search-engine-substitute, MarkLogic has finally licensed a large set of document filters, from an Australian company called <a href="http://www.isys-search.com/index.html">Isys</a>. Apparently, the special virtue of the Isys filters is that they’re good at extracting not only text, but metadata as well.</p>
<p><em>*If there’s a caching algorithm that doesn’t contain a major element of LRU (Least Recently Used), I don’t recall ever hearing about it.</em></p>
<p>MarkLogic seems to have settled on a positioning that, although distressingly buzzword-heavy, is at least partly based upon reality. The real part includes:</p>
<ul>
<li>MarkLogic is a serious, enterprise-class DBMS (see for example Slide 12 of <a href="http://www.monash.com/uploads/MarkLogic-5-Deck.pptx">the MarkLogic deck</a>) …</li>
<li>… which has been optimized from the getgo for <a href="../../../../../2011/05/17/poly-structured-database/">poly-structured data</a>.</li>
<li>MarkLogic can and does scale out to handle large amounts of data.</li>
<li>MarkLogic is a general-purpose DBMS, suitable for <a href="../../../../../2011/03/30/short-request-and-analytic-processing/">both short-request and analytic tasks</a>.</li>
<li>MarkLogic is particularly well suited for analyses with long chains of “progressive enhancement” (MarkLogic’s favorite term when talking about <a href="../../../../../2011/05/30/another-category-of-derived-data/">derived data</a>).</li>
<li><a href="http://blogs.avalonconsult.com/blog/search/is-marklogic-a-search-engine/">MarkLogic often plays the role of a content assembler and/or search engine</a>, and the people who use MarkLogic in those ways are commonly doing things that can be described as research and analysis.</li>
</ul>
<p>Based on that reality, MarkLogic talks a lot about Volume, Velocity, Variety, Big Data, unstructured data, semi-structured data, and big data analytics.</p>
<p><span id="more-5560"></span><em>My <a href="../../../../../2010/11/29/marklogic-and-its-document-dbms/">November, 2010 overview of MarkLogic technology</a> remains pretty relevant. One correction, however: Node heterogeneity configurations, in which “data” and “evaluation” nodes reside on separate servers, are the exception rather than the rule.</em></p>
<p>Like <a href="../../../../../2011/10/18/vertica-community-edition/">Vertica</a>, MarkLogic has laudably said that true academic researchers can get MarkLogic for free without the severe license restrictions. Free MarkLogic should be of particular interest to researchers who:</p>
<ul>
<li>Are studying natural networks or graphs, such as social networks or biological pathways. (This might be a fit in the social or biological sciences.)</li>
<li>Are managing metadata for, say, a variety of disparate kinds of experimental files. (This might be a fit anywhere in the natural sciences.)</li>
<li>Are managing actual documents, images, videos, etc., or data about such things. (This might be a fit in the humanities or social sciences.)</li>
</ul>
<p>MarkLogic provided some disclosable financial substance by email, which I shall quote verbatim:</p>
<ul>
<li><em>MarkLogic has 45% revenue growth and 55-60% license growth year over year.</em></li>
<li><em>We expect to finish this year with over $85 million in revenue, up from $55 million last year.</em></li>
</ul>
<p>Arithmetical purists might note that 85/55 is more than 145%, but I’m just going to settle for the information I got and move on.</p>
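<p>For the record, the arithmetic can be checked in a couple of lines of Python. The function name is mine, and since the quoted figure is “over $85 million”, the result is a lower bound on the implied growth rate.</p>

```python
def growth_percent(previous, current):
    # Year-over-year growth as a percentage of the previous year's figure.
    return (current - previous) / previous * 100


# Revenue figures in $ millions, taken from the email quoted above.
implied = growth_percent(55, 85)
print(round(implied, 1))
```

<p>That works out to roughly 54.5%, versus the 45% stated in the same email.</p>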
<p><em>Edit: I posted separately about the <a href="http://www.dbms2.com/2011/11/03/marklogic-hadoop-connector/">MarkLogic Hadoop connector.</a></em> <span style="text-decoration: line-through;">As for that Hadoop connector – stay tuned for a short follow-up post, as writing about it now would not be convenient. (My backup discipline isn’t what it should be, and the only copy of my notes about that product is on a heavy tower computer in a house that doesn’t have working power.)</span></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/01/marklogic-version-5/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Text data management, Part 3: Analytic and progressively enhanced</title>
		<link>http://www.dbms2.com/2011/10/10/text-data-management-part-3-analytic-and-progressively-enhanced/</link>
		<comments>http://www.dbms2.com/2011/10/10/text-data-management-part-3-analytic-and-progressively-enhanced/#comments</comments>
		<pubDate>Tue, 11 Oct 2011 01:59:17 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5420</guid>
		<description><![CDATA[This is Part 3 of a three post series. The posts cover: Confusion about text data management. Choices for text data management (general and short-request). Choices for text data management (analytic). I&#8217;ve gone on for two long posts about text data management already, but even so I&#8217;ve glossed over a major point: Using text data [...]]]></description>
			<content:encoded><![CDATA[<p><em>This is Part 3 of a three post series. The posts cover:</em></p>
<ol>
<li><em><a href="../2011/10/10/text-data-management-confusion/">Confusion about text data management</a>.</em></li>
<li><em><a href="../2011/10/10/text-data-management-part-2-general-and-short-request/">Choices for text data management (general and short-request)</a>.</em></li>
<li><em><a href="../2011/10/10/text-data-management-part-3-analytic-and-progressively-enhanced/">Choices for text data management (analytic)</a>.</em></li>
</ol>
<p>I&#8217;ve gone on for two long posts about text data management already, but even so I&#8217;ve glossed over a major point:</p>
<p><strong>Using text data commonly involves a long series of data enhancement steps.</strong></p>
<p>Even before you do what we&#8217;d normally think of as &#8220;analysis&#8221;, text markup can include steps such as:</p>
<ul>
<li>Figure out where the words break.</li>
<li>Figure out where the clauses and sentences break.</li>
<li>Figure out where the paragraphs, sections, and chapters break.</li>
<li>(Where necessary) map the words to similar ones &#8212; spelling correction, stemming, etc.</li>
<li>Figure out which words are grammatically which parts of speech.</li>
<li>Figure out which pronouns and so on refer to which other words. (Technical term: Anaphora resolution.)</li>
<li>Figure out what was being said, one clause at a time.</li>
<li>Figure out the emotion &#8212; or &#8220;sentiment&#8221; &#8212; associated with it.</li>
</ul>
<p>Those processes can add up to dozens of steps. And maybe, six months down the road, you&#8217;ll think of more steps yet.</p>
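<p>To make the shape of such a chain concrete, here is a minimal sketch in Python. It assumes each enhancement step takes a record (a plain dict) and adds one more derived field to it. The step functions, the tiny word lists, and the field names are all hypothetical; real text analytics products do each of these steps far more carefully.</p>

```python
import re

# Hypothetical sentiment word lists, for illustration only.
POSITIVE = {"good", "great", "fast"}
NEGATIVE = {"bad", "terrible", "flaky"}


def split_sentences(record):
    # Crude sentence breaking on terminal punctuation.
    parts = re.split(r"[.!?]+", record["text"])
    record["sentences"] = [p.strip() for p in parts if p.strip()]
    return record


def split_words(record):
    # Crude word breaking: lowercase runs of letters and apostrophes.
    record["words"] = [re.findall(r"[a-z']+", s.lower()) for s in record["sentences"]]
    return record


def score_sentiment(record):
    # One score per sentence: positive word count minus negative word count.
    record["sentiment"] = [
        sum(w in POSITIVE for w in ws) - sum(w in NEGATIVE for w in ws)
        for ws in record["words"]
    ]
    return record


# Adding a step six months later means appending to this list;
# the record simply grows another field.
PIPELINE = [split_sentences, split_words, score_sentiment]


def enhance(text, steps=PIPELINE):
    record = {"text": text}
    for step in steps:
        record = step(record)
    return record


print(enhance("The speed is good. The WiFi is flaky!"))
```

<p>Note that nothing declares the record&#8217;s fields up front. Each new step adds whatever it derives, which is the practical reason the schema question comes up at all.</p>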
<p><span id="more-5420"></span>So when you manage text, it is convenient to assume <a href="../../../../../2011/07/31/dynamic-fixed-schema-databases/">dynamic schemas</a>. That would be an argument for using MarkLogic, NoSQL document stores, and/or Hadoop, rather than strictly relational systems.</p>
<p>That said, text analytics can be done perfectly well in relational databases. Again, I point you to the example of <a href="../../../../../2011/04/14/attensity-update/">Attensity</a>, which will extract for you a large fraction of the information that can be gotten out of the text, put it into a convenient relational schema, and let you get to work. Once the principal extraction has been done, there&#8217;s no reason why your <a href="../../../../../2011/09/06/derived-data-progressive-enhancement-and-schema-evolution/">derived data</a> issues need be any more complex than others you deal with relationally, especially on the analytic side of the house.</p>
<p>But what if you want to do your own text enhancement, rather than using a third party tool? The first thing to ask yourself is &#8212; why? With all due respect to the 10-20 internet-centric companies that are having fun reinventing large portions of the data processing wheel &#8212; if you&#8217;re not at one of those companies, you should probably be trying to use as much third-party software as you possibly can.</p>
<p>I can think of a couple of cases where rolling your own technology makes sense, namely:</p>
<ul>
<li>The hard part of what you&#8217;re doing is extracting snippets of text from some data format proprietary to you.</li>
<li>You&#8217;re trying to do very simple things across a variety of languages much broader than the 10-20 that the text analytics vendors currently do a halfway decent job of handling.</li>
</ul>
<p>I can&#8217;t think of many others.</p>
<p>One thing I&#8217;d definitely be wary of is using Hadoop as a <a href="../../../../../2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a> for individual documents in a variety of formats. I don&#8217;t know what you&#8217;d do with them once they&#8217;re there. Yes, Google invented MapReduce in part to do things like document indexing &#8212; but you&#8217;d probably prefer not to reinvent the Google stack. That&#8217;s quite apart from questions as to whether your document count exceeds Hadoop&#8217;s comfortable <a href="../../../../../2011/08/21/hadoop-evolution/">file-count limit</a>. Solr is a different matter; but while Solr and Hadoop are both open source projects that can be traced back to Doug Cutting, otherwise they&#8217;re rather different things.</p>
<p>A useful way of looking at your choices may be to ask:</p>
<p><strong>After text has run through the main pipeline of manipulation and information extraction:</strong></p>
<ul>
<li><strong>What will the output look like?</strong></li>
<li><strong>Where do I want that output to end up?</strong></li>
</ul>
<p>If the output has to be something that fits into a structured/relational analytic system, then it should probably go into a relational DBMS. If you&#8217;re going to do social network analysis of the sort you&#8217;d ideally like to do in a graph database &#8212; well, unless you&#8217;re an intelligence agency with blank-check resources, you&#8217;ll probably still end up opting for a relational DBMS. If the output consists of simple, homogeneous text files, plus a few fields of metadata, and you&#8217;re not going to do much analysis of it, it can pretty much go anywhere; either SQL or NoSQL might suit your purposes. If you want maximum power and flexibility, MarkLogic may be the ideal destination.</p>
<p>From there, the next question is:</p>
<ul>
<li><strong>What pipeline should the text run through to get to its final destination?</strong></li>
</ul>
<p>Often, as I&#8217;ve argued, the right answer is a third-party text analytic system. Those can generally consume text in almost any kind of file format. Other times &#8212; less often than you may think &#8212; it&#8217;s Hadoop. OK, then pass it through Hadoop. Other possibilities could come up as well (text search engines aren&#8217;t really as useless as I may have seemed to be suggesting).</p>
<p>Anyhow, when you&#8217;ve established where text starts out (that&#8217;s usually a given), what it passes through (please see above), and where its best parts need to end up (ditto), you&#8217;ve done the hardest parts. Figuring out the rest of your text management architecture should be relatively easy by comparison.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/10/10/text-data-management-part-3-analytic-and-progressively-enhanced/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Text data management, Part 2: General and short-request</title>
		<link>http://www.dbms2.com/2011/10/10/text-data-management-part-2-general-and-short-request/</link>
		<comments>http://www.dbms2.com/2011/10/10/text-data-management-part-2-general-and-short-request/#comments</comments>
		<pubDate>Tue, 11 Oct 2011 01:58:49 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5419</guid>
		<description><![CDATA[This is Part 2 of a three post series. The posts cover: Confusion about text data management. Choices for text data management (general and short-request). Choices for text data management (analytic). I&#8217;ve recently given widely varied advice about managing text (and similar files &#8212; images and so on), ranging from Sure, just keep going with [...]]]></description>
			<content:encoded><![CDATA[<p><em>This is Part 2 of a three post series. The posts cover:</em></p>
<ol>
<li> <em><a href="../2011/10/10/text-data-management-confusion/">Confusion about text data management</a>.</em></li>
<li><em><a href="../2011/10/10/text-data-management-part-2-general-and-short-request/">Choices for text data management (general and short-request)</a>.</em></li>
<li><em><a href="../2011/10/10/text-data-management-part-3-analytic-and-progressively-enhanced/">Choices for text data management (analytic)</a>.</em></li>
</ol>
<p>I&#8217;ve recently given widely varied advice about managing text (and similar files &#8212; images and so on), ranging from</p>
<blockquote><p>Sure, just keep going with your old strategy of keeping .PDFs in the file system and pointing to them from the relational database. That&#8217;s an easy performance optimization vs. having the RDBMS manage them as BLOBs.</p></blockquote>
<p>to</p>
<blockquote><p>I suspect MongoDB isn&#8217;t heavyweight enough for your document management needs, let alone just dumping everything into Hadoop. Why don&#8217;t you take a look at MarkLogic?</p></blockquote>
<p>Here are some reasons why.</p>
<p>There are three basic kinds of text management use case:</p>
<ul>
<li><strong>Text as payload. </strong></li>
<li><strong>Text as search parameter.</strong></li>
<li><strong>Text as analytic input.</strong></li>
</ul>
<p><span id="more-5419"></span>The simplest way to manage text electronically is to:</p>
<ul>
<li>Store it as whole documents &#8212; scanned images, .PDFs, word processing files, whatever.</li>
<li>Find it only via fielded metadata, perhaps manually created &#8212; title, author, date, and so on.</li>
</ul>
<p>For example, an application for college admission is accompanied by recommendation letters, transcripts, and so on; those are then moved around as dumb payloads, until such time as an admissions officer reads them. Most relational database management systems can manage BLOBs (Binary Large OBjects), but performance may be better if you leave the big objects outside the relational system. For text-as-payload, that way of managing documents can often suffice.</p>
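<p>Here is a minimal sketch of that files-outside, metadata-inside pattern, using Python&#8217;s built-in <code>sqlite3</code> module as a stand-in for whatever relational DBMS you actually run. The table layout, the author name, and the file path are hypothetical.</p>

```python
import sqlite3

# An in-memory database stands in for the real relational system.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "id INTEGER PRIMARY KEY, title TEXT, author TEXT, file_path TEXT)"
)

# Only fielded metadata and a pointer go into the database;
# the document itself stays in the file system (hypothetical path).
conn.execute(
    "INSERT INTO documents (title, author, file_path) VALUES (?, ?, ?)",
    ("Recommendation letter", "A. Teacher", "/docs/2011/letter-0001.pdf"),
)
conn.commit()


def find_paths(connection, author):
    # Find documents purely via fielded metadata; return their locations.
    rows = connection.execute(
        "SELECT file_path FROM documents WHERE author = ?", (author,)
    )
    return [row[0] for row in rows]


print(find_paths(conn, "A. Teacher"))
```

<p>The database never touches the bytes of the document, which is where the performance benefit comes from. The cost is that keeping pointers and files in sync becomes your problem.</p>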
<p>In other cases, the text may be so short that it naturally fits into a character field in a relational database. This is particularly likely when the text is typed in at the time of record creation, e.g. by a call center operator, a doctor, or a customer entering a support ticket. In such cases, leaving it under the management of an RDBMS makes perfect sense.</p>
<p>In many situations you actually want to search based on the content of the text. Unless you&#8217;re doing simple search on short text snippets in relational character fields, that generally calls for some kind of <strong>text index.</strong> Text indexes are generally found in text search engines. That said, however:</p>
<ul>
<li>Anything that can be done in a standalone text search engine can in principle also be integrated into relational DBMS and other data stores. (How well that works in practice is another matter.)</li>
<li>Text search engines commonly index data <em>in situ</em> that they don&#8217;t actually manage.</li>
</ul>
<p>In theory, then:</p>
<ul>
<li>You can store your text in your chosen DBMS or outside it, as pleases you.</li>
<li>Your chosen DBMS can have a text search capability.</li>
<li>This text search capability can be integrated with the query method used to get at the rest of the data managed by that DBMS.</li>
</ul>
<p>That theory is commonly reflected in actual products, such as Oracle and DB2.</p>
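<p>For readers who haven&#8217;t looked inside one, the core of a text index is an inverted index: a map from each word to the documents containing it. Below is a minimal sketch in Python. The function names and sample documents are made up, and real engines add stemming, ranking, positional data, and much else.</p>

```python
from collections import defaultdict


def build_index(docs):
    # Map each lowercase word to the set of document ids containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    # Return ids of documents that contain every word in the query.
    postings = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*postings) if postings else set()


# Hypothetical documents, keyed by id. The index only needs the ids,
# so the documents themselves can live anywhere.
docs = {
    1: "mobile hotspot activation",
    2: "prepaid mobile broadband",
}
index = build_index(docs)
print(search(index, "mobile hotspot"))
```

<p>Because the index stores only identifiers, it is indifferent to whether the text sits in a DBMS, a file system, or somewhere else. That is what makes the store-it-where-you-please theory workable.</p>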
<p>As so often, then, the choice of how to manage text comes down to issues such as performance, programming ease, and other components of total cost of ownership (or of some other general &#8220;goodness&#8221; metric, such as time-to-value). As a general rule, it seems:</p>
<ul>
<li>Text indexing inside relational DBMS has poorer performance than in, say, text search engines, often drastically so.</li>
<li>BLOB management inside relational databases has poorer performance than leaving the files outside the DBMS&#8217; purview.</li>
<li>Relational DBMS do just fine at managing text strings up to, say, 2048 characters long.</li>
<li>Tight integration between text search and SQL is valuable in a few applications, but irrelevant to many others.</li>
</ul>
<p>And so, to a first approximation:</p>
<ul>
<li><strong>If you just have short text snippets</strong>, it can make sense to <strong>leave them in your relational database.</strong></li>
<li><strong>If performance is not an issue,</strong> you can just leave your BLOBs in your relational database too.</li>
<li><strong>If performance is an issue,</strong> you probably want to have <strong>your larger text files outside your RDBMS&#8217; control.</strong></li>
</ul>
<p>However, that phrasing assumes the default option is a relational DBMS, which may not be the case at all. Other choices include:</p>
<ul>
<li><strong>Standalone text search engines.</strong> If you want the best available text search, get a search engine. But attempts to start with a search engine and wind up with an application platform have generally run into difficulty.</li>
<li><strong>Document-oriented (or other) NoSQL </strong>systems. The story here is surprisingly like that for relational DBMS. I&#8217;ve previously noted that <a href="../../../../../2011/02/07/notes-on-document-oriented-nosql/">document-oriented NoSQL systems manage objects, not &#8220;documents&#8221; in the ordinary sense of the word</a>. Even so, conceptually they&#8217;re no less suited for management of true documents than relational DBMS are. I&#8217;d guess that the correlation between use cases involving true documents and use cases where document-oriented NoSQL is suitable is positive, but not very strong.</li>
<li><strong>MarkLogic.</strong> From one standpoint, MarkLogic is just a heavier-weight version of document-oriented NoSQL. But MarkLogic&#8217;s XML (and XQuery) orientation, tuned-for-years indexing, and built-in search engine put it on a different level for document management than the upstarts have reached.</li>
</ul>
<p>I&#8217;ll cover the Hadoop option in the next, more analytically-focused post.</p>
<p>I hope I&#8217;ve demonstrated that there are appropriate use cases for each of:</p>
<ul>
<li>Letting documents be managed by the file system (and pointing to them from your preferred DBMS).</li>
<li>Sticking documents straight into your preferred DBMS (SQL or non-SQL as the case may be).</li>
<li>Using a specialty system such as MarkLogic (or of course, in some cases, an enterprise search engine).</li>
</ul>
<p>And that&#8217;s even before we move on to<em> analytically-oriented text data management.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/10/10/text-data-management-part-2-general-and-short-request/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Text data management, Part 1: Confusion</title>
		<link>http://www.dbms2.com/2011/10/10/text-data-management-confusion/</link>
		<comments>http://www.dbms2.com/2011/10/10/text-data-management-confusion/#comments</comments>
		<pubDate>Tue, 11 Oct 2011 01:58:03 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Archiving and information preservation]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5421</guid>
		<description><![CDATA[This is Part 1 of a three-post series. The posts cover: Confusion about text data management. Choices for text data management (general and short-request). Choices for text data management (analytic). There&#8217;s much confusion about the management of text data, among technology users, vendors, and investors alike. Reasons seem to include: The terminology around text [...]]]></description>
			<content:encoded><![CDATA[<p><em>This is Part 1 of a three-post series. The posts cover:</em></p>
<ol>
<li> <em><a href="http://www.dbms2.com/2011/10/10/text-data-management-confusion/">Confusion about text data management</a>.</em></li>
<li><em><a href="http://www.dbms2.com/2011/10/10/text-data-management-part-2-general-and-short-request/">Choices for text data management (general and short-request)</a>.</em></li>
<li><em><a href="http://www.dbms2.com/2011/10/10/text-data-management-part-3-analytic-and-progressively-enhanced/">Choices for text data management (analytic)</a>.</em></li>
</ol>
<p>There&#8217;s much confusion about the management of text data, among technology users, vendors, and investors alike. Reasons seem to include:</p>
<ul>
<li>The terminology around text data is inaccurate.</li>
<li>Data volume estimates for text are misleading.</li>
<li>Multiple different technologies are in the mix, including:
<ul>
<li>Enterprise text search.</li>
<li>Text analytics &#8212; <a href="http://www.texttechnologies.com/category/software-as-a-service-saas/category/text-mining/">text mining</a>, sentiment analysis, etc.</li>
<li>Document stores &#8212; e.g. document-oriented NoSQL, or MarkLogic.</li>
<li>Log management and parsing &#8212; e.g. Splunk.</li>
<li>Text archiving &#8212; e.g., various specialty email archiving products I couldn&#8217;t even name.</li>
<li>Public web search &#8212; Google et al.</li>
</ul>
</li>
<li>Text search vendors have disappointed, especially technically.</li>
<li>Text analytics vendors have disappointed, especially financially.</li>
<li>Other analytic technology vendors ignore <a href="http://www.texttechnologies.com/2010/12/01/state-of-the-art-text-analytics-mining-applications/">what the text analytics vendors actually have accomplished</a>, and reinvent inferior wheels rather than OEM the state of the art.</li>
</ul>
<p>Above all: <a href="http://www.dbms2.com/2011/10/10/text-data-management-part-2-general-and-short-request/">The use cases for text data vary greatly</a>, just as the use cases for simply-structured databases do.</p>
<p>There are probably fewer people now than there were six years ago who need to be told that <a href="http://www.dbms2.com/2005/12/09/relational-dbms-versus-text-data/">text and relational database management are very different things</a>. Other misconceptions, however, appear to be on the rise. Specific points that are commonly overlooked include: <span id="more-5421"></span></p>
<ul>
<li><strong> The terms &#8220;unstructured&#8221; or &#8220;semi-structured&#8221; data are inherently misleading</strong>. That&#8217;s why <a href="../../../../../2011/05/17/poly-structured-database/">I favor &#8220;multi-structured&#8221; or &#8220;poly-structured&#8221; instead</a>. (&#8220;Multi-structured&#8221; seems to be winning; e.g., it&#8217;s been adopted by Teradata and Teradata/Aster.)</li>
<li>The &#8220;social media&#8221; text data any one enterprise brings in house isn&#8217;t all that much. For example, <a href="../../../../../2011/04/14/attensity-update/">Attensity serves many different enterprises&#8217; social media needs from a single 20-terabyte data store</a>, and reports that no single enterprise has required as much as 1 terabyte of text yet. <strong>Text data may consume a lot of storage </strong>on spinning disks somewhere,<strong> but it&#8217;s not that big a factor in future DBMS industry growth.</strong> (That 20 terabyte figure does seem low.)</li>
<li><strong>Structured databases are typically worth a lot more per bit than other kinds.</strong> The most valuable electronic data, per-bit, is probably records of significant economic transactions &#8212; purchases, sales, money transfers, etc. The least valuable may be sensor log files, whose contents consist mainly of &#8220;Nothing going on here; ping you again in a minute.&#8221; Email logs, web interaction data and many other kinds fall somewhere in between. Highly valuable documents &#8212; such as signed contracts &#8212; generally persist in paper as well as electronic forms. <strong>Investors commonly overlook this point.</strong></li>
<li><strong>The enterprise text search industry is screwed up.</strong>
<ul>
<li>FAST was a goofy company before it was acquired for far too much money by Microsoft.</li>
<li>Autonomy was a goofy company before it was acquired for far too much money by HP.</li>
<li>Google&#8217;s enterprise efforts are quiet.</li>
<li>The integration of text search and relational DBMS &#8212; e.g. at Oracle &#8212; has languished, with poor performance and evident lack of management attention.</li>
<li>Smaller text search vendors don&#8217;t seem to be getting a lot of traction &#8212; e.g., <a href="http://www.texttechnologies.com/category/vendors/coveo/">Coveo</a> has a decent reputation, but when&#8217;s the last time you heard much about them? What has Attivio actually accomplished?</li>
</ul>
</li>
<li><strong>Text analytics is a small business</strong>. Add up the revenue for Attensity, Clarabridge, Lexalytics, Temis, and all the others, and you might poke above $100 million, especially now that Attensity has had a three-way merger. Then again, you might not.</li>
<li>Even so, <strong>the text analytics vendors have developed sophisticated technology.</strong> In particular, you can use it to get a pretty good idea as to what people are writing about you, individually or as groups.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/10/10/text-data-management-confusion/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Derived data, progressive enhancement, and schema evolution</title>
		<link>http://www.dbms2.com/2011/09/06/derived-data-progressive-enhancement-and-schema-evolution/</link>
		<comments>http://www.dbms2.com/2011/09/06/derived-data-progressive-enhancement-and-schema-evolution/#comments</comments>
		<pubDate>Tue, 06 Sep 2011 08:10:23 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5177</guid>
		<description><![CDATA[The emphasis I&#8217;m putting on derived data is leading to a variety of questions, especially about how to tease apart several related concepts: Derived data. Many-step processes to produce derived data. Schema evolution. Temporary data constructs. So let&#8217;s dive in.  When I started my discussion of derived data, I focused on five kinds: Aggregates, when [...]]]></description>
			<content:encoded><![CDATA[<p>The emphasis I&#8217;m putting on derived data is leading to a variety of questions, especially about how to tease apart several related concepts:</p>
<ul>
<li>Derived data.</li>
<li>Many-step processes to produce derived data.</li>
<li>Schema evolution.</li>
<li>Temporary data constructs.</li>
</ul>
<p>So let&#8217;s dive in.  <span id="more-5177"></span></p>
<p>When I started <a href="../../../../../2010/11/29/data-that-is-derived-augmented-enhanced-adjusted-or-cooked/">my discussion of derived data</a>, I focused on five kinds:</p>
<blockquote>
<ul>
<li>Aggregates, when they are maintained, generally for reasons of performance or response time.</li>
<li>Calculated scores, commonly based on data mining/predictive analytics.</li>
<li>Text analytics.</li>
<li>The kinds of ETL (Extract/Transform/Load) Hadoop and other forms of MapReduce are commonly used for.</li>
<li>Adjusted data, especially in scientific contexts.</li>
</ul>
</blockquote>
<p>Later I added a sixth kind:</p>
<ul>
<li><a href="../2011/05/30/another-category-of-derived-data/">Derived metadata</a>, commonly for polystructured data sets (logs, text, images, video, whatever).</li>
</ul>
<p>More kinds may yet follow.</p>
<p>In all cases, I was (and am) talking about data that is actually persisted into the database. Temporary tables &#8212; for example the kind frequently created by MicroStrategy &#8212; are also important in data processing, as is <a href="../../../../../2010/08/16/vertica-flash-temp-space/">temp space managed solely for the convenience of the DBMS</a>. But neither is what I mean when I talk about &#8220;derived data.&#8221;</p>
<p>As I noted back in June, <a href="../../../../../2011/06/19/investigative-analytics-derived-data/">derived data naturally leads to schema evolution</a>. You load data into an analytic database. You do some analysis and get some interesting results &#8212; interesting enough for you to want to keep them persistently. So you extend the schema to include them. You do more research; you discover something else interesting; you extend the schema again. Repeat as needed.</p>
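<p>A minimal sketch of one turn of that loop, using Python&#8217;s built-in <code>sqlite3</code> module as a stand-in for an analytic DBMS. The <code>customers</code> table, the spend thresholds, and the <code>spend_bucket</code> column are invented for illustration; a real derived score would come from an actual model.</p>

```python
import sqlite3

# Load some raw data into an (in-memory) analytic database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, monthly_spend REAL)")
conn.executemany(
    "INSERT INTO customers (id, monthly_spend) VALUES (?, ?)",
    [(1, 20.0), (2, 75.0), (3, 140.0)],
)


def spend_bucket(spend):
    """Toy analysis step; the thresholds are assumptions, not a real model."""
    return "high" if spend >= 100 else "medium" if spend >= 50 else "low"


# The result is interesting enough to keep, so extend the schema ...
conn.execute("ALTER TABLE customers ADD COLUMN spend_bucket TEXT")

# ... and persist the derived values alongside the raw ones.
rows = conn.execute("SELECT id, monthly_spend FROM customers").fetchall()
conn.executemany(
    "UPDATE customers SET spend_bucket = ? WHERE id = ?",
    [(spend_bucket(spend), cid) for cid, spend in rows],
)

# Inspect the evolved schema and the stored derived data.
columns = [row[1] for row in conn.execute("PRAGMA table_info(customers)")]
buckets = dict(conn.execute("SELECT id, spend_bucket FROM customers"))
```

<p>Each further round of research would add another <code>ALTER TABLE</code> of this kind, which is all &#8220;repeat as needed&#8221; amounts to in the simplest case.</p>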
<p>However, in no way is derived data the only source of analytic schema evolution. Duh. Sometimes you just have new kinds of information coming in. Of course, once it&#8217;s there, you may want to derive something from it. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  In <a href="../../../../../2010/06/08/profile-of-revealed-preferences/">marketing</a> contexts, both parts of that might be true in spades.</p>
<p>When I mentioned all this to my clients at MarkLogic &#8212; which was my inspiration for the polystructured/metadata example &#8212; they perked up and said &#8220;Oh! Progressive enhancement.&#8221; It&#8217;s long been the case that a simple text processing pipeline could have more than 15 steps of extraction; indeed, <a href="http://www.texttechnologies.com/2005/10/19/linkage-among-different-text-technologies/">I learned about the &#8220;tokenization chain&#8221; in 1997</a>. If all the &#8220;progression&#8221; in the data enhancement occurs in a single processing run, that wouldn&#8217;t necessarily spawn much schema evolution. But if you think of additional steps to add every now and then, your schema might well evolve over time.</p>
<p>Somewhat similarly, <a href="../../../../../2009/10/10/enterprises-using-hadoo/">Hadoop can be used to run &#8220;aggregation pipelines&#8221; of many tens of steps</a>. The output of the whole thing might be a relatively small number of fields. But again, if the number or nature of the fields changes over time, schemas will need to evolve accordingly.</p>
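<p>To make the pipeline point concrete, here is a toy sketch in which each step adds one derived field to a document. The step names, the tiny word list, and the sample sentence are all hypothetical; real extraction chains are far longer. Running the same steps in one pass leaves the output fields unchanged, while adding a step later changes them.</p>

```python
import re


def tokenize(doc):
    """Step 1: split the raw text into lowercase word tokens."""
    doc["tokens"] = re.findall(r"[a-z']+", doc["text"].lower())
    return doc


def count_tokens(doc):
    """Step 2: derive a simple size measure from the tokens."""
    doc["token_count"] = len(doc["tokens"])
    return doc


# Toy lexicon, assumed purely for illustration.
NEGATIVE_WORDS = {"slow", "broken", "refund"}


def flag_complaint(doc):
    """A step thought of later: derive a crude complaint flag."""
    doc["is_complaint"] = any(t in NEGATIVE_WORDS for t in doc["tokens"])
    return doc


def run_pipeline(text, steps):
    """Run the steps in order, accumulating derived fields on one record."""
    doc = {"text": text}
    for step in steps:
        doc = step(doc)
    return doc


v1 = run_pipeline("The app is slow.", [tokenize, count_tokens])
v2 = run_pipeline("The app is slow.", [tokenize, count_tokens, flag_complaint])

# Fields that exist only because a step was added after the first version.
new_fields = sorted(set(v2) - set(v1))
```

<p>Here <code>new_fields</code> is the part of the output that a persistent store would need to grow to accommodate.</p>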
<p>So to sum up:</p>
<ul>
<li>Derived data &#8212; of multiple kinds &#8212; is very important.</li>
<li>If you want to increase the value you get from derived data, you might need to evolve your schema accordingly.</li>
<li>Data derivation sometimes happens to have long processing pipelines; those pipelines might happen to offer clues as to how to do yet better at data derivation in the future; those improvements might happen to lead to schema evolution over time.</li>
</ul>
<p>&#8220;Just the raw facts&#8221; analytic databases are, for the most part, obsolete.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/09/06/derived-data-progressive-enhancement-and-schema-evolution/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>HP/Autonomy sound bites</title>
		<link>http://www.dbms2.com/2011/08/18/hp-autonomy-vertica/</link>
		<comments>http://www.dbms2.com/2011/08/18/hp-autonomy-vertica/#comments</comments>
		<pubDate>Fri, 19 Aug 2011 02:09:31 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[HP and Neoview]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Structured documents]]></category>
		<category><![CDATA[Text]]></category>
		<category><![CDATA[Vertica Systems]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5101</guid>
		<description><![CDATA[HP has announced that: HP is buying Autonomy. HP is pulling back from WebOS. HP may spin off its PC business altogether. On a high level, this means: HP is doubling down on enterprise IT. HP is taking a more software-centric approach to the enterprise IT business. HP is backing away from the consumer electronics [...]]]></description>
			<content:encoded><![CDATA[<p>HP has announced that:</p>
<ul>
<li>HP is buying Autonomy.</li>
<li>HP is pulling back from WebOS.</li>
<li>HP may spin off its PC business altogether.</li>
</ul>
<p>On a high level, this means:</p>
<ul>
<li>HP is doubling down on enterprise IT.</li>
<li>HP is taking a more software-centric approach to the enterprise IT business.</li>
<li>HP is backing away from the consumer electronics business.</li>
<li>HP in particular is backing away from the generic desktop/laptop PC business, which may with only moderate exaggeration be regarded as:
<ul>
<li>The intersection of the enterprise IT and consumer electronics businesses.</li>
<li>The least attractive sector of each.</li>
</ul>
</li>
</ul>
<p><a href="http://www.texttechnologies.com/category/vendors/autonomy/">My coverage of Autonomy</a> isn&#8217;t exactly current, but I don&#8217;t know of anything that contradicts long-time competitor* Dave Kellogg&#8217;s <a href="http://kellblog.com/2011/08/18/hp-rumored-to-be-buying-uks-autonomy-for-10b/">skeptical view of Autonomy</a>. Autonomy is a collection of businesses involved in the management, search, and retrieval of <a href="../../../../../2011/05/17/poly-structured-database/">poly-structured data</a>, in some cases with strong market share, but even so not necessarily with the strongest of reputations for technology or technology momentum. Autonomy started from a text search engine and a Bayesian search algorithm on top of that, which did a decent job for many customers. But if there&#8217;s been much in the way of impressive enhancement over the past 8-10 years, I&#8217;ve missed the news.</p>
<p><em>*Dave, of course, was CEO of MarkLogic.</em></p>
<p>Questions obviously arise about how the Autonomy acquisition relates to other HP businesses. My early thoughts include:  <span id="more-5101"></span></p>
<ul>
<li>HP has clearly signaled that it intends to pursue and focus on the data management business. Thus, we can anticipate marketing messages spanning Autonomy and <a href="../../../../../2011/06/20/vertica-release-5/">Vertica</a>. It may be helpful to recall that Vertica plays nicely with both <a href="../../../../../2010/10/12/vertica-hadoop-connector-integration/">Hadoop</a> and <a href="../../../../../2011/04/14/attensity-update/">Attensity</a>.</li>
<li>The first two natural tuck-in acquisitions I can think to add are Attensity and <a href="../../../../../2011/04/05/whither-marklogic/">MarkLogic</a>.</li>
<li>One place I&#8217;d look for synergy is with HP&#8217;s system management software business. HP has previously acquired its way into a strong position there. If you add in knowledge of how many kinds of data are used, you have a chance to set yourself apart in the system management area.</li>
<li>I had enough trouble advising Vertica about how to explain what they do in terms that HP&#8217;s hardware sales force can comfortably embrace. I think I did OK with that. But Autonomy? Youch. On the other hand, &#8230;</li>
<li>&#8230; HP is run by guys from SAP (Leo Apotheker) and Oracle (Ray Lane), both of whom have dealt with similarly tough sales challenges before. But even at best, HP&#8217;s sales force organization, commission structure, and training are going to consume a lot of attention at the very highest levels of HP.</li>
<li>Autonomy manages documents electronically. HP prints them. The markets where that seems synergistic, however, are fairly specialized or small. (E.g., equipment for printing on demand.) Perhaps there&#8217;s some grand joint venture possibility with Xerox here, antitrust permitting.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/08/18/hp-autonomy-vertica/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Investigative analytics and derived data: Enzee Universe 2011 talk</title>
		<link>http://www.dbms2.com/2011/06/19/investigative-analytics-derived-data/</link>
		<comments>http://www.dbms2.com/2011/06/19/investigative-analytics-derived-data/#comments</comments>
		<pubDate>Sun, 19 Jun 2011 12:13:04 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[GIS and geospatial]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4747</guid>
		<description><![CDATA[I&#8217;ll be speaking Monday, June 20 at IBM Netezza&#8217;s Enzee Universe conference. Thus, as is my custom: I&#8217;m posting draft slides. I&#8217;m encouraging comment (especially in the short time window before I have to actually give the talk). I&#8217;m offering links below to more detail on various subjects covered in the talk. The talk concept [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ll be speaking Monday, June 20 at IBM Netezza&#8217;s <a href="http://www.netezza.com/userconference/abstract.html#620_1200">Enzee Universe</a> conference. Thus, as is my custom:</p>
<ul>
<li>I&#8217;m posting draft <a href="http://www.monash.com/uploads/Enzee-Universe-2011-Monash.ppt">slides</a>.</li>
<li>I&#8217;m encouraging comment (especially in the short time window before I have to actually give the talk).</li>
<li>I&#8217;m offering links below to more detail on various subjects covered in the talk.</li>
</ul>
<p>The talk concept started out as &#8220;advanced analytics&#8221; (as opposed to fast query, a subject amply covered in the rest of any Netezza event), as a lunch break in what is otherwise a detailed &#8220;best practices&#8221; session. So I suggested we constrain the subject by focusing on a specific application area &#8212; customer acquisition and retention, something of importance to almost any enterprise, and which exploits most areas of analytic technology. Then I actually prepared the slides &#8212; and guess what? The mix of subjects will be skewed somewhat more toward generalities than I first intended, specifically in the areas of <strong>investigative analytics </strong>and<strong> derived data. </strong>And, as always when I speak, I&#8217;ll try to raise consciousness about the issues of <a href="../../../../../2011/01/10/privacy-dangers-an-overview/">liberty and privacy</a>, our <a href="../../../../../2010/07/04/fair-data-use/">options as a society for addressing them</a>, and the crucial role we play as an industry in <a href="../../../../../2010/04/04/privacy-liberty-continued/">helping policymakers deal with these technologically-intense subjects</a>.</p>
<p>Slide 3 refers back to a post I made last December, saying there are <a href="../../../../../2011/01/03/the-six-useful-things-you-can-do-with-analytic-technology/">six useful things you can do with analytic technology</a>:</p>
<ul>
<li><strong>Operational BI/Analytically-infused operational apps:</strong> You can make an immediate decision.</li>
<li><strong>Planning and budgeting:</strong> You can plan in support of future decisions.</li>
<li><strong>Investigative analytics (multiple disciplines):</strong> You can research, investigate, and analyze in support of future decisions.</li>
<li><strong>Business intelligence:</strong> You can monitor what’s going on, to see when it is necessary to decide, plan, or investigate.</li>
<li><strong>More BI:</strong> You can communicate, to help other people and organizations do these same things.</li>
<li><strong>DBMS, ETL, and other &#8220;platform&#8221; technologies:</strong> You can provide support, in technology or data gathering, for one of the other functions.</li>
</ul>
<p>Slide 4 observes that <a href="http://www.dbms2.com/2011/03/03/investigative-analytics/">investigative analytics</a>:</p>
<ul>
<li>Is the most rapidly advancing of the six areas &#8230;</li>
<li>&#8230; because it most directly exploits performance &amp; scalability.</li>
</ul>
<p>Slide 5 gives my simplest overview of investigative analytics technology to date:  <span id="more-4747"></span></p>
<ul>
<li>Fast query
<ul>
<li>Persistent storage (any data volume)</li>
<li>RAM (10s-100s of gigabytes, or more)</li>
</ul>
</li>
<li>Fast analytics
<ul>
<li>Predictive modeling</li>
<li>Transformation/tagging</li>
<li>Graph</li>
</ul>
</li>
</ul>
<p>Slide 6 points out that this is all supported by cheap data creation and acquisition, specifically in the area of <a href="http://www.dbms2.com/2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a>, which gets the full benefit of Moore&#8217;s Law.</p>
<p>Slides 7-13 point out how the example problem domain involves lots of analytic tasks performed on lots of kinds of data. Specific examples cited include <a href="http://www.dbms2.com/2011/04/14/attensity-update/">text analytics</a> and <a href="http://www.dbms2.com/2009/08/21/social-network-analysis-aka-relationship-analytics/">graph/relationship analytics</a>.</p>
<p>Slide 14 contains the punch line, so I&#8217;ll quote it in full:</p>
<blockquote><p>Derived data</p>
<ul>
<li>You can’t keep re-analyzing all that in raw form …</li>
<li>&#8230; so don’t.</li>
</ul>
<p><em>If you have one takeaway from this session, let it be the utter importance of derived data. </em></p></blockquote>
<p>Slide 16 lists kinds of <a href="http://www.dbms2.com/2011/05/30/another-category-of-derived-data/">derived data</a> that are important in the single application of reducing telco churn:</p>
<ul>
<li>Normalized data
<ul>
<li>Parsed/sessionized logs</li>
<li>Text/sentiment highlights</li>
<li>Social network graph(s)</li>
<li>Web de-anonymization</li>
<li>Household matching</li>
</ul>
</li>
<li>Scores and buckets
<ul>
<li>Demographic</li>
<li>Psychographic</li>
<li>Offer hot buttons</li>
<li>(Dis)satisfaction</li>
<li>Credit/fraud risk</li>
<li>Lifetime customer value</li>
<li>Influence on others!</li>
</ul>
</li>
</ul>
<p>And finally, Slide 17 is my first pass at best practices for dealing with derived data:</p>
<ul>
<li>Evolving data warehouse schema</li>
<li>Data marts
<ul>
<li>Physical or virtual</li>
<li>Inputs/outputs to “EDW”</li>
</ul>
</li>
<li>“Data science”
<ul>
<li>Research != production</li>
</ul>
</li>
<li>Multiple processing pipelines
<ul>
<li>Log parsing</li>
<li>Text</li>
<li>Predictive analytics</li>
<li>Generic ETL</li>
<li>Streaming “ETL”</li>
</ul>
</li>
</ul>
<p>That last list looks like a starting point for a whole set of interesting future posts.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/06/19/investigative-analytics-derived-data/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Terminology: poly-structured data, databases, and DBMS</title>
		<link>http://www.dbms2.com/2011/05/17/poly-structured-database/</link>
		<comments>http://www.dbms2.com/2011/05/17/poly-structured-database/#comments</comments>
		<pubDate>Tue, 17 May 2011 13:16:06 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Object]]></category>
		<category><![CDATA[Structured documents]]></category>
		<category><![CDATA[Text]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4484</guid>
		<description><![CDATA[My recent argument that the common terms &#8220;unstructured data&#8221; and &#8220;semi-structured data&#8221; are misnomers, and that a word like &#8220;multi-&#8221; or &#8220;poly-structured&#8221;* would be better, seems to have been well-received. But which is it &#8212; &#8220;multi-&#8221; or &#8220;poly-&#8221;? *Everybody seems to like &#8220;poly-structured&#8221; better when it has a hyphen in it &#8212; including me. The [...]]]></description>
			<content:encoded><![CDATA[<p>My recent argument that <a href="../../../../../2011/05/15/what-to-do-about-unstructured-data/">the common terms &#8220;unstructured data&#8221; and &#8220;semi-structured data&#8221; are misnomers</a>, and that a word like &#8220;multi-&#8221; or &#8220;poly-structured&#8221;* would be better, seems to have been well-received. But which is it &#8212; &#8220;multi-&#8221; or &#8220;poly-&#8221;?</p>
<p><em>*Everybody seems to like &#8220;poly-structured&#8221; better when it has a hyphen in it &#8212; including me. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </em></p>
<p>The big difference between the two is that &#8220;multi-&#8221; just means there are multiple structures, while &#8220;poly-&#8221; further means that the structures are subject to change. Upon reflection, I think the &#8220;subject to change&#8221; part is essential, so <strong>poly-structured </strong>it is.</p>
<p>The definitions I&#8217;m proposing are:</p>
<ul>
<li>A <strong>database</strong> is <strong>poly-structured</strong> to the extent that its structure is apt to be changed in the ordinary course of query, update, or programming.</li>
<li><strong>Data</strong> is <strong>poly-structured</strong> to the extent that it is best represented in a poly-structured database.</li>
<li>A <strong>DBMS</strong> is <strong>poly-structured</strong> to the extent that it is oriented to managing poly-structured databases.</li>
</ul>
<p><em><span id="more-4484"></span>Please note:</em></p>
<ul>
<li><em><a href="../../../../../2011/05/15/what-to-do-about-unstructured-data/">There are many different degrees of being poly-structured</a></em><em>; that&#8217;s why I used the phrase &#8220;to the extent that&#8221;, instead of a simple &#8220;if&#8221;. </em></li>
<li><em>And as always, <a href="http://www.strategicmessaging.com/no-market-categorization-is-ever-precise/2011/03/01/">no technology categorization is ever precise</a>.</em></li>
</ul>
<p>Examples of poly-structure include:</p>
<ul>
<li>XML or JSON documents/objects describe themselves. Add a new one to a database with a different structure than the others and &#8212; presto! &#8212; you have changed the overall structure. Thus:
<ul>
<li>XML and JSON data is apt to be poly-structured.</li>
<li>XML and JSON databases are apt to be poly-structured.</li>
<li>MarkLogic, MongoDB, et al. are poly-structured DBMS.</li>
</ul>
</li>
<li>A text document is inherently poly-structured. Some queries might look at it as a bag of words; others might group the words via stemming and synonyms; others might actually exploit the document&#8217;s grammatical structure. Text search engines are poly-structured because they support all those kinds of queries.</li>
<li>A single log file can be somewhat poly-structured, in that different views of it might extract different kinds of name-value pairs, or different temporal relationships.</li>
<li>A database that seamlessly includes a variety of log files, each with its own structure(s), is quite poly-structured.</li>
<li>A classic relational database is not very poly-structured, because DDL (Data Description Language) isn&#8217;t really in &#8220;the ordinary course&#8221; of programming or update.</li>
<li>However, views add a bit of poly-structure to relational databases that is not present in, say, IMS databases.</li>
<li>An object-oriented DBMS is highly poly-structured, as is <a href="../../../../../2010/08/22/workday-technology-stack/">Workday&#8217;s internal data store</a>.</li>
</ul>
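<p>The JSON example above can be sketched in a few lines. This is only an illustration: the &#8220;database&#8221; is a plain Python list of JSON strings, &#8220;structure&#8221; is reduced to the set of top-level field names, and the documents themselves are made up.</p>

```python
import json


def observed_structure(raw_docs):
    """Return the sorted union of top-level field names across JSON documents."""
    fields = set()
    for raw in raw_docs:
        fields.update(json.loads(raw).keys())
    return sorted(fields)


# Two self-describing documents that happen to share a structure.
docs = [
    '{"title": "Q3 report", "author": "A. Smith"}',
    '{"title": "Release notes", "author": "B. Jones"}',
]
before = observed_structure(docs)

# An ordinary insert, with no DDL, changes the overall structure.
docs.append('{"title": "Support ticket", "priority": 2, "tags": ["billing"]}')
after = observed_structure(docs)
```

<p>The structure changed in &#8220;the ordinary course&#8221; of update, which is the contrast with the relational case, where the equivalent change would need DDL.</p>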
<p>So what do you think? Do these definitions work?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/05/17/poly-structured-database/feed/</wfw:commentRss>
		<slash:comments>16</slash:comments>
		</item>
	</channel>
</rss>

