<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; Data types</title>
	<atom:link href="http://www.dbms2.com/category/datatype/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Wed, 08 Feb 2012 22:51:11 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>Sumo Logic and UIs for text-oriented data</title>
		<link>http://www.dbms2.com/2012/02/06/sumo-logic-and-uis-for-text-oriented-data/</link>
		<comments>http://www.dbms2.com/2012/02/06/sumo-logic-and-uis-for-text-oriented-data/#comments</comments>
		<pubDate>Mon, 06 Feb 2012 13:27:06 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5897</guid>
		<description><![CDATA[I talked with the Sumo Logic folks for an hour Thursday. Highlights included: Sumo Logic does SaaS (Software as a Service) log management. Sumo Logic is text indexing/Lucene-based. Thus, it is reasonable to think of Sumo Logic as &#8220;Splunk-like&#8221;. (However, Sumo Logic seems to have a stricter security/trouble-shooting orientation than Splunk, which is trying to [...]]]></description>
			<content:encoded><![CDATA[<p>I talked with the Sumo Logic folks for an hour Thursday. Highlights included:</p>
<ul>
<li>Sumo Logic does SaaS (Software as a Service) log management.</li>
<li>Sumo Logic is text indexing/Lucene-based. Thus, it is reasonable to think of Sumo Logic as &#8220;Splunk-like&#8221;. (However, Sumo Logic seems to have a stricter security/trouble-shooting orientation than Splunk, which is trying to <a href="../../../../../2012/01/10/splunk-update/">branch out</a>.)</li>
<li>Sumo Logic has hacked Lucene for faster indexing, and says 10-30 second latencies are typical.</li>
<li>Sumo Logic&#8217;s main differentiation is <strong>automated classification of events. </strong></li>
<li>There&#8217;s some kind of streaming engine in the mix, to update counters and drive alerts.</li>
<li>Sumo Logic has around 30 &#8220;customers,&#8221; free (mainly) or paying (around 5) as the case may be.</li>
<li>A truly typical Sumo Logic customer has single to low double digits of gigabytes of log data per day. However, Sumo Logic seems highly confident in its ability to handle a terabyte per customer per day, give or take a factor of 2.</li>
<li>When I asked about the implications of shipping that much data to a remote data center, Sumo Logic observed that log data compresses really well.</li>
<li>Sumo Logic recently raised a bunch of venture capital.</li>
<li>Sumo Logic&#8217;s founders are out of ArcSight, a log management company HP paid a bunch of money for.</li>
<li>Sumo Logic coined a marketing term &#8220;LogReduce&#8221;, but it has nothing to do with &#8220;MapReduce&#8221;. Sumo Logic seems to find this amusing.</li>
</ul>
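<p>The claim that log data compresses really well is easy to check in miniature. The Python sketch below builds some synthetic, repetitive log lines (invented for illustration, not Sumo Logic data) and compresses them with the standard zlib module. Real-world ratios will depend on the logs and the compressor.</p>

```python
import zlib

# Synthetic log lines, made up for illustration; real logs will differ.
lines = [
    "2012-02-06 13:27:%02d INFO web-%d GET /index.html 200 %d bytes"
    % (i % 60, i % 4, 1000 + i)
    for i in range(5000)
]
raw = "\n".join(lines).encode("utf-8")
packed = zlib.compress(raw, 6)

ratio = len(raw) / len(packed)
print("raw bytes:", len(raw))
print("compressed bytes:", len(packed))
print("compression ratio: %.1f" % ratio)
```

<p>Because most of each line repeats from one event to the next, the compressed size should come out far smaller than the raw size.</p>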
<p>What interests me about Sumo Logic is that automated classification story. I thought I heard Sumo Logic say:<span id="more-5897"></span></p>
<ul>
<li>It&#8217;s largely unsupervised machine learning.</li>
<li>It&#8217;s specific to a particular user/data set.</li>
<li>It can be up and running and classifying things effectively almost instantly (i.e., on seconds&#8217; or minutes&#8217; worth of data).</li>
<li>It&#8217;s informed by what different users tag as false positives. (Or maybe that is planned for future versions.)</li>
</ul>
<p><em>I have a little trouble seeing how all those points fit exactly together, so perhaps I got some details wrong.</em></p>
<p>The payoff is that <strong>machine learning directly informs the Sumo Logic user interface</strong>. In particular, large numbers of events are bundled into a small number of categories, hopefully making it much easier for network operations types to scan the UI and pick out what&#8217;s important.</p>
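<p>To make the bundling idea concrete, here is a minimal Python sketch that groups events by a crude signature. This is not Sumo Logic&#8217;s algorithm, which was not described in detail. It is a simple stand-in that masks variable-looking tokens and counts what remains, using invented log lines.</p>

```python
import re
from collections import Counter

def signature(line):
    """Replace variable-looking tokens (IP-like strings, numbers) with placeholders."""
    line = re.sub(r"\d+\.\d+\.\d+\.\d+", "IP", line)
    line = re.sub(r"\d+", "N", line)
    return line

# Hypothetical log events, invented for illustration.
events = [
    "Failed login for user 1001 from 10.0.0.5",
    "Failed login for user 1002 from 10.0.0.9",
    "Disk usage at 91 percent on volume 3",
    "Failed login for user 1003 from 10.0.0.7",
    "Disk usage at 95 percent on volume 1",
]

# Many events collapse into a few categories.
categories = Counter(signature(e) for e in events)
for sig, count in categories.most_common():
    print(count, sig)
```

<p>Five events collapse into two categories here, which is the kind of reduction that would make a UI easier to scan.</p>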
<p>In general, the idea of machine-learning informing analytic UIs via some sort of classification is common in text-oriented technologies, notably in:</p>
<ul>
<li>Good ol&#8217; text search.</li>
<li>Text mining vendors&#8217; approaches to clustering hits on words or phrases that say substantially the same thing.</li>
</ul>
<p>But otherwise it seems kind of rare, if we stipulate that ad-serving/general internet personalization isn&#8217;t really an analytic UI &#8212; but I&#8217;d love to hear of any interesting examples I&#8217;ve overlooked.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/02/06/sumo-logic-and-uis-for-text-oriented-data/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Splunk update</title>
		<link>http://www.dbms2.com/2012/01/10/splunk-update/</link>
		<comments>http://www.dbms2.com/2012/01/10/splunk-update/#comments</comments>
		<pubDate>Tue, 10 Jan 2012 05:55:08 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Structured documents]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5791</guid>
		<description><![CDATA[Splunk is announcing the Splunk 4.3 point release. Before discussing it, let&#8217;s recall a few things about Splunk, starting with: Splunk is first and foremost an analytic DBMS &#8230; &#8230; used to manage logs and similar multistructured data. Splunk&#8217;s DML (Data Manipulation Language) is based on text search, not on SQL. Splunk has extended its [...]]]></description>
			<content:encoded><![CDATA[<p>Splunk is announcing the Splunk 4.3 point release. Before discussing it, let&#8217;s recall a few things about Splunk, starting with:</p>
<ul>
<li>Splunk is first and foremost an analytic DBMS &#8230;</li>
<li>&#8230; used to manage logs and similar multistructured data.</li>
<li>Splunk&#8217;s DML (Data Manipulation Language) is based on text search, not on SQL.</li>
<li>Splunk has extended its DML in natural ways (e.g., you can use it to do calculations and even some statistics).</li>
<li>Splunk bundles some (very) basic, Splunk-specific business intelligence capabilities.</li>
<li>The paradigmatic use of Splunk is to monitor IT operations in real time. However:
<ul>
<li>There also are plenty of non-real-time uses for Splunk.</li>
<li>Splunk is proudest of its growth in non-IT quasi-real-time uses, such as the marketing side of web operations.</li>
</ul>
</li>
</ul>
<p>As in any release, a lot of Splunk 4.3 is about &#8220;Oh, you didn&#8217;t have that before?&#8221; features and <a href="../../../../../2009/08/21/bottleneck-whack-a-mole/">Bottleneck Whack-A-Mole</a> performance speed-up. One performance enhancement is Bloom filters, which are a very hot topic these days. More important is a switch from Flash to HTML5, so as to accommodate mobile devices with less server-side rendering. Splunk reports that its users &#8212; especially the non-IT ones &#8212; really want to get Splunk information on tablet devices. While this somewhat contradicts <a href="../../../../../2012/01/04/some-issues-in-business-intelligence/">what I wrote a few days ago pooh-poohing mobile BI</a>, let me hasten to point out:</p>
<ul>
<li>Splunk is used for a lot of (quasi) real-time monitoring.</li>
<li>Splunk&#8217;s desktop user interfaces are, by BI standards, quite primitive.</li>
</ul>
<p>That&#8217;s pretty much the ideal scenario for mobile BI: Timeliness matters and prettiness doesn&#8217;t.</p>
<p><span id="more-5791"></span><em>Hmm. Maybe <a href="../../../../../2011/11/10/streambase-liveview-push-based-real-time-bi/">StreamBase LiveView</a> needs a mobile option as well &#8230;</em></p>
<p>Splunk&#8217;s basic use is to take the text string that is a log and make sense of it. But Splunk now also supports JSON structures. It does this via something called spath, which, as you might guess from the name, has XPath similarities. That probably deserved more discussion than we found time for.</p>
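<p>The general idea of addressing pieces of a JSON structure by path can be sketched in a few lines of Python. The dotted path syntax and the <code>get_path</code> helper below are hypothetical, made up for illustration. They are not spath&#8217;s actual syntax or implementation.</p>

```python
import json

def get_path(doc, path):
    """Walk a dotted path such as 'user.roles.1' through nested dicts and lists."""
    current = doc
    for step in path.split("."):
        if isinstance(current, list):
            current = current[int(step)]   # numeric step indexes into a list
        else:
            current = current[step]        # other steps look up a dict key
    return current

# Hypothetical JSON log event, invented for illustration.
event = json.loads(
    '{"user": {"name": "alice", "roles": ["admin", "ops"]}, "status": 200}'
)

print(get_path(event, "user.name"))
print(get_path(event, "user.roles.1"))
print(get_path(event, "status"))
```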
<p><em>By the way: If you&#8217;re interested in BI over XML, that&#8217;s what my former clients at Skytide were founded to do, before they pivoted a bit. I don&#8217;t think those capabilities have disappeared from the product</em>.</p>
<p><a href="http://www.monash.com/uploads/Splunk-4-3.pdf">Splunk has graciously allowed me to post a slide deck</a>. More stuff in there, including quotes from a customer &#8212; Expedia &#8212; that has 2700 Splunk users.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/01/10/splunk-update/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Lessons from T-Mobile&#8217;s epic fail</title>
		<link>http://www.dbms2.com/2011/11/04/lessons-from-t-mobiles-epic-fail/</link>
		<comments>http://www.dbms2.com/2011/11/04/lessons-from-t-mobiles-epic-fail/#comments</comments>
		<pubDate>Fri, 04 Nov 2011 06:01:25 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5593</guid>
		<description><![CDATA[When my electric power came back on but my Verizon FiOS internet connection didn&#8217;t, it was time for a mobile hotspot/prepaid wireless internet service. T-Mobile&#8217;s 4G Mobile Hotspot/Prepaid Mobile Broadband offering seemed like a good choice. But the experience of setting it up was a nightmare, and a possible instructive nightmare at that. T-Mobile&#8217;s instructions [...]]]></description>
			<content:encoded><![CDATA[<p>When my electric power came back on but my Verizon FiOS internet connection didn&#8217;t, it was time for a mobile hotspot/prepaid wireless internet service. T-Mobile&#8217;s 4G Mobile Hotspot/Prepaid Mobile Broadband offering seemed like a good choice. But the experience of setting it up was a nightmare, and a possibly instructive nightmare at that.</p>
<p>T-Mobile&#8217;s instructions tell you that you need to know the factory defaults for network name and password. That makes sense. They don&#8217;t tell you that you also need to know your SIM card number (included), IMEI number (included), or authorization number (not included).</p>
<p>That&#8217;s right &#8212; you need a number that T-Mobile doesn&#8217;t tell you you need. But the story gets a lot worse from there, because it&#8217;s almost impossible to get the number from them. I eventually talked with approximately 8 T-Mobile call center associates over the course of the evening before getting successfully connected.</p>
<p><span id="more-5593"></span><em>One of the few redeeming features in this story is that T-Mobile call center folks pick up the phone quickly. One of the many non-redeeming ones is that they efficiently give you inaccurate information after they do. The one who finally got me the right answer was a young woman who put me on hold to call an internal resource approximately four times before finally handling the situation correctly.</em></p>
<p>At one point I found somewhat helpful information by searching on T-Mobile&#8217;s website for <em>mobile hotspot activation.</em> However, the same information did not surface on my earlier searches on strings like <em>activate mobile hotspot.</em> Stemming has been a basic feature of search engines since the mid 1990s, but evidently T-Mobile&#8217;s technological choices aren&#8217;t as current as that. Other inexcusable T-Mobile mistakes include:</p>
<ul>
<li>Not providing information about the need for an &#8220;activation number&#8221; in the product&#8217;s paper documentation.</li>
<li>Taking the buyer to a sign-on screen that doesn&#8217;t lead to the call center reps responsible for the product being signed-on for.</li>
<li>Not providing call center operators with the tools they need to get callers to the right place.</li>
</ul>
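<p>As for the stemming point above: the sketch below shows, in Python, how even a crude stemmer makes the two queries equivalent. The suffix list is a toy invented for illustration, far simpler than a real stemming algorithm, and says nothing about how T-Mobile&#8217;s site search actually works.</p>

```python
# Toy suffix list, checked in order; invented for illustration.
SUFFIXES = ("ation", "ating", "ated", "ate", "ing", "es", "s")

def crude_stem(word):
    """Strip one common English suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

def normalize(query):
    """Reduce a query to an order-independent set of stems."""
    return frozenset(crude_stem(w) for w in query.split())

a = normalize("mobile hotspot activation")
b = normalize("activate mobile hotspot")
print(sorted(a))
print(a == b)
```

<p>Both queries reduce to the same set of stems, so a search engine that matched on stems would treat them alike.</p>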
<p>Perhaps Elbonian* contractors were involved.</p>
<p><em>*Elbonia is a fictitious country of outsourcers in Scott Adams&#8217; </em>Dilbert<em> comic strip. <a href="http://dilbert.com/strips/comic/2006-03-25/">Elbonian work is not noted for its high quality</a>. </em></p>
<p>In case you haven&#8217;t guessed yet, the missing T-Mobile &#8220;activation number&#8221; was tantamount to a telephone number, complete with area code. <strong>T-Mobile was insisting on assigning a telephone number for a service that had nothing to do with making or receiving telephone calls.</strong> While I can believe there was some legitimate database/application design reason for having such inflexibility under the covers, it&#8217;s hard to see why T-Mobile didn&#8217;t get a composite application tool and hack a front-end that automatically generates the number without call-center intervention.</p>
<p>I did eventually get connected, and in my limited experience with T-Mobile’s Prepaid/Mobile Hotspot 4G “Broadband” offering, I get the impression it has good speed but conventionally flaky WiFi reliability. It may well be a T-Mobile service that is worthy of great success. <em>Edit: That&#8217;s not true.*</em> But it&#8217;s not going to experience such success as long as T-Mobile idiotically infuriates its users at the very start of the relationship.</p>
<p><em>*Edit: It turns out that the T-Mobile Mobile Hotspot device has terrible range. We can use it to get online in my office, Linda&#8217;s office, or the living room/dining room area, but no 2 of the 3 at once.</em></p>
<p>My takeaways from this story include:</p>
<ul>
<li>Use competent documentation writers.</li>
<li>Run usability testing on your entire customer-experience processes.</li>
<li>Test your site search engine for usefulness.</li>
</ul>
<p>Beyond that, there&#8217;s not a single part of this story for which there isn&#8217;t a straightforward fix, most of them alluded to above.</p>
<p>If you see anything of your organization in this story, it&#8217;s probably time to raise your standards. This is obviously a different kind of failure than, say, the one last year at <a href="http://www.dbms2.com/category/users/jpmorgan-chase/">Chase</a>. Even so, &#8220;as awful as T-Mobile&#8221; would be a sad state to endure.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/04/lessons-from-t-mobiles-epic-fail/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MarkLogic 5, and why you might care</title>
		<link>http://www.dbms2.com/2011/11/01/marklogic-version-5/</link>
		<comments>http://www.dbms2.com/2011/11/01/marklogic-version-5/#comments</comments>
		<pubDate>Tue, 01 Nov 2011 04:03:59 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Structured documents]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5560</guid>
		<description><![CDATA[MarkLogic is releasing MarkLogic 5. Key elements of the announcement are: More-of-the-same in line with MarkLogic’s core positioning. A new bi-directional Hadoop connector. A free MarkLogic Express edition, limited in license terms more than in actual features, as per Slide 27 of the deck MarkLogic graciously supplied for me to post. Also, MarkLogic is early [...]]]></description>
			<content:encoded><![CDATA[<p>MarkLogic is releasing MarkLogic 5. Key elements of the announcement are:</p>
<ul>
<li>More-of-the-same in line with MarkLogic’s core positioning.</li>
<li>A new bi-directional Hadoop connector.</li>
<li>A free MarkLogic Express edition, limited in license terms more than in actual features, as per Slide 27 of <a href="http://www.monash.com/uploads/MarkLogic-5-Deck.pptx">the deck MarkLogic graciously supplied for me to post</a>.</li>
</ul>
<p>Also, MarkLogic is early with a feature that most serious DBMS vendors will soon have – support for tiered storage, with writes going first to solid-state storage, then being flushed to disk via a caching-style algorithm.* And as befits a sometime search-engine-substitute, MarkLogic has finally licensed a large set of document filters, from an Australian company called <a href="http://www.isys-search.com/index.html">Isys</a>. Apparently, the special virtue of the Isys filters is that they’re good at extracting not only text, but metadata as well.</p>
<p><em>*If there’s a caching algorithm that doesn’t contain a major element of LRU (Least Recently Used), I don’t recall ever hearing about it.</em></p>
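<p>A minimal Python sketch of the tiered-storage idea follows: writes land in a small fast tier, and the least recently used entries are flushed to a slow tier when the fast tier overflows. The <code>TieredStore</code> class is hypothetical and is not MarkLogic&#8217;s actual algorithm, which was not described in detail.</p>

```python
from collections import OrderedDict

class TieredStore:
    """Toy two-tier store: a small fast tier in front of a large slow tier."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()   # stands in for solid-state storage
        self.slow = {}              # stands in for disk

    def write(self, key, value):
        # Writes land in the fast tier first.
        self.fast[key] = value
        self.fast.move_to_end(key)
        self._flush_if_needed()

    def read(self, key):
        if key in self.fast:
            self.fast.move_to_end(key)   # mark as recently used
            return self.fast[key]
        return self.slow[key]

    def _flush_if_needed(self):
        # Flush least recently used entries to the slow tier.
        while len(self.fast) > self.fast_capacity:
            old_key, old_value = self.fast.popitem(last=False)
            self.slow[old_key] = old_value

store = TieredStore(fast_capacity=2)
store.write("a", 1)
store.write("b", 2)
store.read("a")        # "a" is now more recently used than "b"
store.write("c", 3)    # overflow: "b" is flushed to the slow tier
print(list(store.fast), list(store.slow))
```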
<p>MarkLogic seems to have settled on a positioning that, although distressingly buzzword-heavy, is at least partly based upon reality. The real part includes:</p>
<ul>
<li>MarkLogic is a serious, enterprise-class DBMS (see for example Slide 12 of <a href="http://www.monash.com/uploads/MarkLogic-5-Deck.pptx">the MarkLogic deck</a>) …</li>
<li>… which has been optimized from the getgo for <a href="../../../../../2011/05/17/poly-structured-database/">poly-structured data</a>.</li>
<li>MarkLogic can and does scale out to handle large amounts of data.</li>
<li>MarkLogic is a general-purpose DBMS, suitable for <a href="../../../../../2011/03/30/short-request-and-analytic-processing/">both short-request and analytic tasks</a>.</li>
<li>MarkLogic is particularly well suited for analyses with long chains of “progressive enhancement” (MarkLogic’s favorite term when talking about <a href="../../../../../2011/05/30/another-category-of-derived-data/">derived data</a>).</li>
<li><a href="http://blogs.avalonconsult.com/blog/search/is-marklogic-a-search-engine/">MarkLogic often plays the role of a content assembler and/or search engine</a>, and the people who use MarkLogic in those ways are commonly doing things that can be described as research and analysis.</li>
</ul>
<p>Based on that reality, MarkLogic talks a lot about Volume, Velocity, Variety, Big Data, unstructured data, semi-structured data, and big data analytics.</p>
<p><span id="more-5560"></span><em>My <a href="../../../../../2010/11/29/marklogic-and-its-document-dbms/">November, 2010 overview of MarkLogic technology</a> remains pretty relevant. One correction, however: Node heterogeneity configurations, in which “data” and “evaluation” nodes reside on separate servers, are the exception rather than the rule.</em></p>
<p>Like <a href="../../../../../2011/10/18/vertica-community-edition/">Vertica</a>, MarkLogic has laudably said that true academic researchers can get MarkLogic for free without the severe license restrictions. Free MarkLogic should be of particular interest to researchers who:</p>
<ul>
<li>Are studying natural networks or graphs, such as social networks or biological pathways. (This might be a fit in the social or biological sciences.)</li>
<li>Are managing metadata for, say, a variety of disparate kinds of experimental files. (This might be a fit anywhere in the natural sciences.)</li>
<li>Are managing actual documents, images, videos, etc., or data about such things. (This might be a fit in the humanities or social sciences.)</li>
</ul>
<p>MarkLogic provided some disclosable financial substance by email, which I shall quote verbatim:</p>
<ul>
<li><em>MarkLogic has 45% revenue growth and 55-60% license growth year over year.</em></li>
<li><em>We expect to finish this year with over $85 million in revenue, up from $55 million last year.</em></li>
</ul>
<p>Arithmetical purists might note that 85/55 is more than 145%, but I’m just going to settle for the information I got and move on.</p>
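<p>For anyone who wants to check that arithmetic, here it is as a few lines of Python, using only the two revenue figures quoted above. Since the email said “over $85 million”, the result is a lower bound.</p>

```python
last_year = 55.0   # revenue in $ millions, as quoted
this_year = 85.0   # "over $85 million", so this is a lower bound

growth = (this_year - last_year) / last_year
print("implied revenue growth: %.1f%%" % (growth * 100))
```

<p>That works out to roughly 54.5% growth, not the 45% stated.</p>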
<p><em>Edit: I posted separately about the <a href="http://www.dbms2.com/2011/11/03/marklogic-hadoop-connector/">MarkLogic Hadoop connector.</a></em> <span style="text-decoration: line-through;">As for that Hadoop connector – stay tuned for a short follow-up post, as writing about it now would not be convenient. (My backup discipline isn’t what it should be, and the only copy of my notes about that product is on a heavy tower computer in a house that doesn’t have working power.)</span></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/01/marklogic-version-5/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Text data management, Part 3: Analytic and progressively enhanced</title>
		<link>http://www.dbms2.com/2011/10/10/text-data-management-part-3-analytic-and-progressively-enhanced/</link>
		<comments>http://www.dbms2.com/2011/10/10/text-data-management-part-3-analytic-and-progressively-enhanced/#comments</comments>
		<pubDate>Tue, 11 Oct 2011 01:59:17 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5420</guid>
		<description><![CDATA[This is Part 3 of a three post series. The posts cover: Confusion about text data management. Choices for text data management (general and short-request). Choices for text data management (analytic). I&#8217;ve gone on for two long posts about text data management already, but even so I&#8217;ve glossed over a major point: Using text data [...]]]></description>
			<content:encoded><![CDATA[<p><em>This is Part 3 of a three post series. The posts cover:</em></p>
<ol>
<li><em><a href="../2011/10/10/text-data-management-confusion/">Confusion about text data management</a>.</em></li>
<li><em><a href="../2011/10/10/text-data-management-part-2-general-and-short-request/">Choices for text data management (general and short-request)</a>.</em></li>
<li><em><a href="../2011/10/10/text-data-management-part-3-analytic-and-progressively-enhanced/">Choices for text data management (analytic)</a>.</em></li>
</ol>
<p>I&#8217;ve gone on for two long posts about text data management already, but even so I&#8217;ve glossed over a major point:</p>
<p><strong>Using text data commonly involves a long series of data enhancement steps.</strong></p>
<p>Even before you do what we&#8217;d normally think of as &#8220;analysis&#8221;, text markup can include steps such as:</p>
<ul>
<li>Figure out where the words break.</li>
<li>Figure out where the clauses and sentences break.</li>
<li>Figure out where the paragraphs, sections, and chapters break.</li>
<li>(Where necessary) map the words to similar ones &#8212; spelling correction, stemming, etc.</li>
<li>Figure out which words are grammatically which parts of speech.</li>
<li>Figure out which pronouns and so on refer to which other words. (Technical term: Anaphora resolution.)</li>
<li>Figure out what was being said, one clause at a time.</li>
<li>Figure out the emotion &#8212; or &#8220;sentiment&#8221; &#8212; associated with it.</li>
</ul>
<p>Those processes can add up to dozens of steps. And maybe, six months down the road, you&#8217;ll think of more steps yet.</p>
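<p>A toy Python version of such a pipeline is sketched below. Each step adds a new derived field to the document, which is why a flexible schema is convenient. The three steps, the regular expressions, and the sentiment word lists are all simplistic stand-ins invented for illustration, not how any text analytics product works.</p>

```python
import re

def split_sentences(doc):
    parts = re.split(r"[.!?]+", doc["text"])
    doc["sentences"] = [p.strip() for p in parts if p.strip()]
    return doc

def split_words(doc):
    sentences = doc["sentences"]
    doc["words"] = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    return doc

def tag_sentiment(doc):
    # Hypothetical word lists, invented for illustration.
    positive = {"good", "great"}
    negative = {"nightmare", "terrible"}
    scores = []
    for words in doc["words"]:
        score = sum(w in positive for w in words) - sum(w in negative for w in words)
        scores.append(score)
    doc["sentiment"] = scores
    return doc

# New steps can be appended later without redesigning what came before.
PIPELINE = [split_sentences, split_words, tag_sentiment]

doc = {"text": "The speed is good. Setup was a nightmare!"}
for step in PIPELINE:
    doc = step(doc)   # each step adds a new derived field

print(sorted(doc.keys()))
print(doc["sentiment"])
```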
<p><span id="more-5420"></span>So when you manage text, it is convenient to assume <a href="../../../../../2011/07/31/dynamic-fixed-schema-databases/">dynamic schemas</a>. That would be an argument for using MarkLogic, NoSQL document stores, and/or Hadoop, rather than strictly relational systems.</p>
<p>That said, text analytics can be done perfectly well in relational databases. Again, I point you to the example of <a href="../../../../../2011/04/14/attensity-update/">Attensity</a>, which will extract for you a large fraction of the information that can be gotten out of the text, put it into a convenient relational schema, and let you get to work. Once the principal extraction has been done, there&#8217;s no reason why your <a href="../../../../../2011/09/06/derived-data-progressive-enhancement-and-schema-evolution/">derived data</a> issues need be any more complex than others you deal with relationally, especially on the analytic side of the house.</p>
<p>But what if you want to do your own text enhancement, rather than using a third party tool? The first thing to ask yourself is &#8212; why? With all due respect to the 10-20 internet-centric companies that are having fun reinventing large portions of the data processing wheel &#8212; if you&#8217;re not at one of those companies, you should probably be trying to use as much third-party software as you possibly can.</p>
<p>I can think of a couple of cases where rolling your own technology makes sense, namely:</p>
<ul>
<li>The hard part of what you&#8217;re doing is extracting snippets of text from some data format proprietary to you.</li>
<li>You&#8217;re trying to do very simple things across a variety of languages much broader than the 10-20 that the text analytics vendors currently do a halfway decent job of handling.</li>
</ul>
<p>I can&#8217;t think of many others.</p>
<p>One thing I&#8217;d definitely be wary of is using Hadoop as a <a href="../../../../../2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a> for individual documents in a variety of formats. I don&#8217;t know what you&#8217;d do with them once they&#8217;re there. Yes, Google invented MapReduce in part to do things like document indexing &#8212; but you&#8217;d probably prefer not to reinvent the Google stack. That&#8217;s quite apart from questions as to whether your document count exceeds Hadoop&#8217;s comfortable <a href="../../../../../2011/08/21/hadoop-evolution/">file-count limit</a>. Solr is a different matter; but while Solr and Hadoop are both open source projects that can be traced back to Doug Cutting, otherwise they&#8217;re rather different things.</p>
<p>A useful way of looking at your choices may be to ask:</p>
<p><strong>After text has run through the main pipeline of manipulation and information extraction:</strong></p>
<ul>
<li><strong>What will the output look like?</strong></li>
<li><strong>Where do I want that output to end up?</strong></li>
</ul>
<p>If the output has to be something that fits into a structured/relational analytic system, then it should probably go into a relational DBMS. If you&#8217;re going to do social network analysis of the sort you&#8217;d ideally like to do in a graph database &#8212; well, unless you&#8217;re an intelligence agency with blank-check resources, you&#8217;ll probably still end up opting for a relational DBMS. If the output consists of simple, homogeneous text files, plus a few fields of metadata, and you&#8217;re not going to do much analysis of it, it can pretty much go anywhere; either SQL or NoSQL might suit your purposes. If you want maximum power and flexibility, MarkLogic may be the ideal destination.</p>
<p>From there, the next question is:</p>
<ul>
<li><strong>What pipeline should the text run through to get to its final destination?</strong></li>
</ul>
<p>Often, as I&#8217;ve argued, the right answer is a third-party text analytic system. Those can generally consume text in almost any kind of file format. Other times &#8212; less often than you may think &#8212; it&#8217;s Hadoop. OK, then pass it through Hadoop. Other possibilities could come up as well (text search engines aren&#8217;t really as useless as I may have seemed to be suggesting).</p>
<p>Anyhow, when you&#8217;ve established where text starts out (that&#8217;s usually a given), what it passes through (please see above), and where its best parts need to end up (ditto), you&#8217;ve done the hardest parts. Figuring out the rest of your text management architecture should be relatively easy by comparison.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/10/10/text-data-management-part-3-analytic-and-progressively-enhanced/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Text data management, Part 2: General and short-request</title>
		<link>http://www.dbms2.com/2011/10/10/text-data-management-part-2-general-and-short-request/</link>
		<comments>http://www.dbms2.com/2011/10/10/text-data-management-part-2-general-and-short-request/#comments</comments>
		<pubDate>Tue, 11 Oct 2011 01:58:49 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5419</guid>
		<description><![CDATA[This is Part 2 of a three post series. The posts cover: Confusion about text data management. Choices for text data management (general and short-request). Choices for text data management (analytic). I&#8217;ve recently given widely varied advice about managing text (and similar files &#8212; images and so on), ranging from Sure, just keep going with [...]]]></description>
			<content:encoded><![CDATA[<p><em>This is Part 2 of a three post series. The posts cover:</em></p>
<ol>
<li> <em><a href="../2011/10/10/text-data-management-confusion/">Confusion about text data management</a>.</em></li>
<li><em><a href="../2011/10/10/text-data-management-part-2-general-and-short-request/">Choices for text data management (general and short-request)</a>.</em></li>
<li><em><a href="../2011/10/10/text-data-management-part-3-analytic-and-progressively-enhanced/">Choices for text data management (analytic)</a>.</em></li>
</ol>
<p>I&#8217;ve recently given widely varied advice about managing text (and similar files &#8212; images and so on), ranging from</p>
<blockquote><p>Sure, just keep going with your old strategy of keeping .PDFs in the file system and pointing to them from the relational database. That&#8217;s an easy performance optimization vs. having the RDBMS manage them as BLOBs.</p></blockquote>
<p>to</p>
<blockquote><p>I suspect MongoDB isn&#8217;t heavyweight enough for your document management needs, let alone just dumping everything into Hadoop. Why don&#8217;t you take a look at MarkLogic?</p></blockquote>
<p>Here are some reasons why.</p>
<p>There are three basic kinds of text management use case:</p>
<ul>
<li><strong>Text as payload. </strong></li>
<li><strong>Text as search parameter.</strong></li>
<li><strong>Text as analytic input.</strong></li>
</ul>
<p><span id="more-5419"></span>The simplest way to manage text electronically is to:</p>
<ul>
<li>Store it as whole documents &#8212; scanned images, .PDFs, word processing files, whatever.</li>
<li>Find it only via fielded metadata, perhaps manually created &#8212; title, author, date, and so on.</li>
</ul>
<p>For example, an application for college admission is accompanied by recommendation letters, transcripts, and so on; those are then moved around as dumb payloads, until such time as an admissions officer reads them. Most relational database management systems can manage BLOBs (Binary Large OBjects), but performance may be better if you leave the big objects outside the relational system. For text-as-payload, that way of managing documents can often suffice.</p>
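<p>The text-as-payload pattern can be sketched in Python with the standard sqlite3 module: the file stays in the file system, and the relational table holds only fielded metadata plus a pointer to it. The table layout, applicant name, and file name are hypothetical, invented for illustration.</p>

```python
import os
import sqlite3
import tempfile

# Hypothetical file standing in for a scanned recommendation letter.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "recommendation_letter.pdf")
with open(path, "wb") as f:
    f.write(b"%PDF-1.4 placeholder bytes, not a real PDF")

# The database stores metadata and a path, not the document itself.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE documents "
    "(id INTEGER PRIMARY KEY, applicant TEXT, doc_type TEXT, path TEXT)"
)
db.execute(
    "INSERT INTO documents (applicant, doc_type, path) VALUES (?, ?, ?)",
    ("Jane Doe", "recommendation", path),
)

# Find the document via fielded metadata, then read the payload from disk.
row = db.execute(
    "SELECT path FROM documents WHERE applicant = ? AND doc_type = ?",
    ("Jane Doe", "recommendation"),
).fetchone()
with open(row[0], "rb") as f:
    payload = f.read()
print(len(payload), "bytes read from", os.path.basename(row[0]))
```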
<p>In other cases, the text may be so short that it naturally fits into a character field in a relational database. This is particularly likely when the text is typed in at the time of record creation, e.g. by a call center operator, a doctor, or a customer entering a support ticket. In such cases, leaving it under the management of an RDBMS makes perfect sense.</p>
<p>In many situations you actually want to search based on the content of the text. Unless you&#8217;re doing simple search on short text snippets in relational character fields, that generally calls for some kind of <strong>text index.</strong> Text indexes are generally found in text search engines. That said, however:</p>
<ul>
<li>Anything that can be done in a standalone text search engine can in principle also be integrated into relational DBMS and other data stores. (How well that works in practice is another matter.)</li>
<li>Text search engines commonly index data <em>in situ</em> that they don&#8217;t actually manage.</li>
</ul>
<p>In theory, then:</p>
<ul>
<li>You can store your text in your chosen DBMS or outside it, as pleases you.</li>
<li>Your chosen DBMS can have a text search capability.</li>
<li>This text search capability can be integrated with the query method used to get at the rest of the data managed by that DBMS.</li>
</ul>
<p>That theory is commonly reflected in actual products, such as Oracle and DB2.</p>
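<p>To make &#8220;text index&#8221; concrete: at heart it is usually an inverted index, i.e. a map from each term to the documents that contain it. The Python below is a toy sketch under that assumption, with hypothetical function names and sample documents; real search engines add ranking, stemming, phrase handling and on-disk structures that this ignores.</p>

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result = result.intersection(index.get(term, set()))
    return result


# Hypothetical sample documents, keyed by id.
docs = {
    1: "Signed contract for database licenses",
    2: "Support ticket about database performance",
    3: "Recommendation letter",
}
index = build_index(docs)
```

<p>Note that the index only needs document ids, not the documents themselves, which is why a search engine can index text that some other system actually manages.</p>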
<p>As so often, then, the choice of how to manage text comes down to issues such as performance, programming ease, and other components of total cost of ownership (or of some other general &#8220;goodness&#8221; metric, such as time-to-value). As a general rule, it seems:</p>
<ul>
<li>Text indexing inside relational DBMS has poorer performance than in, say, text search engines, often drastically so.</li>
<li>BLOB management inside relational databases has poorer performance than leaving the files outside the DBMS&#8217; purview.</li>
<li>Relational DBMS do just fine at managing text strings up to, say, 2048 characters long.</li>
<li>Tight integration between text search and SQL is valuable in a few applications, but irrelevant to many others.</li>
</ul>
<p>And so, to a first approximation:</p>
<ul>
<li><strong>If you just have short text snippets</strong>, it can make sense to <strong>leave them in your relational database.</strong></li>
<li><strong>If performance is not an issue,</strong> you can just leave your BLOBs in your relational database too.</li>
<li><strong>If performance is an issue,</strong> you probably want to have <strong>your larger text files outside your RDBMS&#8217; control.</strong></li>
</ul>
<p>However, that phrasing assumes the default option is a relational DBMS, which may not be the case at all. Other choices include:</p>
<ul>
<li><strong>Standalone text search engines.</strong> If you want the best available text search, get a search engine. But attempts to start with a search engine and wind up with an application platform have generally run into difficulty.</li>
<li><strong>Document-oriented (or other) NoSQL </strong>systems. The story here is surprisingly like that for relational DBMS. I&#8217;ve previously noted that <a href="../../../../../2011/02/07/notes-on-document-oriented-nosql/">document-oriented NoSQL systems manage objects, not &#8220;documents&#8221; in the ordinary sense of the word</a>. Even so, conceptually they&#8217;re no less suited for management of true documents than relational DBMS are. I&#8217;d guess that the correlation between use cases involving true documents and use cases where document-oriented NoSQL is suitable is positive, but not very strong.</li>
<li><strong>MarkLogic.</strong> From one standpoint, MarkLogic is just a heavier-weight version of document-oriented NoSQL. But MarkLogic&#8217;s XML (and XQuery) orientation, tuned-for-years indexing, and built-in search engine put it on a different level for document management than the upstarts have reached.</li>
</ul>
<p>I&#8217;ll cover the Hadoop option in the next, more analytically-focused post.</p>
<p>I hope I&#8217;ve demonstrated that there are appropriate use cases for each of:</p>
<ul>
<li>Letting documents be managed by the file system (and pointing to them from your preferred DBMS).</li>
<li>Sticking documents straight into your preferred DBMS (SQL or non-SQL as the case may be).</li>
<li>Using a specialty system such as MarkLogic (or of course, in some cases, an enterprise search engine).</li>
</ul>
<p>And that&#8217;s even before we move on to<em> analytically-oriented text data management.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/10/10/text-data-management-part-2-general-and-short-request/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Text data management, Part 1: Confusion</title>
		<link>http://www.dbms2.com/2011/10/10/text-data-management-confusion/</link>
		<comments>http://www.dbms2.com/2011/10/10/text-data-management-confusion/#comments</comments>
		<pubDate>Tue, 11 Oct 2011 01:58:03 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Archiving and information preservation]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5421</guid>
		<description><![CDATA[This is Part 1 of a three post series. The posts cover: Confusion about text data management. Choices for text data management (general and short-request). Choices for text data management (analytic). There&#8217;s much confusion about the management of text data, among technology users, vendors, and investors alike. Reasons seem to include: The terminology around text [...]]]></description>
			<content:encoded><![CDATA[<p><em>This is Part 1 of a three post series. The posts cover:</em></p>
<ol>
<li> <em><a href="http://www.dbms2.com/2011/10/10/text-data-management-confusion/">Confusion about text data management</a>.</em></li>
<li><em><a href="http://www.dbms2.com/2011/10/10/text-data-management-part-2-general-and-short-request/">Choices for text data management (general and short-request)</a>.</em></li>
<li><em><a href="http://www.dbms2.com/2011/10/10/text-data-management-part-3-analytic-and-progressively-enhanced/">Choices for text data management (analytic)</a>.</em></li>
</ol>
<p>There&#8217;s much confusion about the management of text data, among technology users, vendors, and investors alike. Reasons seem to include:</p>
<ul>
<li>The terminology around text data is inaccurate.</li>
<li>Data volume estimates for text are misleading.</li>
<li>Multiple different technologies are in the mix, including:
<ul>
<li>Enterprise text search.</li>
<li>Text analytics &#8212; <a href="http://www.texttechnologies.com/category/software-as-a-service-saas/category/text-mining/">text mining</a>, sentiment analysis, etc.</li>
<li>Document stores &#8212; e.g. document-oriented NoSQL, or MarkLogic.</li>
<li>Log management and parsing &#8212; e.g. Splunk.</li>
<li>Text archiving &#8212; e.g., various specialty email archiving products I couldn&#8217;t even name.</li>
<li>Public web search &#8212; Google et al.</li>
</ul>
</li>
<li>Text search vendors have disappointed, especially technically.</li>
<li>Text analytics vendors have disappointed, especially financially.</li>
<li>Other analytic technology vendors ignore <a href="http://www.texttechnologies.com/2010/12/01/state-of-the-art-text-analytics-mining-applications/">what the text analytic vendors actually have accomplished</a>, and reinvent inferior wheels rather than OEM the state of the art.</li>
</ul>
<p>Above all: <a href="http://www.dbms2.com/2011/10/10/text-data-management-part-2-general-and-short-request/">The use cases for text data vary greatly</a>, just as the use cases for simply-structured databases do.</p>
<p>There are probably fewer people now than there were six years ago who need to be told that <a href="http://www.dbms2.com/2005/12/09/relational-dbms-versus-text-data/">text and relational database management are very different things</a>. Other misconceptions, however, appear to be on the rise. Specific points that are commonly overlooked include: <span id="more-5421"></span></p>
<ul>
<li><strong> The terms &#8220;unstructured&#8221; or &#8220;semi-structured&#8221; data are inherently misleading</strong>. That&#8217;s why <a href="../../../../../2011/05/17/poly-structured-database/">I favor &#8220;multi-structured&#8221; or &#8220;poly-structured&#8221; instead</a>. (&#8220;Multi-structured&#8221; seems to be winning; e.g., it&#8217;s been adopted by Teradata and Teradata/Aster.)</li>
<li>The &#8220;social media&#8221; text data any one enterprise brings in house isn&#8217;t all that much. For example, <a href="../../../../../2011/04/14/attensity-update/">Attensity serves many different enterprises&#8217; social media needs from a single 20-terabyte data store</a>, and reports that no single enterprise has required as much as 1 terabyte of text yet. <strong>Text data may consume a lot of storage </strong>on spinning disks somewhere,<strong> but it&#8217;s not that big a factor in future DBMS industry growth.</strong> (That 20 terabyte figure does seem low.)</li>
<li><strong>Structured databases are typically worth a lot more per bit than other kinds.</strong> The most valuable electronic data, per-bit, is probably records of significant economic transactions &#8212; purchases, sales, money transfers, etc. The least valuable may be sensor log files, whose contents consist mainly of &#8220;Nothing going on here; ping you again in a minute.&#8221; Email logs, web interaction data and many other kinds fall somewhere in between. Highly valuable documents &#8212; such as signed contracts &#8212; generally persist in paper as well as electronic forms. <strong>Investors commonly overlook this point.</strong></li>
<li><strong>The enterprise text search industry is screwed up.</strong>
<ul>
<li>FAST was a goofy company before it was acquired for far too much money by Microsoft.</li>
<li>Autonomy was a goofy company before it was acquired for far too much money by HP.</li>
<li>Google&#8217;s enterprise efforts are quiet.</li>
<li>The integration of text search and relational DBMS &#8212; e.g. at Oracle &#8212; has languished, with poor performance and evident lack of management attention.</li>
<li>Smaller text search vendors don&#8217;t seem to be getting a lot of traction &#8212; e.g., <a href="http://www.texttechnologies.com/category/vendors/coveo/">Coveo</a> has a decent reputation, but when&#8217;s the last time you heard much about them? What has Attivio actually accomplished?</li>
</ul>
</li>
<li><strong>Text analytics is a small business</strong>. Add up the revenue for Attensity, Clarabridge, Lexalytics, Temis, and all the others, and you might poke above $100 million, especially now that Attensity has had a three-way merger. Then again, you might not.</li>
<li>Even so, <strong>the text analytics vendors have developed sophisticated technology.</strong> In particular, you can use it to get a pretty good idea as to what people are writing about you, individually or as groups.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/10/10/text-data-management-confusion/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Defining NoSQL</title>
		<link>http://www.dbms2.com/2011/10/02/defining-nosql/</link>
		<comments>http://www.dbms2.com/2011/10/02/defining-nosql/#comments</comments>
		<pubDate>Mon, 03 Oct 2011 00:32:02 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Object]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Schooner Information Technology]]></category>
		<category><![CDATA[dbShards and CodeFutures]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5394</guid>
		<description><![CDATA[A reporter tweeted: &#8220;Is there a simple plain English definition for NoSQL?&#8221; After reminding him of my cynical yet accurate Third Law of Commercial Semantics, I gave it a serious try, and came up with the following. More precisely, I tweeted the bolded parts of what&#8217;s below; the rest is commentary added for this post. NoSQL [...]]]></description>
			<content:encoded><![CDATA[<p>A reporter tweeted: &#8220;Is there a simple plain English definition for NoSQL?&#8221; After reminding him of my cynical yet accurate <a href="http://www.strategicmessaging.com/no-market-categorization-is-ever-precise/2011/03/01/">Third Law of Commercial Semantics</a>, I gave it a serious try, and came up with the following. More precisely, I tweeted the bolded parts of what&#8217;s below; the rest is commentary added for this post.</p>
<p><strong>NoSQL is most easily defined by what it excludes: SQL, joins, strong analytic alternatives to those, and some forms of database integrity. If you leave all four out, and you have a strong scale-out story, you&#8217;re in the NoSQL mainstream.</strong>   <span id="more-5394"></span></p>
<ul>
<li>Thus, I&#8217;d say Cassandra, HBase, MongoDB, and Couchbase are prime examples, in no particular order. Riak as well.</li>
<li>I might have phrased that better if I&#8217;d used a different word than simply &#8220;strong&#8221; &#8212; but hey, there was a 140-character limit, and he was on deadline.</li>
</ul>
<p><strong>Using NoSQL can make sense when at least one of two things is paramount: low-cost scale-out or dynamic schemas.</strong></p>
<ul>
<li>There are some seriously sensible use cases for <a href="../../../../../2011/07/31/dynamic-fixed-schema-databases/">dynamic schemas</a>.</li>
<li>&#8220;Low-cost&#8221; generally boils down to:
<ul>
<li>Performance.</li>
<li>Open source free-like-beer.</li>
<li>Not a lot of database administration.</li>
</ul>
</li>
</ul>
<p>I&#8217;ve generally given object-oriented DBMS vendors and also MarkLogic hard times whenever they consider saying they&#8217;re &#8220;NoSQL&#8221;. Reasons include:</p>
<ul>
<li>Closed source.</li>
<li>Database administration overhead (even if you get good stuff for incurring that overhead, like MarkLogic&#8217;s comprehensive indexing).</li>
</ul>
<p>Also, NoSQL started out being ACID-unfriendly.</p>
<p><strong>What you give up are the query flexibility and the easily automatic data integrity of SQL-based systems.</strong> I should have added something about a mature ecosystem.</p>
<p>In the most recent live example, I influenced a <a href="../../../../../2011/09/19/oltp-disk-solid-state/">client</a> away from Cassandra and toward scale-out MySQL (dbShards and/or Schooner flavors, most likely). Part of the reason was the ability to do joins, which are useful in their application. Another part is that their development practices obviated any significant benefit from dynamic schemas. But perhaps the most important &#8212; or at least resonant &#8212; reason of all was that they really, really cared about .NET support.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/10/02/defining-nosql/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Oracle NoSQL is unlikely to be a big deal</title>
		<link>http://www.dbms2.com/2011/09/30/oracle-nosql/</link>
		<comments>http://www.dbms2.com/2011/09/30/oracle-nosql/#comments</comments>
		<pubDate>Fri, 30 Sep 2011 18:20:53 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[MySQL]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Object]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Schooner Information Technology]]></category>
		<category><![CDATA[Structured documents]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5384</guid>
		<description><![CDATA[Alex Williams noticed that there will be a NoSQL session at Oracle OpenWorld next week, and is wondering whether this will be a big deal. I think it won&#8217;t be. There really are three major points to NoSQL. Dynamic schemas. This is the only one of the three that truly depends on NoSQL. Scale-out short-request [...]]]></description>
			<content:encoded><![CDATA[<p>Alex Williams noticed that there will be a NoSQL session at Oracle OpenWorld next week, and is wondering whether this will be a big deal. I think it won&#8217;t be.</p>
<p>There really are three major points to NoSQL.</p>
<ul>
<li><strong><a href="http://www.dbms2.com/2011/07/31/dynamic-fixed-schema-databases/">Dynamic schemas</a>.</strong> This is the only one of the three that truly depends on NoSQL.</li>
<li><strong>Scale-out <a href="http://www.dbms2.com/2011/03/30/short-request-and-analytic-processing/">short-request processing</a>.</strong> If you want to scale out efficiently at high request volumes, you&#8217;re best off not using all the flexibility SQL/relational DBMS offer. (In particular, you don&#8217;t want to do cross-node joins). Not coincidentally, a number of the best scale-out offerings were built to be NoSQL.</li>
<li><strong>Open source</strong>. Doing a relational DBMS is a big project. It seems easier to build NoSQL ones.</li>
</ul>
<p>Oracle can address the latter two points as aggressively as it wishes via MySQL. It so happens I would generally recommend MySQL enhanced by dbShards, Schooner, and/or dbShards/Schooner, rather than Oracle-only MySQL &#8230; but that&#8217;s a detail. In some form or other, Oracle&#8217;s MySQL is a huge player in the scale-out, open source, short-request database management market.</p>
<p>So that leaves us with dynamic schemas. Oracle has at least four different sets of technology in that area:</p>
<ul>
<li> As <a href="http://www.dbms2.com/2010/08/22/workday-technology-stack/">Workday</a> noticed years ago, MySQL can be used as a functional, basic key-value store.</li>
<li>Oracle also has XML-based Berkeley DB/SleepyCat kicking around.*</li>
<li>The XML extensions to Oracle&#8217;s core DBMS could be alleged to have a dynamic schema/NoSQL flavor. (Blech.)</li>
<li>A dynamic schema argument could also be made for object-oriented DBMS technology. While Oracle doesn&#8217;t to my knowledge exactly sell that, it does have the <a href="http://www.dbms2.com/2007/03/25/oracle-tangosol-objects-caching-and-disruption/">Tangosol</a> Coherence line of technology, with a potentially similar programming model.</li>
</ul>
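<p>As a rough illustration of the first bullet, here is what using a relational table as a basic key-value store can look like. This is a generic sketch, not a description of how Workday or anybody else does it: SQLite stands in for MySQL so the example is self-contained, and the table name, class name and JSON encoding are assumptions.</p>

```python
import json
import sqlite3


class KeyValueStore:
    """A key-value store on one two-column table; values are opaque JSON text."""

    def __init__(self, conn):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT NOT NULL)"
        )

    def put(self, key, value):
        # INSERT OR REPLACE is SQLite syntax; on MySQL one would use
        # REPLACE INTO or INSERT ... ON DUPLICATE KEY UPDATE instead.
        self.conn.execute(
            "INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)",
            (key, json.dumps(value)),
        )

    def get(self, key):
        row = self.conn.execute(
            "SELECT v FROM kv WHERE k = ?", (key,)
        ).fetchone()
        return None if row is None else json.loads(row[0])


store = KeyValueStore(sqlite3.connect(":memory:"))
store.put("worker:1", {"name": "Ada", "badge": 7})
# A new attribute appears without any ALTER TABLE: the schema lives in the value.
store.put("worker:1", {"name": "Ada", "badge": 7, "team": "db"})
```

<p>The DBMS sees only a key and a string, so the application gets dynamic schemas and gives up SQL&#8217;s ability to query or constrain what is inside the value.</p>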
<p>If Oracle is now refreshing and rebranding one or more of these as &#8220;NoSQL&#8221;, there&#8217;s no reason to view that as a big deal at all.</p>
<p><em>*That&#8217;s Mike Olson&#8217;s former company, if you&#8217;re keeping score at home.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/09/30/oracle-nosql/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>The database architecture of salesforce.com, force.com, and database.com</title>
		<link>http://www.dbms2.com/2011/09/15/database-architecture-salesforce-com-force-com-and-database/</link>
		<comments>http://www.dbms2.com/2011/09/15/database-architecture-salesforce-com-force-com-and-database/#comments</comments>
		<pubDate>Thu, 15 Sep 2011 16:09:32 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Memory-centric data management]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Object]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[salesforce.com]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5237</guid>
		<description><![CDATA[salesforce.com, force.com, and database.com use exactly the same database infrastructure and architecture. That&#8217;s the good news. The bad news is that salesforce.com is somewhat obscure about technical details, for reasons such as: A long-ago marketing decision to not give infrastructure details, so as to convey a &#8220;Don&#8217;t worry; we&#8217;ll take care of everything&#8221; message. Even [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.dbms2.com/2011/09/15/salesforce-force-database-data-heroku/">salesforce.com, force.com, and database.com use exactly the same database infrastructure and architecture</a>. That&#8217;s the good news. The bad news is that salesforce.com is somewhat obscure about technical details, for reasons such as:</p>
<ul>
<li>A long-ago marketing decision to not give infrastructure details, so as to convey a &#8220;Don&#8217;t worry; we&#8217;ll take care of everything&#8221; message.</li>
<li>Even so, a long-ago and perhaps now-regretted marketing decision to disclose and even exaggerate salesforce.com&#8217;s reliance on Oracle, as part of an early-days attempt to prove salesforce was using enterprise-class technology.</li>
<li>A desire to hide the recipe for salesforce.com&#8217;s secret sauce.</li>
<li>Force of habit &#8212; I&#8217;m not sure salesforce even knows how to tell its technical story with any clarity.</li>
</ul>
<p>Actually, salesforce.com has moved some kinds of data out of Oracle that used to be stored there. Besides Oracle, salesforce uses at least a file system and a RAM-based data store about which I have no details. Even so, much of salesforce.com&#8217;s data is stored in Oracle &#8212; a single instance of Oracle, which it believes may be the largest instance of Oracle in the world.</p>
<p><span id="more-5237"></span>Salesforce did spell out some of its database story in <a href="http://www.salesforce.com/au/assets/pdf/Force.com_Multitenancy_WP_101508.pdf">a 2008 force.com white paper</a>,<em> </em>which is good stuff, but potentially misleading in one important way. The paper tells of a level of abstraction, whereby what the application sees as logical &#8220;columns&#8221; are stored in a very different schema than one might assume. However, it doesn&#8217;t spell out a second level of abstraction, whereby that logical schema also isn&#8217;t how the database is actually laid out.</p>
<p><em>Another flaw in the paper is that it spins &#8220;We had to do this, to support multitenancy, so we did.&#8221; issues as &#8220;Because we&#8217;re multitenant, we can do this, while single-tenant systems can&#8217;t.&#8221; One example is the query optimization step around &#8220;user visibility&#8221; in Figure 11. Welcome to marketing.</em></p>
<p>At the first level of abstraction, data seems to be kept mainly in a single wide table, with hundreds of columns. What&#8217;s more, many of those are &#8220;flex columns&#8221;; a flex column can hold data of many different kinds and even datatypes. Notwithstanding the second level of abstraction, I imagine the idea of stuffing different kinds of thing into the same column has something to do with the fact that <a href="../../../../../2011/03/13/so-how-many-columns-can-a-single-table-have-anyway/">Oracle&#8217;s physical limit on columns</a> falls far short of the number of logical columns salesforce wants to use.</p>
<p>If we imagine that the different kinds of data in a flex column were each in their own column instead, the whole thing might sound like BigTable/Cassandra/HBase-style column-group NoSQL. Thus, much as <a href="../../../../../2010/08/22/workday-technology-stack/">Workday uses MySQL to simulate a key-value store</a>, salesforce.com can be said to use Oracle to simulate a different kind of NoSQL. In both cases, what&#8217;s going on seems to be a kind of object/relational mapping, but with the relational aspect strongly deemphasized. Or, if you take a more relational view, we could say that salesforce.com&#8217;s tables are a lot wider than any one user organization&#8217;s, because each user sees only its own custom columns (plus the standard ones common to all users).</p>
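<p>Here is a toy Python model of the flex-column idea, to show how one physical column can hold different kinds of data for different user organizations. Everything in it is an assumption made for illustration (the class, the slot count, storing values as text with a per-field type name); it is not salesforce.com&#8217;s actual layout.</p>

```python
FLEX_SLOTS = 4  # hypothetical; the real table is said to have hundreds of columns

CASTS = {"int": int, "float": float, "str": str}


class FlexTable:
    """One wide shared table; each tenant maps its logical fields onto flex slots."""

    def __init__(self):
        self.field_map = {}  # (tenant, field) -> (slot number, type name)
        self.rows = []       # each row: {"tenant": ..., "slots": [text or None]}

    def define_field(self, tenant, field, type_name):
        used = [slot for (t, _), (slot, _) in self.field_map.items() if t == tenant]
        if len(used) == FLEX_SLOTS:
            raise ValueError("tenant has no free flex columns")
        self.field_map[(tenant, field)] = (len(used), type_name)

    def insert(self, tenant, record):
        slots = [None] * FLEX_SLOTS
        for field, value in record.items():
            slot, _ = self.field_map[(tenant, field)]
            slots[slot] = str(value)  # every flex value is stored as text
        self.rows.append({"tenant": tenant, "slots": slots})

    def select(self, tenant):
        """Rebuild typed logical records for one tenant from the shared rows."""
        fields = {f: v for (t, f), v in self.field_map.items() if t == tenant}
        out = []
        for row in self.rows:
            if row["tenant"] != tenant:
                continue
            record = {}
            for field, (slot, type_name) in fields.items():
                raw = row["slots"][slot]
                if raw is not None:
                    record[field] = CASTS[type_name](raw)
            out.append(record)
        return out


table = FlexTable()
table.define_field("acme", "deal_size", "int")
table.define_field("globex", "region", "str")
table.insert("acme", {"deal_size": 500})
table.insert("globex", {"region": "EMEA"})
```

<p>Both custom fields land in slot 0, so the same physical column holds an integer for one organization and a region name for another, while each organization sees only its own logical column.</p>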
<p>The second layer of abstraction has a lot to do with multitenancy. If you want to stick data for many different user organizations into the same huge table, then you have to label it in some way to show who is permitted to see or update each part. Logically, this leads to a join, between one table carrying data plus a simple key showing which users/roles are entitled to see it, and a second table showing who actually is that kind of user/has that kind of role. But that join makes a lot of sense to store in a denormalized way, all the more because data is partitioned across the computer cluster in line with which user organization it actually belongs to.</p>
<p><em>Multitenant security isn&#8217;t the only reason for this denormalization, but it appears to be the biggest one.</em></p>
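<p>The visibility join and its denormalized counterpart can be sketched as follows, again in Python with SQLite. The table and column names are invented for the example, and a plain copied table stands in for whatever denormalized structure salesforce.com really uses, about which I have no details.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Data rows carry a simple key naming the role entitled to see them.
    CREATE TABLE records (id INTEGER, org TEXT, role_key TEXT, payload TEXT);
    -- A second table says which users actually hold each role.
    CREATE TABLE role_members (role_key TEXT, user_name TEXT);

    INSERT INTO records VALUES (1, 'acme', 'acme-sales', 'Q3 pipeline');
    INSERT INTO records VALUES (2, 'acme', 'acme-exec', 'Board deck');
    INSERT INTO records VALUES (3, 'globex', 'globex-sales', 'Lead list');
    INSERT INTO role_members VALUES ('acme-sales', 'alice');
    INSERT INTO role_members VALUES ('acme-exec', 'alice');
    INSERT INTO role_members VALUES ('acme-sales', 'bob');
    INSERT INTO role_members VALUES ('globex-sales', 'carol');

    -- Denormalized form: the join result is stored ahead of query time.
    CREATE TABLE visible_records AS
        SELECT m.user_name, r.org, r.id, r.payload
        FROM records r JOIN role_members m ON m.role_key = r.role_key;
    """
)


def visible_by_join(user):
    """Compute visibility at query time with the logical join."""
    rows = conn.execute(
        "SELECT r.id FROM records r "
        "JOIN role_members m ON m.role_key = r.role_key "
        "WHERE m.user_name = ? ORDER BY r.id",
        (user,),
    ).fetchall()
    return [row[0] for row in rows]


def visible_denormalized(user):
    """Read the precomputed answer; no join at query time."""
    rows = conn.execute(
        "SELECT id FROM visible_records WHERE user_name = ? ORDER BY id",
        (user,),
    ).fetchall()
    return [row[0] for row in rows]
```

<p>Both functions return the same rows; the denormalized version trades extra storage and maintenance work whenever roles change for a cheaper read, which is an easier trade to accept when each organization&#8217;s data sits together on the cluster anyway.</p>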
<p>The whole thing is doing 550 million or so transactions per day. salesforce.com thinks that fact should be regarded as evidence that it works. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/09/15/database-architecture-salesforce-com-force-com-and-database/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
	</channel>
</rss>

