<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; Log analysis</title>
	<atom:link href="http://www.dbms2.com/category/applications/log-file-logfile-analysis/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Thu, 09 Feb 2012 09:21:51 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>Sumo Logic and UIs for text-oriented data</title>
		<link>http://www.dbms2.com/2012/02/06/sumo-logic-and-uis-for-text-oriented-data/</link>
		<comments>http://www.dbms2.com/2012/02/06/sumo-logic-and-uis-for-text-oriented-data/#comments</comments>
		<pubDate>Mon, 06 Feb 2012 13:27:06 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5897</guid>
		<description><![CDATA[I talked with the Sumo Logic folks for an hour Thursday. Highlights included: Sumo Logic does SaaS (Software as a Service) log management. Sumo Logic is text indexing/Lucene-based. Thus, it is reasonable to think of Sumo Logic as &#8220;Splunk-like&#8221;. (However, Sumo Logic seems to have a stricter security/trouble-shooting orientation than Splunk, which is trying to [...]]]></description>
			<content:encoded><![CDATA[<p>I talked with the Sumo Logic folks for an hour Thursday. Highlights included:</p>
<ul>
<li>Sumo Logic does SaaS (Software as a Service) log management.</li>
<li>Sumo Logic is text indexing/Lucene-based. Thus, it is reasonable to think of Sumo Logic as &#8220;Splunk-like&#8221;. (However, Sumo Logic seems to have a stricter security/trouble-shooting orientation than Splunk, which is trying to <a href="../../../../../2012/01/10/splunk-update/">branch out</a>.)</li>
<li>Sumo Logic has hacked Lucene for faster indexing, and says 10-30 second latencies are typical.</li>
<li>Sumo Logic&#8217;s main differentiation is <strong>automated classification of events. </strong></li>
<li>There&#8217;s some kind of streaming engine in the mix, to update counters and drive alerts.</li>
<li>Sumo Logic has around 30 &#8220;customers,&#8221; free (mainly) or paying (around 5) as the case may be.</li>
<li>A truly typical Sumo Logic customer has single to low double digits of gigabytes of log data per day. However, Sumo Logic seems highly confident in its ability to handle a terabyte per customer per day, give or take a factor of 2.</li>
<li>When I asked about the implications of shipping that much data to a remote data center, Sumo Logic observed that log data compresses really well.</li>
<li>Sumo Logic recently raised a bunch of venture capital.</li>
<li>Sumo Logic&#8217;s founders are out of ArcSight, a log management company HP paid a bunch of money for.</li>
<li>Sumo Logic coined a marketing term &#8220;LogReduce&#8221;, but it has nothing to do with &#8220;MapReduce&#8221;. Sumo Logic seems to find this amusing.</li>
</ul>
<p>What interests me about Sumo Logic is that automated classification story. I thought I heard Sumo Logic say:<span id="more-5897"></span></p>
<ul>
<li>It&#8217;s largely unsupervised machine learning.</li>
<li>It&#8217;s specific to a particular user/data set.</li>
<li>It can be up and running and classifying things effectively almost instantly (i.e., on seconds&#8217; or minutes&#8217; worth of data).</li>
<li>It&#8217;s informed by what different users tag as false positives. (Or maybe that is planned for future versions.)</li>
</ul>
<p><em>I have a little trouble seeing how all those points fit exactly together, so perhaps I got some details wrong.</em></p>
<p>The payoff is that <strong>machine learning directly informs the Sumo Logic user interface</strong>. In particular, large numbers of events are bundled into a small number of categories, hopefully making it much easier for network operations types to scan the UI and pick out what&#8217;s important.</p>
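<p>To make the bundling idea concrete, here is a minimal sketch of one simple way to collapse many log lines into a few categories: mask the variable-looking tokens and count the resulting templates. This is purely illustrative and is not a description of how Sumo Logic does it; the masking rules, the <code>signature</code> and <code>bundle</code> helpers, and the sample lines are all assumptions made up for the example.</p>

```python
import re
from collections import Counter

# Hypothetical masking rules: replace variable-looking tokens with placeholders.
# Order matters: IP addresses are masked before bare numbers.
_MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "{IP}"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "{HEX}"),
    (re.compile(r"\b\d+\b"), "{NUM}"),
]

def signature(line):
    """Reduce a log line to a template by masking its variable parts."""
    for pattern, placeholder in _MASKS:
        line = pattern.sub(placeholder, line)
    return line.strip()

def bundle(lines):
    """Count how many raw lines fall under each template."""
    return Counter(signature(line) for line in lines)

# Made-up sample log lines
sample = [
    "Connection from 10.0.0.5 closed after 120 ms",
    "Connection from 10.0.0.9 closed after 98 ms",
    "Disk usage at 91 percent on volume 3",
    "Connection from 192.168.1.20 closed after 7 ms",
]

groups = bundle(sample)
for template, count in groups.most_common():
    print(count, template)
```

<p>Four raw lines collapse into two templates here, which is the effect an operations person scanning a UI would care about. A real classifier would presumably learn its templates rather than rely on hand-written masks.</p>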
<p>In general, the idea of machine-learning informing analytic UIs via some sort of classification is common in text-oriented technologies, notably in:</p>
<ul>
<li>Good ol&#8217; text search.</li>
<li>Text mining vendors&#8217; approaches to clustering hits on words or phrases that say substantially the same thing.</li>
</ul>
<p>But otherwise it seems kind of rare, if we stipulate that ad-serving/general internet personalization isn&#8217;t really an analytic UI &#8212; but I&#8217;d love to hear of any interesting examples I&#8217;ve overlooked.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/02/06/sumo-logic-and-uis-for-text-oriented-data/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Splunk update</title>
		<link>http://www.dbms2.com/2012/01/10/splunk-update/</link>
		<comments>http://www.dbms2.com/2012/01/10/splunk-update/#comments</comments>
		<pubDate>Tue, 10 Jan 2012 05:55:08 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Structured documents]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5791</guid>
		<description><![CDATA[Splunk is announcing the Splunk 4.3 point release. Before discussing it, let&#8217;s recall a few things about Splunk, starting with: Splunk is first and foremost an analytic DBMS &#8230; &#8230; used to manage logs and similar multistructured data. Splunk&#8217;s DML (Data Manipulation Language) is based on text search, not on SQL. Splunk has extended its [...]]]></description>
			<content:encoded><![CDATA[<p>Splunk is announcing the Splunk 4.3 point release. Before discussing it, let&#8217;s recall a few things about Splunk, starting with:</p>
<ul>
<li>Splunk is first and foremost an analytic DBMS &#8230;</li>
<li>&#8230; used to manage logs and similar multistructured data.</li>
<li>Splunk&#8217;s DML (Data Manipulation Language) is based on text search, not on SQL.</li>
<li>Splunk has extended its DML in natural ways (e.g., you can use it to do calculations and even some statistics).</li>
<li>Splunk bundles some (very) basic, Splunk-specific business intelligence capabilities.</li>
<li>The paradigmatic use of Splunk is to monitor IT operations in real time. However:
<ul>
<li>There also are plenty of non-real-time uses for Splunk.</li>
<li>Splunk is proudest of its growth in non-IT quasi-real-time uses, such as the marketing side of web operations.</li>
</ul>
</li>
</ul>
<p>As in any release, a lot of Splunk 4.3 is about &#8220;Oh, you didn&#8217;t have that before?&#8221; features and <a href="../../../../../2009/08/21/bottleneck-whack-a-mole/">Bottleneck Whack-A-Mole</a> performance speed-ups. One performance enhancement is Bloom filters, which are a very hot topic these days. More important is a switch from Flash to HTML5, so as to accommodate mobile devices with less server-side rendering. Splunk reports that its users &#8212; especially the non-IT ones &#8212; really want to get Splunk information on tablet devices. While this somewhat contradicts <a href="../../../../../2012/01/04/some-issues-in-business-intelligence/">what I wrote a few days ago pooh-poohing mobile BI</a>, let me hasten to point out:</p>
<ul>
<li>Splunk is used for a lot of (quasi) real-time monitoring.</li>
<li>Splunk&#8217;s desktop user interfaces are, by BI standards, quite primitive.</li>
</ul>
<p>That&#8217;s pretty much the ideal scenario for mobile BI: Timeliness matters and prettiness doesn&#8217;t.</p>
<p><span id="more-5791"></span><em>Hmm. Maybe <a href="../../../../../2011/11/10/streambase-liveview-push-based-real-time-bi/">StreamBase LiveView</a> needs a mobile option as well &#8230;</em></p>
<p>Splunk&#8217;s basic use is to take the text string that is a log and make sense of it. But Splunk now also supports JSON structures. It does this via something called spath, which, as you might guess from the name, has XPath similarities. That probably deserved more discussion than we found the time for.</p>
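<p>For readers who have not met path-style extraction before, the general idea can be sketched in a few lines of Python. This is not Splunk&#8217;s spath syntax or implementation; the <code>extract</code> helper, the dotted-path convention, and the sample event are assumptions invented for illustration.</p>

```python
import json

def extract(document, path):
    """Walk a dotted path such as 'request.headers.0.value' through parsed JSON.

    Numeric steps index into lists; all other steps are treated as object keys.
    """
    current = document
    for step in path.split("."):
        if isinstance(current, list):
            current = current[int(step)]
        else:
            current = current[step]
    return current

# Made-up JSON log event
raw = (
    '{"request": {"method": "GET", '
    '"headers": [{"name": "Host", "value": "example.com"}]}, '
    '"status": 200}'
)
event = json.loads(raw)

method = extract(event, "request.method")
host = extract(event, "request.headers.0.value")
print(method, host)
```

<p>The point is simply that a path expression lets a text-search-style query language reach into nested fields without a predeclared schema.</p>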
<p><em>By the way: If you&#8217;re interested in BI over XML, that&#8217;s what my former clients at Skytide were founded to do, before they pivoted a bit. I don&#8217;t think those capabilities have disappeared from the product</em>.</p>
<p><a href="http://www.monash.com/uploads/Splunk-4-3.pdf">Splunk has graciously allowed me to post a slide deck</a>. More stuff in there, including quotes from a customer &#8212; Expedia &#8212; that has 2700 Splunk users.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/01/10/splunk-update/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Big data terminology and positioning</title>
		<link>http://www.dbms2.com/2012/01/08/big-data-terminology-and-positioning/</link>
		<comments>http://www.dbms2.com/2012/01/08/big-data-terminology-and-positioning/#comments</comments>
		<pubDate>Mon, 09 Jan 2012 01:35:57 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5768</guid>
		<description><![CDATA[Recently, I observed that Big Data terminology is seriously broken. It is reasonable to reduce the subject to two quasi-dimensions: Bigness &#8212; Volume, Velocity, size Structure &#8212; Variety, Variability, Complexity given that High-velocity &#8220;big data&#8221; problems are usually high-volume as well.* Variety, variability, and complexity all relate to the simply-structured/poly-structured distinction. But the conflation should [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, I observed that <a href="../../../../../2011/09/11/big-data-has-jumped-the-shark/">Big Data terminology is seriously broken</a>. It is reasonable to reduce the subject to two quasi-dimensions:</p>
<ul>
<li><strong>Bigness</strong> &#8212; Volume, Velocity, size</li>
<li><strong>Structure</strong> &#8212; Variety, Variability, Complexity</li>
</ul>
<p>given that</p>
<ul>
<li>High-velocity &#8220;big data&#8221; problems are usually high-volume as well.*</li>
<li>Variety, variability, and complexity all relate to the <a href="../../../../../2011/05/17/poly-structured-database/">simply-structured/poly-structured</a> distinction.</li>
</ul>
<p>But the conflation should stop there.</p>
<p><em>*Low-volume/high-velocity problems are commonly referred to as <a href="../2011/08/25/renaming-cep-or-not/">&#8220;event processing&#8221; and/or &#8220;streaming&#8221;</a>.</em></p>
<p>When people claim that bigness and structure are the same issue, they oversimplify into mush. So I think we need four pieces of terminology, reflective of a 2&#215;2 matrix of possibilities. For want of better alternatives, my suggestions are:</p>
<ul>
<li><strong>Relational big data</strong> is data of high volume that fits well into a relational DBMS.</li>
<li><strong>Multi-structured big data</strong> is data of high volume that doesn&#8217;t fit well into a relational DBMS. <em>Alternative: Poly-structured big data.</em></li>
<li><strong>Conventional relational data</strong> is data of not-so-high volume that fits well into a relational DBMS. <em>Alternatives: Ordinary/normal/smaller relational data.</em></li>
<li><strong>Smaller poly-structured data</strong> is data for which <a href="../../../../../2011/07/31/dynamic-fixed-schema-databases/">dynamic schema</a> capabilities are important, but which doesn&#8217;t rise to &#8220;big data&#8221; volume.</li>
</ul>
<p><span id="more-5768"></span>Notes on all this include:</p>
<ul>
<li>&#8220;Relational big data&#8221; is commonly what you need a scalable analytic relational DBMS for. But there are non-analytic use cases as well.</li>
<li>The paradigmatic example of &#8220;multi-structured big data&#8221; is log files. Thus, multi-structured big data is commonly what you need a <a href="../../../../../2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a> for.</li>
<li>One might want to equate non-analytic relational big data technology to &#8220;NewSQL&#8221;. However, I&#8217;m struggling to think of a database size range in which the entire NewSQL industry can match Oracle&#8217;s market share alone.</li>
<li>One might want to equate non-analytic multi-structured big data technology to &#8220;NoSQL&#8221;. However:
<ul>
<li>&#8220;NoSQL&#8221; is also used to encompass not-so-big-data use cases, such as prototyping in MongoDB.</li>
<li><a href="../../../../../2011/10/02/defining-nosql/">&#8220;NoSQL&#8221; has non-ACID/low(er)-data-integrity connotations</a> that aren&#8217;t appropriate for all non-relational systems.</li>
</ul>
</li>
<li>Up to a point, you can analyze relational big data in a conventional relational DBMS, but an analytic RDBMS will usually win on TCO (Total Cost of Ownership). In particular, reasonable thresholds for moving an analytic database off Oracle might be:
<ul>
<li>1-2 terabytes if you&#8217;ve never bought anything past Oracle Standard Edition.</li>
<li>5-10 terabytes if you&#8217;re already paying for Oracle Enterprise Edition.</li>
<li>A lot higher than that if you actually find Oracle Exadata to be cost-effective.</li>
</ul>
</li>
<li>Depending on how big one acknowledges as &#8220;big&#8221;, the market share leader in &#8220;big bit bucket&#8221; use cases is either Splunk or Hadoop.</li>
<li>If we look at multi-structured big data management overall, MarkLogic joins the list of market share contenders, as do various NoSQL alternatives.</li>
<li>It is wrong to say that the large web companies invented &#8220;big data&#8221; technology. But it is more reasonable to say they invented much of &#8220;multi-structured big data&#8221; management. In particular (and this is just a partial list), Google, Amazon, Yahoo, Facebook, et al. can reasonably be credited with Hadoop, Cassandra, HBase and various predecessors to same.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/01/08/big-data-terminology-and-positioning/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>What those nested data structures are about</title>
		<link>http://www.dbms2.com/2011/10/19/nested-data-structures/</link>
		<comments>http://www.dbms2.com/2011/10/19/nested-data-structures/#comments</comments>
		<pubDate>Wed, 19 Oct 2011 17:29:59 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5506</guid>
		<description><![CDATA[As I&#8217;ve noted before, the very big web companies have an issue with nested data structures. The subject came up in XLDB talks yesterday too, so my big goal for lunch was to finally understand what was being talked about. Sitting at a table full of eBay and LinkedIn folks turned out to be a [...]]]></description>
			<content:encoded><![CDATA[<p>As I&#8217;ve noted before, <a href="http://www.dbms2.com/2010/07/31/nested-data-structures-keep-coming-up-especially-for-log-files/">the very big web companies have an issue with nested data structures</a>. The subject came up in XLDB talks yesterday too, so my big goal for lunch was to finally understand what was being talked about. Sitting at a table full of eBay and LinkedIn folks turned out to be a good tactic.</p>
<p>The explanation was led by Oliver Ratzesberger, late of eBay* and progenitor of <a href="http://www.dbms2.com/2010/10/06/ebay-followup-greenplum-out-teradata-10-petabytes-hadoop-has-some-value-and-more/">eBay&#8217;s Singularity project</a>. In simplest terms, <strong>one event can spawn a lot of event attribute information, perhaps in the form of name-value pairs,</strong> which it then makes sense to store together in some way. The example Oliver dwelled on was that, on any given web page, there can be 100+ pieces of information to record, including:</p>
<ul>
<li>All 50 search results you were shown, and their positions in the search rankings.</li>
<li>Every ad, image, or graphical element.</li>
<li>An ID as to which test you were participating in (every page you see on eBay has some element being tested).</li>
</ul>
<p><em>*Oliver is leaving eBay for a still-secret large company. I would conjecture that Michael McIntire is on the move too, either to replace Oliver or to go with him, but Oliver did a very good job of not commenting on the matter.</em></p>
<p>There are several reasons why one might wish to store this information in ways that grieve relational purists. First, reconstructing all this information via joins would be brutally expensive. What&#8217;s more, reconstructing all this information via joins could be impractical. Some of it comes from third-party ad servers, which might not reproduce the same ads upon demand. Some is in the form of rankings, which can&#8217;t always be reliably reproduced from one query to the next. (That&#8217;s just one of several reasons <a href="http://www.dbms2.com/2005/12/09/relational-dbms-versus-text-data/">text search and relational DBMS are an awkward fit</a>.)</p>
<p>Also, there&#8217;s a strong <a href="http://www.dbms2.com/2011/07/31/dynamic-fixed-schema-databases/">dynamic schema</a> flavor to these databases. The list of attributes for one web click might be very different in kind from the list for the next page. Forcing that kind of variability into a fixed relational schema, while theoretically possible, doesn&#8217;t necessarily make a lot of sense.</p>
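<p>A small sketch may help show why the nested form is attractive. The record below is a made-up page-view event (the field names and values are assumptions, not eBay&#8217;s actual schema), and <code>flatten</code> is a hypothetical helper that emits the name-value rows a fully normalized layout would need for the same event.</p>

```python
# A made-up page-view event, stored as a single nested record.
page_view = {
    "event_id": "pv-1",       # hypothetical identifier
    "test_id": "layout-B",    # which experiment the page was part of
    "search_results": [
        {"position": 1, "item": "A100"},
        {"position": 2, "item": "B200"},
        {"position": 3, "item": "C300"},
    ],
    "ads": [{"slot": "top", "ad": "ad-77"}],
}

def flatten(event):
    """Emit one (event_id, attribute_path, value) row per leaf value."""
    rows = []

    def walk(prefix, value):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(prefix + [key], child)
        elif isinstance(value, list):
            for index, child in enumerate(value):
                walk(prefix + [str(index)], child)
        else:
            rows.append((event["event_id"], ".".join(prefix), value))

    for key, child in event.items():
        if key != "event_id":
            walk([key], child)
    return rows

rows = flatten(page_view)
print(len(rows), "rows for one event")
for row in rows:
    print(row)
```

<p>Even this toy event with three search results and one ad turns into nine rows; with 50 search results plus every ad and graphical element, a single page view would fan out into 100+ rows that have to be joined back together to see what the user actually saw.</p>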
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/10/19/nested-data-structures/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Text data management, Part 1: Confusion</title>
		<link>http://www.dbms2.com/2011/10/10/text-data-management-confusion/</link>
		<comments>http://www.dbms2.com/2011/10/10/text-data-management-confusion/#comments</comments>
		<pubDate>Tue, 11 Oct 2011 01:58:03 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Archiving and information preservation]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5421</guid>
		<description><![CDATA[This is Part 1 of a three post series. The posts cover: Confusion about text data management. Choices for text data management (general and short-request). Choices for text data management (analytic). There&#8217;s much confusion about the management of text data, among technology users, vendors, and investors alike. Reasons seem to include: The terminology around text [...]]]></description>
			<content:encoded><![CDATA[<p><em>This is Part 1 of a three post series. The posts cover:</em></p>
<ol>
<li><em><a href="http://www.dbms2.com/2011/10/10/text-data-management-confusion/">Confusion about text data management</a>.</em></li>
<li><em><a href="http://www.dbms2.com/2011/10/10/text-data-management-part-2-general-and-short-request/">Choices for text data management (general and short-request)</a>.</em></li>
<li><em><a href="http://www.dbms2.com/2011/10/10/text-data-management-part-3-analytic-and-progressively-enhanced/">Choices for text data management (analytic)</a>.</em></li>
</ol>
<p>There&#8217;s much confusion about the management of text data, among technology users, vendors, and investors alike. Reasons seem to include:</p>
<ul>
<li>The terminology around text data is inaccurate.</li>
<li>Data volume estimates for text are misleading.</li>
<li>Multiple different technologies are in the mix, including:
<ul>
<li>Enterprise text search.</li>
<li>Text analytics &#8212; <a href="http://www.texttechnologies.com/category/software-as-a-service-saas/category/text-mining/">text mining</a>, sentiment analysis, etc.</li>
<li>Document stores &#8212; e.g. document-oriented NoSQL, or MarkLogic.</li>
<li>Log management and parsing &#8212; e.g. Splunk.</li>
<li>Text archiving &#8212; e.g., various specialty email archiving products I couldn&#8217;t even name.</li>
<li>Public web search &#8212; Google et al.</li>
</ul>
</li>
<li>Text search vendors have disappointed, especially technically.</li>
<li>Text analytics vendors have disappointed, especially financially.</li>
<li>Other analytic technology vendors ignore <a href="http://www.texttechnologies.com/2010/12/01/state-of-the-art-text-analytics-mining-applications/">what the text analytic vendors actually have accomplished</a>, and reinvent inferior wheels rather than OEM the state of the art.</li>
</ul>
<p>Above all: <a href="http://www.dbms2.com/2011/10/10/text-data-management-part-2-general-and-short-request/">The use cases for text data vary greatly</a>, just as the use cases for simply-structured databases do.</p>
<p>There are probably fewer people now than there were six years ago who need to be told that <a href="http://www.dbms2.com/2005/12/09/relational-dbms-versus-text-data/">text and relational database management are very different things</a>. Other misconceptions, however, appear to be on the rise. Specific points that are commonly overlooked include: <span id="more-5421"></span></p>
<ul>
<li><strong> The terms &#8220;unstructured&#8221; or &#8220;semi-structured&#8221; data are inherently misleading</strong>. That&#8217;s why <a href="../../../../../2011/05/17/poly-structured-database/">I favor &#8220;multi-structured&#8221; or &#8220;poly-structured&#8221; instead</a>. (&#8220;Multi-structured&#8221; seems to be winning; e.g., it&#8217;s been adopted by Teradata and Teradata/Aster.)</li>
<li>The &#8220;social media&#8221; text data any one enterprise brings in house isn&#8217;t all that much. For example, <a href="../../../../../2011/04/14/attensity-update/">Attensity serves many different enterprises&#8217; social media needs from a single 20-terabyte data store</a>, and reports that no single enterprise has required as much as 1 terabyte of text yet. <strong>Text data may consume a lot of storage </strong>on spinning disks somewhere,<strong> but it&#8217;s not that big a factor in future DBMS industry growth.</strong> (That 20 terabyte figure does seem low.)</li>
<li><strong>Structured databases are typically worth a lot more per bit than other kinds.</strong> The most valuable electronic data, per-bit, is probably records of significant economic transactions &#8212; purchases, sales, money transfers, etc. The least valuable may be sensor log files, whose contents consist mainly of &#8220;Nothing going on here; ping you again in a minute.&#8221; Email logs, web interaction data and many other kinds fall somewhere in between. Highly valuable documents &#8212; such as signed contracts &#8212; generally persist in paper as well as electronic forms. <strong>Investors commonly overlook this point.</strong></li>
<li><strong>The enterprise text search industry is screwed up.</strong>
<ul>
<li>FAST was a goofy company before it was acquired for far too much money by Microsoft.</li>
<li>Autonomy was a goofy company before it was acquired for far too much money by HP.</li>
<li>Google&#8217;s enterprise efforts are quiet.</li>
<li>The integration of text search and relational DBMS &#8212; e.g. at Oracle &#8212; has languished, with poor performance and evident lack of management attention.</li>
<li>Smaller text search vendors don&#8217;t seem to be getting a lot of traction &#8212; e.g., <a href="http://www.texttechnologies.com/category/vendors/coveo/">Coveo</a> has a decent reputation, but when&#8217;s the last time you heard much about them? What has Attivio actually accomplished?</li>
</ul>
</li>
<li><strong>Text analytics is a small business</strong>. Add up the revenue for Attensity, Clarabridge, Lexalytics, Temis, and all the others, and you might poke above $100 million, especially now that Attensity had a three-way merger. Then again, you might not.</li>
<li>Even so, <strong>the text analytics vendors have developed sophisticated technology.</strong> In particular, you can use it to get a pretty good idea as to what people are writing about you, individually or as groups.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/10/10/text-data-management-confusion/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Hadoop notes</title>
		<link>http://www.dbms2.com/2011/09/12/hadoop-notes/</link>
		<comments>http://www.dbms2.com/2011/09/12/hadoop-notes/#comments</comments>
		<pubDate>Mon, 12 Sep 2011 09:03:52 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Health care]]></category>
		<category><![CDATA[Hortonworks]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapR]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5218</guid>
		<description><![CDATA[I visited California recently, and chatted with numerous companies involved in Hadoop &#8212; Cloudera, Hortonworks, MapR, DataStax, Datameer, and more. I&#8217;ll defer further Hadoop technical discussions for now &#8212; my target to restart them is later this month &#8212; but that still leaves some other issues to discuss, namely adoption and partnering. The total number [...]]]></description>
			<content:encoded><![CDATA[<p>I visited California recently, and chatted with numerous companies involved in Hadoop &#8212; Cloudera, Hortonworks, MapR, DataStax, Datameer, and more. I&#8217;ll defer further <a href="../../../../../2011/08/21/hadoop-evolution/">Hadoop technical discussions</a> for now &#8212; my target to restart them is later this month &#8212; but that still leaves some other issues to discuss, namely adoption and partnering.</p>
<p>The total number of enterprises in the world paying subscription and license fees that they would regard as being for &#8220;Hadoop or something Hadoop-related&#8221; probably is not much over 100 right now, but I&#8217;d expect to see pretty rapid growth. Beyond that, let&#8217;s divide customers into three groups:</p>
<ul>
<li>Internet businesses.</li>
<li>Traditional enterprises&#8217; internet operations.</li>
<li>Traditional enterprises&#8217; other operations.</li>
</ul>
<p>Hadoop vendors, in different mixes, claim to be doing well in all three segments. Even so, almost all use cases involve some kind of <a href="../../../../../2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a>, with one exception being a credit card vendor crunching a large database of transaction details. Multiple kinds of machine-generated data come into play &#8212; web/network/mobile device logs, financial trade data, scientific/experimental data, and more. In particular, pharmaceutical research got some mentions, which makes sense, in that it&#8217;s one area of scientific research that actually enjoys fat for-profit research budgets.</p>
<p><span id="more-5218"></span>On the partnering side, I heard things about a Hortonworks conference call that do not seem to have been contradicted by my visit to Hortonworks. Namely, Hortonworks promised prospective partners, such as analytic DBMS vendors, hardware vendors, or large system integrators, that it wouldn&#8217;t compete with them, in that Hortonworks pledges not to introduce its own products for at least two years. This is presumably targeted most directly at <a href="../../../../../2010/10/10/partnering-with-cloudera/">Cloudera</a>, which has lots of partners, but also some <a href="../../../../../2010/06/30/cloudera-enterprise-hadoop-evolution/">proprietary code</a> of its own. MapR, I&#8217;d think, would be the #2 target, but that&#8217;s just speculation.</p>
<p>The other big part of <a href="../../../../../2011/07/10/cloudera-and-hortonworks/">Hortonworks&#8217; story</a> is the claim that it holds the axe in Apache Hadoop development. Nobody doubts that a large fraction of the work on Hadoop&#8217;s core projects was done by Yahoo employees. Many of those indeed moved to Hortonworks; others left Yahoo earlier; Hadoop creator Doug Cutting is actually at Cloudera. So just how dominant Hortonworks really is in core Hadoop development is a bit unclear. Meanwhile, Cloudera people seem to be leading a number of Hadoop companion or sub-projects, including the first two I can think of that relate to Hadoop integration or connectivity, namely Sqoop and Flume. So I&#8217;m not persuaded that the &#8220;we know this stuff better&#8221; part of the Hortonworks partnering story really holds up.</p>
<p>What I am persuaded of is that the Hadoop platform competition is a good thing. Whichever vendors and projects win will be healthier from having had to outcompete worthy alternatives.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/09/12/hadoop-notes/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>MongoDB users and use cases</title>
		<link>http://www.dbms2.com/2011/07/27/mongodb-users-and-use-cases/</link>
		<comments>http://www.dbms2.com/2011/07/27/mongodb-users-and-use-cases/#comments</comments>
		<pubDate>Wed, 27 Jul 2011 18:14:36 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Games and virtual worlds]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MongoDB and 10gen]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5031</guid>
		<description><![CDATA[I spoke with Eliot Horowitz and Max Schierson of 10gen last month about MongoDB users and use cases. The biggest clusters they came up with weren&#8217;t much over 100 nodes, but clusters an order of magnitude bigger were under development. The 100 node one we talked the most about had 33 replica sets, each with [...]]]></description>
			<content:encoded><![CDATA[<p>I spoke with Eliot Horowitz and Max Schierson of 10gen last month about MongoDB users and use cases. The biggest clusters they came up with weren&#8217;t much over 100 nodes, but clusters an order of magnitude bigger were under development. The 100-node one we talked the most about had 33 replica sets, each with about 100 gigabytes of data, so that&#8217;s in the 3-4 terabyte range total. In general, the largest MongoDB databases are 20-30 TB; I&#8217;d guess those really do use the bulk of available disk space. <span id="more-5031"></span></p>
<p>10gen recommends solid-state storage in many cases. In some cases solid-state lets you get away with fewer total nodes. 10gen also likes Flashcache (Facebook-developed technology to put a flash cache in front of hard disks). But the 100-node example mentioned above uses spinning disk.</p>
<p>Use cases 10gen is proud of include:</p>
<ul>
<li>Lots of user profile maintenance, including at online ad companies. This includes full user ad impression data. (I&#8217;ve argued for a while that <a href="../../../../../2010/09/17/jp-morgan-chase-oracle-database-outage/">user profile information belongs in something like a NoSQL database</a>.)</li>
<li>A big-name web company that wants to inspect every packet that enters their network, and replaced Splunk with MongoDB for performance reasons.</li>
<li>A big-name photo/video site whose metadata is all in MongoDB. (That&#8217;s the kind of thing that often makes for good <a href="../../../../../2011/05/30/another-category-of-derived-data/">MarkLogic</a> use cases.)</li>
</ul>
<p>But actually, the reason we had the call was to review cases where MongoDB&#8217;s <strong>schemaless</strong> nature was significant. Examples of those included:</p>
<ul>
<li>A couple of top examples were of the kind &#8220;A bunch of apps, similar but not the same.&#8221; For MTV, it&#8217;s a single content management system for a bunch of websites. For Disney Playdom, it&#8217;s different schemas for every game.</li>
<li>For a wireless telco, the issue was a product catalog in which devices and service plans called for very different schemas, and which the telco felt had thus become unmanageable in Oracle.</li>
<li>For Craigslist, the issue wasn&#8217;t programming so much as performance &#8212; <a href="http://blog.zawodny.com/2010/04/27/i-want-a-new-data-store/">ALTER TABLE operations took months in MySQL</a>, and that&#8217;s not a typo, although I&#8217;ll confess to not understanding why this was the case.</li>
</ul>
<p>The 10gen guys went on to claim that schemalessness is helpful for incremental development in general, the point being that you don&#8217;t have a database-modification step. To some extent, changes can even be rolled back more easily than if you actually changed your schemas.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/27/mongodb-users-and-use-cases/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>Remote machine-generated data</title>
		<link>http://www.dbms2.com/2011/07/26/remote-machine-generated-data/</link>
		<comments>http://www.dbms2.com/2011/07/26/remote-machine-generated-data/#comments</comments>
		<pubDate>Tue, 26 Jul 2011 08:45:52 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Truviso]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5012</guid>
		<description><![CDATA[I refer often to machine-generated data, which is commonly generated inexpensively and in log-like formats, and is often best aggregated in a big bit bucket before you try to do much analysis on it. The term has caught on, to the point that perhaps it&#8217;s time to distinguish more carefully among different kinds of machine-generated [...]]]></description>
			<content:encoded><![CDATA[<p>I refer often to <a href="../../../../../2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a>, which is commonly generated inexpensively and in log-like formats, and is often best aggregated in a <a href="../../../../../2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a> before you try to do much analysis on it. The term has caught on, to the point that perhaps it&#8217;s time to distinguish more carefully among different <em>kinds</em> of machine-generated data. In particular, I think it may be useful to distinguish between:</p>
<ul>
<li><strong>Log-stream machine-generated data,</strong> when what you&#8217;re looking at &#8212; at least initially &#8212; is the entire output of verbose logging systems.</li>
<li><strong>Remote machine-generated data.</strong></li>
</ul>
<p>Here&#8217;s what I&#8217;m thinking of for the second category. I rather frequently hear of cases in which data is generated by large numbers of remote machines, which occasionally send messages home. For example:  <span id="more-5012"></span></p>
<ul>
<li>I heard yesterday about a case with 10s of millions of machines, phoning home every 5 minutes, and another case with 10s of 1000s calling in every 5 seconds, both of them sending data initially to MySQL. (Application details weren&#8217;t given.)</li>
<li>I heard not long ago about a set-top box case that the vendor hoped would also grow to 10s of millions of machines, which I guessed might send a small number of messages per hour each.</li>
<li>I also heard recently about a remote security monitoring case whose data was destined for (probably) Netezza, although in that case I&#8217;m not sure about the &#8220;occasionally&#8221; aspect of the communication.</li>
<li>The last time I visited Splunk, I got the sense that energy-sensor use cases (especially in the electric grid) had finally emerged. I believe these sensors are periodic message senders &#8212; they wake up, take their temperature (figuratively or literally as the case may be), send a message, snooze, repeat.</li>
<li>I would guess that the <a href="../../../../../2009/10/14/infobright-notes/">energy use cases</a> Infobright talked about in 2009 were of a similar kind.</li>
<li>An April, 2010 comment on the post linked above talks about <a href="../../../../../2010/04/08/machine-generated-data-example/#comment-165006">many kinds of sensor data</a>.</li>
<li>Back in 2007, <a href="../../../../../2007/08/12/applications-for-not-so-low-latency-cep/">Coral8</a> talked of a truck phone-home use case (on-board GPS data and also, e.g., refrigeration level, sending messages once a minute or so). Truviso seemed to have one similar deal before one of its frequent changes in strategic direction, and not coincidentally cites UPS as an investor.</li>
<li>In principle, there are a lot of RFID use cases out there, even if I rarely seem to hear of any. (That would be a shorter &#8220;phone call&#8221; home than most of the other examples, of course, but might be otherwise technically similar.)</li>
</ul>
<p>Many technologies can be used to collect and manage remote machine-generated data, but a few common points are worth noting.</p>
<ul>
<li>If a device takes the trouble to send a message across a wide-area network, that message may be somewhat more valuable than the average piece of log-vomit. Perhaps such information doesn&#8217;t need to be stored in the cheapest possible way.</li>
<li>Similarly, a message that is sent occasionally over time, or upon a specified event, may be more structured than a random log entry. Perhaps such data is suitable for sending straight to a <strong>relational database</strong>.</li>
<li>If there&#8217;s no central place the data originates, there may also be no favored place for the data to end up. It may make great sense to collect and analyze remote machine-generated data in the <strong>cloud.</strong> (Exceptions may of course arise if you want to use the data in connection with other information, and you hence want to bring it to that other information&#8217;s location.)</li>
<li>In a number of use cases, the whole point is to identify anomalies, and respond to them rapidly. I.e., remote machine-generated data use cases commonly raise challenges in low-latency <a href="../../../../../2011/03/30/short-request-and-analytic-processing/">integration of short-request and analytic processing</a>.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/26/remote-machine-generated-data/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>HBase is not broken</title>
		<link>http://www.dbms2.com/2011/07/18/hbase-is-not-broken/</link>
		<comments>http://www.dbms2.com/2011/07/18/hbase-is-not-broken/#comments</comments>
		<pubDate>Mon, 18 Jul 2011 05:25:27 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4990</guid>
		<description><![CDATA[It turns out that my impression that HBase is broken was unfounded, in at least two ways. The smaller is that something wrong with the HBase/Hadoop interface or Hadoop&#8217;s HBase support cannot necessarily be said to be wrong with HBase (especially since HBase is no longer a Hadoop subproject). The bigger reason is that, according [...]]]></description>
			<content:encoded><![CDATA[<p>It turns out that my impression that <a href="http://www.dbms2.com/2011/07/10/hadoop-futures-and-enhancements/">HBase is broken</a> was unfounded, in at least two ways. The smaller is that something wrong with the HBase/Hadoop interface or Hadoop&#8217;s HBase support cannot necessarily be said to be wrong with HBase (especially since HBase is no longer a Hadoop subproject). The bigger reason is that, according to consensus, <strong>HBase has worked pretty well since the .90 release</strong> in January of this year.</p>
<p>After Michael Stack of StumbleUpon beat me up for a while,* Omer Trajman of Cloudera was kind enough to walk me through HBase usage. He is informed largely by 18 Cloudera customers using HBase, plus a handful of other well-known HBase users such as Facebook, StumbleUpon, and Yahoo. Of the 18 Cloudera customers Omer was thinking of, 15 are in HBase production, one is in HBase &#8220;early production&#8221;, one is still doing R&amp;D in the area of HBase, and one is a classified government customer not providing such details.<span id="more-4990"></span></p>
<p><em>*Just kidding &#8212; he was actually extremely gentle.</em></p>
<p>In the use cases that Omer offered, what&#8217;s stored in HBase is almost always <strong>records of web or network activity.</strong> Specific examples included clickstream information (at 5 different ad companies), crash reports (at Mozilla), and messages (at Facebook). Sometimes the data gets into Hadoop twice &#8212; once excerpted via HBase and once as part of a full log &#8212; and may even live in two different Hadoop clusters.</p>
<p>What&#8217;s served out from HBase in Omer&#8217;s examples is usually <a href="../../../../../2011/06/19/investigative-analytics-derived-data/">derived data</a>, such as a user profile, an ad selection, a text index, etc. That makes sense, not least because if you&#8217;re going to keep enhancing your data, schema-free programming &#8212; which HBase offers &#8212; looks ever more appealing. Omer further said that there are a growing number of cases in which HBase is being used to serve up reference data for batch MapReduce jobs, but he didn&#8217;t have specifics. A counterexample to the derived data emphasis would be, if I understood correctly, a case where HBase manages shopping carts.</p>
<p>I haven&#8217;t put much effort into unearthing open source or other third-party HBase-based projects, but two examples are OpenTSDB (Time Series DataBase) and Lily CMS (Content Management System). <em>(Edit: But see the comment about Lily below.)</em></p>
<p>Omer is perhaps my top go-to guy on <a href="../../../../../2011/07/06/petabyte-hadoop-clusters/">database and cluster sizes</a>, so of course I asked him for HBase metrics as well. He responded (approximately) that Cloudera HBase customer installations average 20-30 nodes, but that half a dozen are in the 100-200 node range.</p>
<p>Finally, there&#8217;s the matter of latency. As a general rule, the HBase users Omer sees are using HBase with at least several minutes of latency. (Again, that shopping cart case would seem to be a counterexample.) So, for example, the data recorded when you click on a page isn&#8217;t immediately applied toward tweaking your profile to determine which ad you&#8217;ll see next &#8212; but it might come into play after you spend a few minutes reading the page you&#8217;re on. Naturally, Omer knows of efforts to use HBase with lower latency yet, and I won&#8217;t be surprised if already-working examples of low-latency HBase show up in the comment thread to this post.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/18/hbase-is-not-broken/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Petabyte-scale Hadoop clusters (dozens of them)</title>
		<link>http://www.dbms2.com/2011/07/06/petabyte-hadoop-clusters/</link>
		<comments>http://www.dbms2.com/2011/07/06/petabyte-hadoop-clusters/#comments</comments>
		<pubDate>Wed, 06 Jul 2011 05:15:21 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4886</guid>
		<description><![CDATA[I recently learned that there are 7 Vertica clusters with a petabyte (or more) each of user data. So I asked around about other petabyte-scale clusters. It turns out that there are several dozen such clusters (at least) running Hadoop. Cloudera can identify 22 CDH (Cloudera Distribution [of] Hadoop) clusters holding one petabyte or more [...]]]></description>
			<content:encoded><![CDATA[<p>I recently learned that there are <a href="../../../../../2011/06/20/columnar-dbms-vendor-customer-metrics/">7 Vertica clusters with a petabyte</a> (or more) each of user data. So I asked around about other petabyte-scale clusters. It turns out that there are several dozen such clusters (at least) running Hadoop.</p>
<p>Cloudera can identify 22 CDH (Cloudera Distribution [of] Hadoop) clusters holding one petabyte or more of user data each, at 16 different organizations. This does not count Facebook or Yahoo, who are huge Hadoop users but not, I gather, running CDH. Meanwhile, Eric Baldeschwieler of Hortonworks tells me that Yahoo&#8217;s latest stated figures are:</p>
<ul>
<li>42,000 Hadoop nodes &#8230;</li>
<li>&#8230; holding 180-200 petabytes of data.</li>
</ul>
<p><span id="more-4886"></span>That works out near the low end of the range I came up with for Yahoo&#8217;s newest gear, namely <a href="http://www.dbms2.com/2011/07/06/hadoop-hardware-and-compression/">36-90 TB/node</a>. Yahoo&#8217;s biggest clusters are a little over 4,000 nodes (a limitation that&#8217;s getting worked on), and Yahoo has over 20 clusters in total.</p>
<p>Based on those numbers, it would seem that 10 or more of Yahoo&#8217;s Hadoop clusters are probably in the petabyte range. Facebook no doubt has a few petabyte-scale Hadoop clusters as well. So we&#8217;re probably over 3 dozen petabyte+ Hadoop clusters, just counting Yahoo, Facebook, and CDH users. There surely are others too, running Apache Hadoop without Cloudera&#8217;s help.</p>
<p>We also have some more information about the scale of Hadoop usage, and the markets it is being used in, because Omer Trajman of Cloudera kindly wrote the following &#8212; lightly edited as usual &#8212; for quotation:</p>
<blockquote><p>The number of Petabyte+ Hadoop clusters expanded dramatically over the past year, with our recent count reaching 22 in production (in addition to the well-known clusters at Yahoo! and Facebook). Just as our poll back at Hadoop World 2010 showed the average cluster size at just over 60 nodes, today it tops 200. While mean is not the same as median (most clusters are under 30 nodes), there are some beefy ones pulling up that average. Outside of the well-known large clusters at Yahoo and Facebook, we count today 16 organizations running PB+ clusters running CDH across a diverse number of industries including online advertising, retail, government, financial services, online publishing, web analytics and academic research. We expect to see many more in the coming years, as Hadoop gets easier to use and more accessible to a wide variety of enterprise organizations.</p></blockquote>
<p>Omer went on to add:</p>
<blockquote><p>The biggest number of PB clusters are in the advertising space. I often tell people that every ad you see on the internet touched at least one Hadoop cluster (or the Google equivalent).</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/06/petabyte-hadoop-clusters/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
	</channel>
</rss>

