<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; Splunk</title>
	<atom:link href="http://www.dbms2.com/category/products-and-vendors/splunk/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Thu, 09 Feb 2012 09:21:51 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>Splunk update</title>
		<link>http://www.dbms2.com/2012/01/10/splunk-update/</link>
		<comments>http://www.dbms2.com/2012/01/10/splunk-update/#comments</comments>
		<pubDate>Tue, 10 Jan 2012 05:55:08 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Structured documents]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5791</guid>
		<description><![CDATA[Splunk is announcing the Splunk 4.3 point release. Before discussing it, let&#8217;s recall a few things about Splunk, starting with: Splunk is first and foremost an analytic DBMS &#8230; &#8230; used to manage logs and similar multistructured data. Splunk&#8217;s DML (Data Manipulation Language) is based on text search, not on SQL. Splunk has extended its [...]]]></description>
			<content:encoded><![CDATA[<p>Splunk is announcing the Splunk 4.3 point release. Before discussing it, let&#8217;s recall a few things about Splunk, starting with:</p>
<ul>
<li>Splunk is first and foremost an analytic DBMS &#8230;</li>
<li>&#8230; used to manage logs and similar multistructured data.</li>
<li>Splunk&#8217;s DML (Data Manipulation Language) is based on text search, not on SQL.</li>
<li>Splunk has extended its DML in natural ways (e.g., you can use it to do calculations and even some statistics).</li>
<li>Splunk bundles some (very) basic, Splunk-specific business intelligence capabilities.</li>
<li>The paradigmatic use of Splunk is to monitor IT operations in real time. However:
<ul>
<li>There also are plenty of non-real-time uses for Splunk.</li>
<li>Splunk is proudest of its growth in non-IT quasi-real-time uses, such as the marketing side of web operations.</li>
</ul>
</li>
</ul>
<p>As in any release, a lot of Splunk 4.3 is about &#8220;Oh, you didn&#8217;t have that before?&#8221; features and <a href="../../../../../2009/08/21/bottleneck-whack-a-mole/">Bottleneck Whack-A-Mole</a> performance speed-up. One performance enhancement is Bloom filters, which are a very hot topic these days. More important is a switch from Flash to HTML5, so as to accommodate mobile devices with less server-side rendering. Splunk reports that its users &#8212; especially the non-IT ones &#8212; really want to get Splunk information on tablet devices. While this somewhat contradicts <a href="../../../../../2012/01/04/some-issues-in-business-intelligence/">what I wrote a few days ago pooh-poohing mobile BI</a>, let me hasten to point out:</p>
<ul>
<li>Splunk is used for a lot of (quasi) real-time monitoring.</li>
<li>Splunk&#8217;s desktop user interfaces are, by BI standards, quite primitive.</li>
</ul>
<p>That&#8217;s pretty much the ideal scenario for mobile BI: Timeliness matters and prettiness doesn&#8217;t.</p>
<p><span id="more-5791"></span><em>Hmm. Maybe <a href="../../../../../2011/11/10/streambase-liveview-push-based-real-time-bi/">StreamBase LiveView</a> needs a mobile option as well &#8230;</em></p>
<p>Splunk&#8217;s basic use is to take the text string that is a log and make sense of it. But Splunk now also supports JSON structures. It does this via something called spath, which as you might guess from the name has XPath similarities. That probably bore more discussion than we found the time to have.</p>
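<p>To make the idea of path-based extraction from JSON concrete, here is a minimal Python sketch. It is not Splunk code and does not reproduce spath syntax; the helper name <code>extract_path</code>, the dot-separated path format, and the sample event are all assumptions made up for illustration.</p>

```python
import json


def extract_path(document, path):
    """Walk a parsed JSON value along a dot-separated path.

    Hypothetical helper for illustration only; it is not Splunk's spath.
    Returns None when any step of the path is missing.
    """
    current = document
    for step in path.split("."):
        if isinstance(current, dict) and step in current:
            # Object step: follow the named key.
            current = current[step]
        elif (isinstance(current, list) and step.isdigit()
              and int(step) in range(len(current))):
            # Array step: follow a zero-based index.
            current = current[int(step)]
        else:
            return None
    return current


# A made-up log event carrying a JSON payload.
event = json.loads(
    '{"user": {"id": 42, "tags": ["mobile", "beta"]}, "status": "ok"}'
)

user_id = extract_path(event, "user.id")
second_tag = extract_path(event, "user.tags.1")
missing = extract_path(event, "user.email")
```

<p>The point of the sketch is only that a path expression addresses a nested field directly, much as XPath does for XML, so a query can name a field without first flattening the event.</p>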
<p><em>By the way: If you&#8217;re interested in BI over XML, that&#8217;s what my former clients at Skytide were founded to do, before they pivoted a bit. I don&#8217;t think those capabilities have disappeared from the product</em>.</p>
<p><a href="http://www.monash.com/uploads/Splunk-4-3.pdf">Splunk has graciously allowed me to post a slide deck</a>. More stuff in there, including quotes from a customer &#8212; Expedia &#8212; that has 2700 Splunk users.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/01/10/splunk-update/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Big data terminology and positioning</title>
		<link>http://www.dbms2.com/2012/01/08/big-data-terminology-and-positioning/</link>
		<comments>http://www.dbms2.com/2012/01/08/big-data-terminology-and-positioning/#comments</comments>
		<pubDate>Mon, 09 Jan 2012 01:35:57 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5768</guid>
		<description><![CDATA[Recently, I observed that Big Data terminology is seriously broken. It is reasonable to reduce the subject to two quasi-dimensions: Bigness &#8212; Volume, Velocity, size Structure &#8212; Variety, Variability, Complexity given that High-velocity &#8220;big data&#8221; problems are usually high-volume as well.* Variety, variability, and complexity all relate to the simply-structured/poly-structured distinction. But the conflation should [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, I observed that <a href="../../../../../2011/09/11/big-data-has-jumped-the-shark/">Big Data terminology is seriously broken</a>. It is reasonable to reduce the subject to two quasi-dimensions:</p>
<ul>
<li><strong>Bigness</strong> &#8212; Volume, Velocity, Size</li>
<li><strong>Structure</strong> &#8212; Variety, Variability, Complexity</li>
</ul>
<p>given that</p>
<ul>
<li>High-velocity &#8220;big data&#8221; problems are usually high-volume as well.*</li>
<li>Variety, variability, and complexity all relate to the <a href="../../../../../2011/05/17/poly-structured-database/">simply-structured/poly-structured</a> distinction.</li>
</ul>
<p>But the conflation should stop there.</p>
<p><em>*Low-volume/high-velocity problems are commonly referred to as <a href="../2011/08/25/renaming-cep-or-not/">&#8220;event processing&#8221; and/or &#8220;streaming&#8221;</a>.</em></p>
<p>When people claim that bigness and structure are the same issue, they oversimplify into mush. So I think we need four pieces of terminology, reflective of a 2&#215;2 matrix of possibilities. For want of better alternatives, my suggestions are:</p>
<ul>
<li><strong>Relational big data</strong> is data of high volume that fits well into a relational DBMS.</li>
<li><strong>Multi-structured big data</strong> is data of high volume that doesn&#8217;t fit well into a relational DBMS. <em>Alternative: Poly-structured big data.</em></li>
<li><strong>Conventional relational data</strong> is data of not-so-high volume that fits well into a relational DBMS. <em>Alternatives: Ordinary/normal/smaller relational data.</em></li>
<li><strong>Smaller poly-structured data</strong> is data for which <a href="../../../../../2011/07/31/dynamic-fixed-schema-databases/">dynamic schema</a> capabilities are important, but which doesn&#8217;t rise to &#8220;big data&#8221; volume.</li>
</ul>
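<p>The 2&#215;2 matrix can be written down as a small lookup. This is only a sketch: the function name <code>classify</code> and its two boolean parameters are hypothetical, and real data sets will not always fall cleanly on one side of either dimension.</p>

```python
def classify(high_volume, fits_relational):
    """Map the two quasi-dimensions onto the four suggested terms.

    Both arguments are illustrative booleans: whether the data is of
    high volume, and whether it fits well into a relational DBMS.
    """
    if high_volume and fits_relational:
        return "relational big data"
    if high_volume:
        return "multi-structured big data"
    if fits_relational:
        return "conventional relational data"
    return "smaller poly-structured data"


# Log files are the paradigmatic high-volume, non-relational case.
log_files = classify(high_volume=True, fits_relational=False)
```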
<p><span id="more-5768"></span>Notes on all this include:</p>
<ul>
<li>&#8220;Relational big data&#8221; is commonly what you need a scalable analytic relational DBMS for. But there are non-analytic use cases as well.</li>
<li>The paradigmatic example of &#8220;multi-structured big data&#8221; is log files. Thus, multi-structured big data is commonly what you need a <a href="../../../../../2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a> for.</li>
<li>One might want to equate non-analytic relational big data technology to &#8220;NewSQL&#8221;. However, I&#8217;m struggling to think of a database size range in which the entire NewSQL industry can match Oracle&#8217;s market share alone.</li>
<li>One might want to equate non-analytic multi-structured big data technology to &#8220;NoSQL&#8221;. However:
<ul>
<li>&#8220;NoSQL&#8221; is also used to encompass not-so-big-data use cases, such as prototyping in MongoDB.</li>
<li><a href="../../../../../2011/10/02/defining-nosql/">&#8220;NoSQL&#8221; has non-ACID/low(er)-data-integrity connotations</a> that aren&#8217;t appropriate for all non-relational systems.</li>
</ul>
</li>
<li>Up to a point, you can analyze relational big data in a conventional relational DBMS, but an analytic RDBMS will usually win on TCO (Total Cost of Ownership). In particular, reasonable thresholds for moving an analytic database off Oracle might be:
<ul>
<li>1-2 terabytes if you&#8217;ve never bought anything past Oracle Standard Edition.</li>
<li>5-10 terabytes if you&#8217;re already paying for Oracle Enterprise Edition.</li>
<li>A lot higher than that if you actually find Oracle Exadata to be cost-effective.</li>
</ul>
</li>
<li>Depending on how big one acknowledges as &#8220;big&#8221;, the market share leader in &#8220;big bit bucket&#8221; use cases is either Splunk or Hadoop.</li>
<li>If we look at multi-structured big data management overall, MarkLogic joins the list of market share contenders, as do various NoSQL alternatives.</li>
<li>It is wrong to say that the large web companies invented &#8220;big data&#8221; technology. But it is more reasonable to say they invented much of &#8220;multi-structured big data&#8221; management. In particular (and this is just a partial list), Google, Amazon, Yahoo, Facebook, et al. can reasonably be credited with Hadoop, Cassandra, HBase and various predecessors to same.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/01/08/big-data-terminology-and-positioning/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Text data management, Part 1: Confusion</title>
		<link>http://www.dbms2.com/2011/10/10/text-data-management-confusion/</link>
		<comments>http://www.dbms2.com/2011/10/10/text-data-management-confusion/#comments</comments>
		<pubDate>Tue, 11 Oct 2011 01:58:03 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Archiving and information preservation]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5421</guid>
		<description><![CDATA[This is Part 1 of a three post series. The posts cover: Confusion about text data management. Choices for text data management (general and short-request). Choices for text data management (analytic). There&#8217;s much confusion about the management of text data, among technology users, vendors, and investors alike. Reasons seem to include: The terminology around text [...]]]></description>
			<content:encoded><![CDATA[<p><em>This is Part 1 of a three post series. The posts cover:</em></p>
<ol>
<li> <em><a href="http://www.dbms2.com/2011/10/10/text-data-management-confusion/">Confusion about text data management</a>.</em></li>
<li><em><a href="http://www.dbms2.com/2011/10/10/text-data-management-part-2-general-and-short-request/">Choices for text data management (general and short-request)</a>.</em></li>
<li><em><a href="http://www.dbms2.com/2011/10/10/text-data-management-part-3-analytic-and-progressively-enhanced/">Choices for text data management (analytic)</a>.</em></li>
</ol>
<p>There&#8217;s much confusion about the management of text data, among technology users, vendors, and investors alike. Reasons seem to include:</p>
<ul>
<li>The terminology around text data is inaccurate.</li>
<li>Data volume estimates for text are misleading.</li>
<li>Multiple different technologies are in the mix, including:
<ul>
<li>Enterprise text search.</li>
<li>Text analytics &#8212; <a href="http://www.texttechnologies.com/category/software-as-a-service-saas/category/text-mining/">text mining</a>, sentiment analysis, etc.</li>
<li>Document stores &#8212; e.g. document-oriented NoSQL, or MarkLogic.</li>
<li>Log management and parsing &#8212; e.g. Splunk.</li>
<li>Text archiving &#8212; e.g., various specialty email archiving products I couldn&#8217;t even name.</li>
<li>Public web search &#8212; Google et al.</li>
</ul>
</li>
<li>Text search vendors have disappointed, especially technically.</li>
<li>Text analytics vendors have disappointed, especially financially.</li>
<li>Other analytic technology vendors ignore <a href="http://www.texttechnologies.com/2010/12/01/state-of-the-art-text-analytics-mining-applications/">what the text analytic vendors actually have accomplished</a>, and reinvent inferior wheels rather than OEM the state of the art.</li>
</ul>
<p>Above all: <a href="http://www.dbms2.com/2011/10/10/text-data-management-part-2-general-and-short-request/">The use cases for text data vary greatly</a>, just as the use cases for simply-structured databases do.</p>
<p>There are probably fewer people now than there were six years ago who need to be told that <a href="http://www.dbms2.com/2005/12/09/relational-dbms-versus-text-data/">text and relational database management are very different things</a>. Other misconceptions, however, appear to be on the rise. Specific points that are commonly overlooked include: <span id="more-5421"></span></p>
<ul>
<li><strong> The terms &#8220;unstructured&#8221; or &#8220;semi-structured&#8221; data are inherently misleading</strong>. That&#8217;s why <a href="../../../../../2011/05/17/poly-structured-database/">I favor &#8220;multi-structured&#8221; or &#8220;poly-structured&#8221; instead</a>. (&#8220;Multi-structured&#8221; seems to be winning; e.g., it&#8217;s been adopted by Teradata and Teradata/Aster.)</li>
<li>The &#8220;social media&#8221; text data any one enterprise brings in house isn&#8217;t all that much. For example, <a href="../../../../../2011/04/14/attensity-update/">Attensity serves many different enterprises&#8217; social media needs from a single 20-terabyte data store</a>, and reports that no single enterprise has required as much as 1 terabyte of text yet. <strong>Text data may consume a lot of storage </strong>on spinning disks somewhere,<strong> but it&#8217;s not that big a factor in future DBMS industry growth.</strong> (That 20 terabyte figure does seem low.)</li>
<li><strong>Structured databases are typically worth a lot more per bit than other kinds.</strong> The most valuable electronic data, per-bit, is probably records of significant economic transactions &#8212; purchases, sales, money transfers, etc. The least valuable may be sensor log files, whose contents consist mainly of &#8220;Nothing going on here; ping you again in a minute.&#8221; Email logs, web interaction data and many other kinds fall somewhere in between. Highly valuable documents &#8212; such as signed contracts &#8212; generally persist in paper as well as electronic forms. <strong>Investors commonly overlook this point.</strong></li>
<li><strong>The enterprise text search industry is screwed up.</strong>
<ul>
<li>FAST was a goofy company before it was acquired for far too much money by Microsoft.</li>
<li>Autonomy was a goofy company before it was acquired for far too much money by HP.</li>
<li>Google&#8217;s enterprise efforts are quiet.</li>
<li>The integration of text search and relational DBMS &#8212; e.g. at Oracle &#8212; has languished, with poor performance and evident lack of management attention.</li>
<li>Smaller text search vendors don&#8217;t seem to be getting a lot of traction &#8212; e.g., <a href="http://www.texttechnologies.com/category/vendors/coveo/">Coveo</a> has a decent reputation, but when&#8217;s the last time you heard much about them? What has Attivio actually accomplished?</li>
</ul>
</li>
<li><strong>Text analytics is a small business</strong>. Add up the revenue for Attensity, Clarabridge, Lexalytics, Temis, and all the others, and you might poke above $100 million, especially now that Attensity has had a three-way merger. Then again, you might not.</li>
<li>Even so, <strong>the text analytics vendors have developed sophisticated technology.</strong> In particular, you can use it to get a pretty good idea as to what people are writing about you, individually or as groups.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/10/10/text-data-management-confusion/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>MongoDB users and use cases</title>
		<link>http://www.dbms2.com/2011/07/27/mongodb-users-and-use-cases/</link>
		<comments>http://www.dbms2.com/2011/07/27/mongodb-users-and-use-cases/#comments</comments>
		<pubDate>Wed, 27 Jul 2011 18:14:36 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Games and virtual worlds]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MongoDB and 10gen]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5031</guid>
		<description><![CDATA[I spoke with Eliot Horowitz and Max Schierson of 10gen last month about MongoDB users and use cases. The biggest clusters they came up with weren&#8217;t much over 100 nodes, but clusters an order of magnitude bigger were under development. The 100-node one we talked the most about had 33 replica sets, each with [...]]]></description>
			<content:encoded><![CDATA[<p>I spoke with Eliot Horowitz and Max Schierson of 10gen last month about MongoDB users and use cases. The biggest clusters they came up with weren&#8217;t much over 100 nodes, but clusters an order of magnitude bigger were under development. The 100-node one we talked the most about had 33 replica sets, each with about 100 gigabytes of data, so that&#8217;s in the 3-4 terabyte range total. In general, the largest MongoDB databases are 20-30 TB; I&#8217;d guess those really do use the bulk of available disk space. <span id="more-5031"></span></p>
<p>10gen recommends solid-state storage in many cases. In some cases solid-state lets you get away with fewer total nodes. 10gen also likes Flashcache (Facebook-developed technology to put a flash cache in front of hard disks). But the 100-node example mentioned above uses spinning disk.</p>
<p>Use cases 10gen is proud of include:</p>
<ul>
<li>Lots of user profile maintenance, including at online ad companies. This includes full user ad impression data. (I&#8217;ve argued for a while that <a href="../../../../../2010/09/17/jp-morgan-chase-oracle-database-outage/">user profile information belongs in something like a NoSQL database</a>.)</li>
<li>A big-name web company that wants to inspect every packet that enters their network, and replaced Splunk with MongoDB for performance reasons.</li>
<li>A big-name photo/video site whose metadata is all in MongoDB. (That&#8217;s the kind of thing that often makes for good <a href="../../../../../2011/05/30/another-category-of-derived-data/">MarkLogic</a> use cases.)</li>
</ul>
<p>But actually, the reason we had the call was to review cases where MongoDB&#8217;s <strong>schemaless</strong> nature was significant. Examples of those included:</p>
<ul>
<li>A couple of top examples were of the kind &#8220;A bunch of apps, similar but not the same.&#8221; For MTV, it&#8217;s a single content management system for a bunch of websites. For Disney Playdom, it&#8217;s different schemas for every game.</li>
<li>For a wireless telco, the issue was a product catalog in which devices and service plans called for very different schemas, and which the telco felt had thus become unmanageable in Oracle.</li>
<li>For Craigslist, the issue wasn&#8217;t programming so much as performance &#8212; <a href="http://blog.zawodny.com/2010/04/27/i-want-a-new-data-store/">ALTER TABLE operations took months in MySQL</a>, and that&#8217;s not a typo, although I&#8217;ll confess to not understanding why this was the case.</li>
</ul>
<p>The 10gen guys went on to claim that schemalessness is helpful for incremental development in general, the point being that you don&#8217;t have a database-modification step. To some extent, changes can even be rolled back more easily than if you actually changed your schemas.</p>
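<p>The incremental-development argument can be illustrated without any database at all. The Python sketch below uses plain dictionaries in a list as a stand-in for a schemaless collection; it does not use MongoDB or its API, and the field names and the helper <code>guild_of</code> are invented for the example.</p>

```python
# In-memory stand-in for a schemaless collection: each record is a dict.
collection = []

# Record written by an early version of a hypothetical game app.
collection.append({"player": "ann", "score": 10})

# A later version starts storing an extra field. There is no separate
# database-modification step; older records simply lack the field.
collection.append({"player": "bob", "score": 7, "guild": "red"})


def guild_of(record):
    """Read the newer field, falling back to a default for older records."""
    return record.get("guild", "none")


guilds = [guild_of(record) for record in collection]
```

<p>Rolling the change back amounts to no longer writing or reading the new field. The cost is that every reader has to tolerate records of more than one shape.</p>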
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/27/mongodb-users-and-use-cases/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>Remote machine-generated data</title>
		<link>http://www.dbms2.com/2011/07/26/remote-machine-generated-data/</link>
		<comments>http://www.dbms2.com/2011/07/26/remote-machine-generated-data/#comments</comments>
		<pubDate>Tue, 26 Jul 2011 08:45:52 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Truviso]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5012</guid>
		<description><![CDATA[I refer often to machine-generated data, which is commonly generated inexpensively and in log-like formats, and is often best aggregated in a big bit bucket before you try to do much analysis on it. The term has caught on, to the point that perhaps it&#8217;s time to distinguish more carefully among different kinds of machine-generated [...]]]></description>
			<content:encoded><![CDATA[<p>I refer often to <a href="../../../../../2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a>, which is commonly generated inexpensively and in log-like formats, and is often best aggregated in a <a href="../../../../../2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a> before you try to do much analysis on it. The term has caught on, to the point that perhaps it&#8217;s time to distinguish more carefully among different <em>kinds</em> of machine-generated data. In particular, I think it may be useful to distinguish between:</p>
<ul>
<li><strong>Log-stream machine-generated data,</strong> when what you&#8217;re looking at &#8212; at least initially &#8212; is the entire output of verbose logging systems.</li>
<li><strong>Remote machine-generated data.</strong></li>
</ul>
<p>Here&#8217;s what I&#8217;m thinking of for the second category. I rather frequently hear of cases in which data is generated by large numbers of remote machines, which occasionally send messages home. For example:  <span id="more-5012"></span></p>
<ul>
<li>I heard yesterday about a case with tens of millions of machines, phoning home every 5 minutes, and another case with tens of thousands calling in every 5 seconds, both of them sending data initially to MySQL. (Application details weren&#8217;t given.)</li>
<li>I heard not long ago about a set-top box case that the vendor hoped would also grow to tens of millions of machines, which I guessed might send a small number of messages per hour each.</li>
<li>I also heard recently about a remote security monitoring case whose data was destined for (probably) Netezza, although in that case I&#8217;m not sure about the &#8220;occasionally&#8221; aspect of the communication.</li>
<li>The last time I visited Splunk, I got the sense that energy-sensor use cases (especially in the electric grid) had finally emerged. I believe these sensors are periodic message senders &#8212; they wake up, take their temperature (figuratively or literally as the case may be), send a message, snooze, repeat.</li>
<li>I would guess that the <a href="../../../../../2009/10/14/infobright-notes/">energy use cases</a> Infobright talked about in 2009 were of a similar kind.</li>
<li>An April, 2010 comment on the post linked above talks about <a href="../../../../../2010/04/08/machine-generated-data-example/#comment-165006">many kinds of sensor data</a>.</li>
<li>Back in 2007, <a href="../../../../../2007/08/12/applications-for-not-so-low-latency-cep/">Coral8</a> talked of a truck phone-home use case (on-board GPS data and also, e.g., refrigeration level, sending messages once a minute or so). Truviso seemed to have one similar deal before one of its frequent changes in strategic direction, and not coincidentally cites UPS as an investor.</li>
<li>In principle, there are a lot of RFID use cases out there, even if I rarely seem to hear of any. (That would be a shorter &#8220;phone call&#8221; home than most of the other examples, of course, but might be otherwise technically similar.)</li>
</ul>
<p>Many technologies can be used to collect and manage remote machine-generated data, but a few common points are worth noting:</p>
<ul>
<li>If a device takes the trouble to send a message across a wide-area network, that message may be somewhat more valuable than the average piece of log-vomit. Perhaps such information doesn&#8217;t need to be stored in the cheapest possible way.</li>
<li>Similarly, a message that is sent occasionally over time, or upon a specified event, may be more structured than a random log entry. Perhaps such data is suitable for sending straight to a <strong>relational database</strong>.</li>
<li>If there&#8217;s no central place the data originates, there may also be no favored place for the data to end up. It may make great sense to collect and analyze remote machine-generated data in the <strong>cloud. </strong>(Exceptions may of course arise if you want to use the data in connection with other information, and you hence want to bring it to that other information&#8217;s location.)</li>
<li>In a number of use cases, the whole point is to identify anomalies, and respond to them rapidly. I.e., remote machine-generated data use cases commonly raise challenges in low-latency <a href="../../../../../2011/03/30/short-request-and-analytic-processing/">integration of short-request and analytic processing</a>.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/26/remote-machine-generated-data/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Dirty data, stored dirt cheap</title>
		<link>http://www.dbms2.com/2011/06/04/dirty-data-stored-dirt-cheap/</link>
		<comments>http://www.dbms2.com/2011/06/04/dirty-data-stored-dirt-cheap/#comments</comments>
		<pubDate>Sat, 04 Jun 2011 22:49:24 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Splunk]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4606</guid>
		<description><![CDATA[A major driver of Hadoop adoption is the &#8220;big bit bucket&#8221; use case. Users take a whole lot of data, often machine-generated data in logs of different kinds, and dump it into one place, managed by Hadoop, at open-source pricing. Hadoop hardware doesn&#8217;t need to be that costly either. And once you get that data [...]]]></description>
			<content:encoded><![CDATA[<p>A major driver of Hadoop adoption is the &#8220;big bit bucket&#8221; use case. Users take a whole lot of data, often <a href="../../../../../2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a> in logs of different kinds, and dump it into one place, managed by Hadoop, at open-source pricing. <a href="http://www.dbms2.com/2011/06/04/hardware-for-hadoop/">Hadoop hardware</a> doesn&#8217;t need to be that costly either. And once you get that data into Hadoop, there are <a href="../../../../../2009/10/10/enterprises-using-hadoo/">a whole lot of things you can do with it</a>.</p>
<p>Of course, there are various outfits who&#8217;d like to sell you not-so-cheap bit buckets. Contending technologies include <a href="../../../../../2011/06/02/why-you-would-want-an-appliance-and-when-you-wouldnt/">Hadoop appliances</a> (which I don&#8217;t believe in), <a href="../../../../../2009/10/18/technical-introduction-to-splunk/">Splunk</a> (which in many use cases I do), and <a href="../../../../../2010/11/29/marklogic-and-its-document-dbms/">MarkLogic</a> (ditto, but often the cases are different from Splunk&#8217;s). Cloudera and IBM, among other vendors, would also like to sell you some proprietary software to go with your standard Apache Hadoop code.</p>
<p>So the question arises &#8212; <strong>why would you want to spend serious money to look after your low-value data? </strong>The answer, of course, is that <strong>maybe your log data isn&#8217;t so low-value.</strong> <span id="more-4606"></span>True, the signal-to-noise ratio in purely machine-generated data is rarely high (web logs and so on may be an exception). But if the signal is sufficiently important, the overall data set may have decent average value. Intelligence work is one case where the occasional black swan might justify gilded cages for the whole aviary; the same might go for other forms of especially paranoid security.</p>
<p>For example, I was told of one big bank that was pulling 5 GB of logs every half hour into Splunk (selected for performance), or at least planning to. The application was forensics to protect against internal fraudulent trading, something that&#8217;s been a multi-hundred million or even multi-billion dollar problem at various investment banks in the past. I have no idea what the retention policy on those logs is, but clearly the core application can support higher-than-Hadoop pricing.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/06/04/dirty-data-stored-dirt-cheap/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>What to do about &#8220;unstructured data&#8221;</title>
		<link>http://www.dbms2.com/2011/05/15/what-to-do-about-unstructured-data/</link>
		<comments>http://www.dbms2.com/2011/05/15/what-to-do-about-unstructured-data/#comments</comments>
		<pubDate>Sun, 15 May 2011 21:54:30 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Couchbase]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[MongoDB and 10gen]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4449</guid>
		<description><![CDATA[We hear much these days about unstructured or semi-structured (as opposed to structured) data. Those are misnomers, however, for at least two reasons. First, it&#8217;s not really the data that people think is un-, semi-, or fully structured; it&#8217;s databases.* Relational databases are highly structured, but the data within them is unstructured &#8212; just lists [...]]]></description>
			<content:encoded><![CDATA[<p>We hear much these days about <em>unstructured</em> or <em>semi-structured</em> (as opposed to <em>structured</em>) data. Those are misnomers, however, for at least two reasons. First, <strong>it&#8217;s not really the data that people think is un-, semi-, or fully structured; it&#8217;s databases.</strong>* Relational databases are highly structured, but the data within them is unstructured &#8212; just lists of numbers or character strings, whose only significance derives from the structure that the database imposes.</p>
<p><em>*Here I&#8217;m using the term &#8220;database&#8221; literally, rather than as a concise  synonym for &#8220;database management system&#8221;. But see below.<br />
</em></p>
<p>Second, a more accurate distinction is<strong> not whether a database has one structure or none </strong>&#8211; it&#8217;s<strong> whether a database has one structure or many.</strong> The easiest way to see this is for databases that have clearly-defined schemas. A relational database has one schema (even if it is just the union of various unrelated sub-schemas); an XML database, however, can have as many schemas as it contains documents.</p>
<p>One small terminological problem is easily handled, namely that people don&#8217;t talk about true databases very often, at least when they&#8217;re discussing generalities; rather, they talk about data and DBMS.* So let&#8217;s talk of DBMS being &#8220;structured&#8221; singly or multiply or whatever, just as the databases they&#8217;re designed to manage are.</p>
<p><em>*And they refer to the DBMS as &#8220;databases,&#8221; because they don&#8217;t have much other use for the word. </em></p>
<p>All that said &#8212; I think that <strong>single vs. multiple database structures isn&#8217;t a bright-line binary distinction</strong>; rather, it&#8217;s a <strong>spectrum.</strong> For example:  <span id="more-4449"></span></p>
<ul>
<li>IMS is the most structured DBMS of all. The data in an IMS database is in a hierarchy, and that&#8217;s that.</li>
<li>CODASYL and other kinds of what used to be called <em>network</em> DBMS (before the word got so overloaded) &#8212; e.g. RDB, IDMS, or TOTAL &#8212; are/were almost as structured as IMS.</li>
<li>Relational databases were invented because their structure was more flexible than that of linked-list databases. The whole point of relational DBMS is that you can view the data in a multitude of ways. Still, I see classical relational databases as being toward the single-structure end of the spectrum. (I say &#8220;classical&#8221; because Oracle and DB2 actually can manage combinations of XML, text, and traditional relational tables, if you choose.)</li>
<li>A multivalue DBMS is a little more multi-structured than a relational one, because of how a field can be filled one or multiple times.</li>
<li><a href="../../../../../2010/10/06/ebay-followup-greenplum-out-teradata-10-petabytes-hadoop-has-some-value-and-more/">eBay Singularity</a> (as implemented on Teradata gear) has, in essence, two structures (that I know of). One structure is just the relational schema. The other is the structure you would get if each kind of name-value pair truly had its own column.</li>
<li>A Splunk collection of log data can reasonably be said to have a different structure for every type or source of log. It further can be said to have multiple structures in somewhat the same way that eBay Singularity does.</li>
<li>So-called <a href="../../../../../2011/02/07/notes-on-document-oriented-nosql/">document stores</a> can be very multi-structured. MongoDB, Couchbase, et al. let you have a different structure for every document, if you choose. The same goes for XML-based MarkLogic.</li>
<li>HBase and Cassandra are also very multi-structured. Theoretically, each record gets to decide which column sets it does or doesn&#8217;t fit into.</li>
</ul>
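<p>To make the one-structure-versus-many idea concrete, here is a minimal sketch in Python. It assumes that a record&#8217;s "structure" is simply its set of field names, which is a simplification; the sample rows, documents, and function names are all hypothetical.</p>

```python
from collections import Counter


def structure_of(record):
    """Return the record's structure: the sorted tuple of its field names."""
    return tuple(sorted(record))


def count_structures(records):
    """Count how many records share each distinct structure."""
    return Counter(structure_of(r) for r in records)


# Hypothetical relational-style rows: every row has the same columns.
relational_rows = [
    {"id": 1, "name": "Ann", "city": "Boston"},
    {"id": 2, "name": "Bob", "city": None},  # the column exists even when empty
]

# Hypothetical document-style records: each one describes itself.
documents = [
    {"id": 1, "name": "Ann", "city": "Boston"},
    {"id": 2, "name": "Bob"},
    {"id": 3, "title": "Q3 report", "tags": ["finance"]},
]

print(len(count_structures(relational_rows)))  # number of distinct structures
print(len(count_structures(documents)))
```

<p>Under that simplification, the relational rows collapse to a single structure, while the document collection has as many structures as it has differently shaped documents.</p>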
<p>As a general rule &#8212; the more structures a database can have at once, the easier it is to change those structures, even on the fly (e.g., by inserting yet another bit of self-describing data). Thus, I sometimes use the term <strong>polystructured </strong>instead of<strong> multi-structured </strong>or <strong>multistructured.</strong> Thoughts as to which term I should choose going forward would be much appreciated.</p>
<p>As for an actual definition &#8212; well, here&#8217;s something I drafted 3 1/2 years ago but never published:</p>
<blockquote><p>These problems with the relational paradigm are big enough to be worth coining a word for – polystructured. Polystructured data is data with structure that:</p>
<ul>
<li>Can be exploited to provide most of the benefits of a highly structured database (e.g., a tabular/relational one) &#8230;</li>
<li>&#8230; but cannot be described in the concise, consistent form such highly structured systems require.</li>
</ul>
<p>Specifically, we’ll call a database “polystructured” if it is characterized by at least two of the following:</p>
<ol>
<li>Data suitable for being queried by simple predicate-based matching (e.g., equality to certain values, falling within ranges, etc.)</li>
<li>(Other) data suitable for being queried by more complex matching (e.g., text search relevancy rankings)</li>
<li>Subsets that are more neatly structured than the whole.</li>
</ol>
<p>Equivalently, we’ll just say that <strong>polystructured data is data that has considerable structure, but whose structure is in some important way unpredictable.</strong></p></blockquote>
<p>NoSQL document or &#8220;column&#8221; stores would satisfy #1 and #3, as would Splunk. MarkLogic would satisfy all three criteria. #1 + #2 is sort of like what happens when text queries are allowed to go against (groups of) relational columns &#8230; and the vagueness with which I&#8217;m saying that makes me suspect that at least the unbolded/first definition doesn&#8217;t really fly.</p>
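<p>As a rough illustration of criteria #1 and #2, the sketch below runs a range predicate and a crude relevancy ranking over the same small set of records. Everything in it is an assumption made for illustration: the field names, the sample data, and especially the scoring function, which just counts query-term occurrences and is far simpler than any real text search engine.</p>

```python
def matches(record, field, low, high):
    """Criterion 1: simple predicate, true if record[field] lies in [low, high]."""
    value = record.get(field)
    return value is not None and low <= value <= high


def relevancy(record, query_terms):
    """Criterion 2: crude relevancy score, counting query-term occurrences."""
    words = record.get("text", "").lower().split()
    return sum(words.count(term) for term in query_terms)


# Hypothetical records; note that not every record has a price field.
records = [
    {"id": 1, "price": 40, "text": "fast log search for web logs"},
    {"id": 2, "price": 90, "text": "log archive"},
    {"id": 3, "text": "search search search"},
]

# Match/fail answer: which records have a price between 0 and 50?
in_range = [r["id"] for r in records if matches(r, "price", 0, 50)]

# Ranked answer: order all records by how relevant they are to the query.
ranked = sorted(records, key=lambda r: relevancy(r, ["search", "log"]), reverse=True)

print(in_range)
print([r["id"] for r in ranked])
```

<p>The predicate gives a strict yes/no per record and silently skips records that lack the field; the ranking gives every record a position, which is the kind of "somewhat relevant" answer a match/fail query cannot express.</p>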
<p>Finally, here&#8217;s what led up to those definitions (the whole thing is from the introduction to a never-completed white paper). Please forgive any  anachronisms in it. A number of the points in it have also been addressed in posts  here; e.g.,</p>
<ul>
<li>In December, 2005 I expounded on <a href="http://www.dbms2.com/2005/12/09/relational-dbms-versus-text-data/">the  mismatch between text data and the relational model</a>.</li>
<li>In June, 2010 I elucidated <a href="http://www.dbms2.com/2010/06/08/profile-of-revealed-preferences/">the  variety of data that could go into an individual&#8217;s marketing-oriented  profile</a>.</li>
<li>In February, 2008 I predicted that <a href="../2008/02/15/non-relational-database-management/">flexible-schema   DBMS would gain share</a>.</li>
</ul>
<blockquote><p><strong>The case for polystructured data</strong></p>
<p>Traditional computer databases amount to sets of records. There usually are a limited number of record formats, with each instance of a particular format containing parallel kinds of information. Business transactions, web page visits, instrument readings &#8211; whatever the nature of the information, application designers stick it into the simplest structure they think makes sense.</p>
<p>These records are arranged into a variety of data structures.</p>
<ul>
<li>Log files are widely used, especially to track web site visits, in other networking uses, and for other kinds of instrument readings.</li>
<li>Computer user administration is commonly in LDAP (Lightweight Directory Access Protocol) format.</li>
<li>There are still a lot of installations of legacy “linked-list” DBMS (DataBase Management Systems) such as IBM&#8217;s IMS.</li>
<li>Some decision support applications use data in multidimensional arrays.</li>
</ul>
<p>Even so, most new business applications are written over relational DBMS, in the well-known rows-and-tables paradigm.</p>
<p>There are good reasons for the dominance of the relational model and of rows and tables.  (Strictly speaking, “relational” equates neither to “rows and tables” nor to “SQL”, but in practice the three concepts are closely linked.) In particular:</p>
<ul>
<li>Data integrity is (fairly) easy to ensure.</li>
<li>From some standpoints, relational databases are flexible; you can construct almost any kind of query, without having to do any kind of database reorganization (except perhaps for performance).</li>
<li>SQL programmers are easy to find.</li>
<li>There&#8217;s simply been much more engineering effort invested in making good relational DBMS than in any other kind.</li>
</ul>
<p>But the relational database paradigm also has some major drawbacks.  Three of the big ones are:</p>
<ul>
<li>Queries must have strictly match/fail answers; there&#8217;s no natural way for a relational DBMS to handle “somewhat relevant” hits.</li>
<li>Relational databases can get cumbersome when large fractions of the potential data happen to be missing. (Hence the decades-long debates about the problems with NULL values.)</li>
<li>While you have good flexibility in querying against any particular data structure, you do have to predefine your structure before you start accepting input.</li>
</ul>
<p>The last point is why you wind up with all those NULL values in the first place; if a kind of information can be in any record in a set, the database is set up to assume that it&#8217;s present in all of them. Or if you normalize your database so highly as to avert missing values, then you wind up with a huge number of tables, making queries (and updates) complicated from both the programmer&#8217;s and the machine&#8217;s standpoint.</p>
<p>Text apps suffer from RDBMS&#8217; inelegant handling of relevancy. What&#8217;s more, documents can have almost unlimited internal structures, in three senses:</p>
<ol>
<li>They can have chapters, sections, subsections, sidebars, footnotes, and so on, in any combination.</li>
<li>Semantic references can link words, phrases, sentences, and paragraphs in a near-infinite number of ways.</li>
<li>Documents can explicitly contain fielded data, such as numbers, addresses, dates, or geo-encodings.</li>
</ol>
</ol>
<p>Another group of apps that suffer from RDBMS&#8217; limitations are in the area of personalization and similar fine-grained marketing analysis. Analysis of web clicks throws away most kinds of path information.  Analysis of written or verbal communication isn&#8217;t well-integrated with that of fielded data.  Different customers and prospects give different kinds of contact information, and are “touched” by different marketing initiatives; current systems do a poor job of integrating all that scattered information.</p></blockquote>
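<p>The NULL-versus-normalization trade-off described in the excerpt can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the contact profiles, column list, and function names are hypothetical, and the "one table per attribute" layout is just one simple way to normalize missing values away.</p>

```python
COLUMNS = ["email", "phone", "twitter"]

# Hypothetical contact profiles; each person supplies different fields.
profiles = {
    "ann": {"email": "ann@example.com"},
    "bob": {"phone": "555-0100", "twitter": "@bob"},
}


def to_wide_rows(profiles, columns):
    """One predefined row shape; absent values become None (NULL)."""
    return [[name] + [fields.get(c) for c in columns]
            for name, fields in profiles.items()]


def to_attribute_tables(profiles):
    """Highly normalized alternative: one (name, value) table per attribute."""
    tables = {}
    for name, fields in profiles.items():
        for attr, value in fields.items():
            tables.setdefault(attr, []).append((name, value))
    return tables


wide = to_wide_rows(profiles, COLUMNS)
nulls = sum(cell is None for row in wide for cell in row)
tables = to_attribute_tables(profiles)

print(nulls)        # NULL cells in the single wide table
print(len(tables))  # tables needed to avoid NULLs entirely
```

<p>With only three possible attributes, the wide table is already half empty, and the normalized form already needs three tables; both numbers grow with every new kind of contact information.</p>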
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/05/15/what-to-do-about-unstructured-data/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
		</item>
		<item>
		<title>Updating our vendor client disclosures</title>
		<link>http://www.dbms2.com/2011/02/28/updating-our-vendor-client-disclosures/</link>
		<comments>http://www.dbms2.com/2011/02/28/updating-our-vendor-client-disclosures/#comments</comments>
		<pubDate>Mon, 28 Feb 2011 08:03:39 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[About this blog]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Couchbase]]></category>
		<category><![CDATA[EMC]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[Infobright]]></category>
		<category><![CDATA[Intel]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[ParAccel]]></category>
		<category><![CDATA[QlikTech and QlikView]]></category>
		<category><![CDATA[SAND Technology]]></category>
		<category><![CDATA[SAP AG]]></category>
		<category><![CDATA[Schooner Information Technology]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Sybase]]></category>
		<category><![CDATA[Tableau Software]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[dbShards and CodeFutures]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=3906</guid>
		<description><![CDATA[From time to time, I disclose our vendor client lists. Another iteration is below. To be clear: This is a list of Monash Advantage members. All our vendor clients are Monash Advantage members, unless &#8230; &#8230; we work with them primarily in their capacity as technology users. (A large fraction of our user clients happen [...]]]></description>
			<content:encoded><![CDATA[<p>From time to time, I <a href="http://www.monashreport.com/2010/01/06/updating-our-disclosures/">disclose</a> our vendor client lists. Another iteration is below. To be clear:</p>
<ul>
<li>This is a list of <a href="http://www.monash.com/advantage.html"><strong><em>Monash Advantage</em></strong></a> members.</li>
<li>All our vendor clients are <strong><em>Monash Advantage</em></strong> members, unless &#8230;</li>
<li>&#8230; we work with them primarily in their capacity as technology users. (A large fraction of our user clients happen to be SaaS vendors.)</li>
<li>We do not usually disclose our user clients.</li>
<li>We do not usually disclose our venture capital clients, nor those who invest in publicly-traded securities.</li>
<li>Included in the list below are two expired <strong><em>Monash Advantage</em></strong> members who haven&#8217;t said they will renew, as mentioned in <a href="http://www.strategicmessaging.com/money-analyst-attention-and-implied-analyst-endorsement/2011/02/28/">my recent post on analyst bias</a>. (You can probably imagine a couple of reasons for that obfuscation.)</li>
</ul>
<p>With that said, our vendor client disclosures at this time are:</p>
<ul>
<li>Aster Data</li>
<li>Cloudera</li>
<li>CodeFutures/dbShards</li>
<li>Couchbase</li>
<li>EMC/Greenplum</li>
<li>Endeca</li>
<li>IBM/Netezza</li>
<li>Infobright</li>
<li>Intel</li>
<li>MarkLogic</li>
<li>ParAccel</li>
<li>QlikTech</li>
<li>salesforce.com/database.com</li>
<li>SAND Technology</li>
<li>SAP/Sybase</li>
<li>Schooner Information Technology</li>
<li>Skytide</li>
<li>Splunk</li>
<li>Teradata</li>
<li>Vertica</li>
</ul>
<p><span id="more-3906"></span>That list includes the two I&#8217;m obfuscating, plus one more who just emailed to say a signed renewal contract is arriving this week. It does not include others who, less concretely, have said they will sign up soon.</p>
<p>Also, I guess there&#8217;s a bit of a gray area for Tableau. As far as I&#8217;m concerned, I&#8217;m doing <a href="http://www.dbms2.com/2011/02/12/upcoming-webinar-on-investigative-analytics/">an upcoming co-sponsored webinar</a> just for <em><strong>Monash Advantage</strong></em> member Aster Data. Indeed, I declined to contract with or bill Tableau directly for its share,  because I had no good way to do that paperwork. But even so, Tableau is a cosponsor, was involved in the planning discussions and, behind the scenes, is surely footing part of the bill.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/02/28/updating-our-vendor-client-disclosures/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Clearing up MapReduce confusion, yet again</title>
		<link>http://www.dbms2.com/2009/12/30/clearing-up-mapreduce-confusion-yet-again/</link>
		<comments>http://www.dbms2.com/2009/12/30/clearing-up-mapreduce-confusion-yet-again/#comments</comments>
		<pubDate>Wed, 30 Dec 2009 10:50:53 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[SenSage]]></category>
		<category><![CDATA[Splunk]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1371</guid>
		<description><![CDATA[I&#8217;m frustrated by a constant need &#8212; or at least urge &#8212; to correct myths and errors about MapReduce. Let&#8217;s try one more time: MapReduce was named and popularized &#8212; but not invented &#8212; by Google. &#8220;MapReduce&#8221; variously refers to: A programming paradigm Execution engines that implement the programming paradigm Distributed file systems that work [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m frustrated by a constant need &#8212; or at least urge <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  &#8212; to correct <a href="http://www.dbms2.com/2009/10/18/three-big-myths-about-mapreduce/">myths and errors about MapReduce</a>. Let&#8217;s try one more time:<span id="more-1371"></span></p>
<ul>
<li>MapReduce was named and popularized &#8212; but not invented &#8212; by Google.</li>
<li>&#8220;MapReduce&#8221; variously refers to:
<ul>
<li>A programming paradigm</li>
<li>Execution engines that implement the programming paradigm</li>
<li>Distributed file systems that work with the execution engines</li>
</ul>
</li>
<li>In particular, Hadoop is a MapReduce execution engine that includes or is closely associated with HDFS (Hadoop Distributed File System).</li>
<li>MapReduce and analytic DBMS can interact in a number of different ways, including:
<ul>
<li>Tight integration between a DBMS and exposed MapReduce functionality, e.g. <a href="http://www.dbms2.com/2009/10/15/mapreduce-webinar-slides/">Aster Data&#8217;s SQL/MapReduce</a> or Greenplum.</li>
<li>Integrated MapReduce &#8220;under the covers&#8221;, e.g. SenSage or <a href="http://www.dbms2.com/2009/10/06/oracle-mapreduce/">Oracle</a>. This may or may not follow all the rules Google laid out for MapReduce, but it&#8217;s at least similar in spirit.</li>
<li>Looser coupling between DBMS and a MapReduce system, e.g. <a href="http://www.dbms2.com/2009/08/04/verticas-version-of-mapreduce-integration/">Vertica/Hadoop</a>, in which MapReduce may or may not run on a different cluster than the DBMS.</li>
<li>Not at all, except perhaps insofar as a quasi-DBMS such as <a href="http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/">Hive</a> is implemented over a MapReduce system such as Hadoop/HDFS.</li>
</ul>
</li>
<li>As predicted by <a href="http://www.strategicmessaging.com/monashs-first-law-of-commercial-semantics-explained/2009/01/09/">Monash&#8217;s First Law of Commercial Semantics</a>, different vendors have individual variants on those themes. For example, as per <a href="http://www.splunk.com/product">a registration-required white paper</a>, Splunk is moving to publicly expose a not-quite-complete form of MapReduce.</li>
<li>MapReduce implementations such as Hadoop are sometimes regarded as part of the <a href="http://www.dbms2.com/2009/12/12/legit-nosql-key-value-store/">NoSQL</a> &#8220;movement&#8221;. When they are, many generalities about NoSQL &#8212; such as that it doesn&#8217;t deal with analytics &#8212; are falsified.</li>
<li>So far as I can tell, mainstream enterprise (as opposed to web, scientific, investment, etc.) data mining folks may be looking at MapReduce for data mining, but they haven&#8217;t done much to adopt it yet. Probably that&#8217;s because the outfits who have the greatest need are the same ones that have the largest sunk investments in more traditional ways of doing data mining.</li>
<li>Cloudera != Hadoop. On the other hand, if you want to use Hadoop, it makes a lot of sense to do business with Cloudera.</li>
<li>Non-DBMS MapReduce != Hadoop. On the other hand, Hadoop is the default choice for non-DBMS MapReduce.</li>
<li>MapReduce != Hadoop, period. DBMS-based MapReduce is also a legitimate technical strategy.</li>
</ul>
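<p>For readers who want to see the programming paradigm itself, separate from any execution engine or file system, here is a minimal single-process sketch in Python. It is not Hadoop or any vendor&#8217;s API; the function names and the three-token log format are assumptions made for illustration.</p>

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, the step that sits between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key to that key's list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def status_mapper(line):
    # Hypothetical log format: "METHOD PATH STATUS"; emit (status, 1).
    yield line.split()[-1], 1


def sum_reducer(key, values):
    return sum(values)


log_lines = ["GET /a 200", "GET /b 404", "POST /a 200"]
counts = reduce_phase(shuffle(map_phase(log_lines, status_mapper)), sum_reducer)
print(counts)
```

<p>An execution engine adds what this sketch leaves out: partitioning the input across machines, moving the grouped pairs between them, and recovering from failures.</p>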
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/12/30/clearing-up-mapreduce-confusion-yet-again/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Technical introduction to Splunk</title>
		<link>http://www.dbms2.com/2009/10/18/technical-introduction-to-splunk/</link>
		<comments>http://www.dbms2.com/2009/10/18/technical-introduction-to-splunk/#comments</comments>
		<pubDate>Sun, 18 Oct 2009 16:01:06 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Structured documents]]></category>
		<category><![CDATA[Text]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1124</guid>
		<description><![CDATA[As noted in my other introductory post, Splunk sells software called Splunk, which is used for log analysis. These can be logs of various kinds, but for the purpose of understanding Splunk technology, it&#8217;s probably OK to assume they&#8217;re clickstream/network event logs. In addition, Splunk seems to have some aspirations of having its software used [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">As noted in <a href="http://www.dbms2.com/2009/10/18/general-introduction-to-splunk/">my other introductory post</a>, Splunk sells software called Splunk, which is used for log analysis. These can be logs of various kinds, but for the purpose of understanding Splunk technology, it&#8217;s probably OK to assume they&#8217;re clickstream/network event logs. In addition, Splunk seems to have some aspirations of having its software used for general schema-free analytics, but that&#8217;s in early days at best.</p>
<p style="margin-bottom: 0in;">Splunk&#8217;s core technology indexes text and XML files or streams, especially log files. Technical highlights of that part include:<span id="more-1124"></span></p>
<ul>
<li>Splunk software both reads logs and indexes them. The same code runs both on the nodes that do the indexing and on machines that simply emit logs. However, in the latter case indexing is turned off. Thus, Splunk does not portray its software as “agentless.” However, it asserts that its agent-like software runs without “material” overhead.</li>
<li>The fundamental thing that Splunk looks at is an increment to a log – i.e., whatever has been added to the log since Splunk last looked at it.</li>
<li>Splunk tries to figure out what the individual entries are in a section of log it looks at. In particular:
<ul>
<li>Time stamps are a big clue in this “inferencing” process, but they are not the be-all and end-all.</li>
<li>Nor are line boundaries, if logs are naturally broken up into lines. (Splunk threw that latter comment in as a shot at SenSage.)</li>
</ul>
</li>
<li>I get the impression that most Splunk entity extraction is done at search time, not at indexing time. Splunk says that, if a &lt;name, value&gt; pair is clearly marked, its software does a good job of recognizing same. Beyond that, fields seem to be specified by users when they define searches.</li>
<li>Splunk has a simple ILM (Information Lifecycle Management) story based on time. I didn&#8217;t probe for details.</li>
</ul>
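<p>The ideas in that list (reading only the increment, using time stamps as a clue to event boundaries, and pulling out clearly marked name/value pairs at search time) can be sketched as below. This is emphatically not Splunk&#8217;s code or algorithm, which I have not seen; the timestamp pattern, the <code>name=value</code> convention, and all function names are assumptions chosen to keep the example small.</p>

```python
import re

# Assumed timestamp shape (ISO-like); real logs vary widely.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")
# Assumed marking convention for fields: name=value with no spaces.
PAIR = re.compile(r"(\w+)=(\S+)")


def read_increment(log_text, offset):
    """Return the text added since `offset`, plus the new offset."""
    return log_text[offset:], len(log_text)


def split_events(chunk):
    """Start a new event at each timestamped line; other lines continue the previous event."""
    events = []
    for line in chunk.splitlines():
        if TIMESTAMP.match(line) or not events:
            events.append(line)
        else:
            events[-1] += "\n" + line
    return events


def extract_fields(event):
    """Search-time extraction of clearly marked name=value pairs."""
    return dict(PAIR.findall(event))


log = (
    "2009-10-18 16:01:06 user=alice action=login\n"
    "2009-10-18 16:01:09 error=timeout\n"
    "  at handler line 42\n"
)
chunk, offset = read_increment(log, 0)
events = split_events(chunk)
fields = [extract_fields(e) for e in events]
print(len(events), fields)
```

<p>Note that the second event spans two lines, which is the point of not treating line boundaries as event boundaries. Nothing is extracted until <code>extract_fields</code> is called, mirroring the search-time (rather than index-time) approach described above.</p>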
<p style="margin-bottom: 0in;">Given its text search engine, Splunk does – well, it does text searches. And it stores searches, so they can be used for alerting or reporting. Indeed, Splunk persists and presumably updates results to stored searches, in a rough analog to materialized views.</p>
<p style="margin-bottom: 0in;">Apparently, Splunk&#8217;s indexing is typically done via MapReduce jobs. I don&#8217;t know whether any actual Splunk searches are also done via MapReduce; surely they aren&#8217;t all, given the discussion of a near-real-time alerting engine and so on. Splunk fondly believes its MapReduce is an order of magnitude faster than SQL (I didn&#8217;t ask which SQL engines Splunk has in mind when they say this), and 5-10X faster than Hadoop. One efficiency trick is to look ahead and do Reduces in place where possible. This seems to be done automatically in the execution plan, à la Aster&#8217;s SQL-MapReduce, rather than having to be hand-coded. Splunk says its software can “easily” index 100-200 gigabytes of data per day on a commodity 8-core server, while maintaining an active search load, and 300-400 gigabytes are doable.</p>
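<p style="margin-bottom: 0in;">I don&#8217;t know how Splunk implements its "Reduces in place" trick. One common reading is local pre-aggregation, the idea behind what Hadoop calls a combiner: each node reduces its own Map output before anything is shipped to the final Reduce. The Python sketch below illustrates that generic idea only; the log format, partitions, and function names are hypothetical.</p>

```python
from collections import Counter


def map_with_local_reduce(partition):
    """Map one partition and reduce it in place, keeping one partial count per key."""
    partial = Counter()
    for line in partition:
        partial[line.split()[-1]] += 1  # hypothetical format: status is the last token
    return partial


def final_reduce(partials):
    """Merge the per-partition partial counts into global totals."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Two hypothetical partitions, as if held on two different nodes.
partitions = [
    ["GET /a 200", "GET /b 404"],
    ["POST /a 200", "GET /c 200"],
]
partials = [map_with_local_reduce(p) for p in partitions]
totals = final_reduce(partials)

shipped = sum(len(p) for p in partials)  # pairs sent to the final Reduce
print(dict(totals), shipped)
```

<p style="margin-bottom: 0in;">Four log lines produce only three partial pairs to ship here; on real log volumes with few distinct keys, the reduction in data movement would be far larger.</p>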
<p style="margin-bottom: 0in;">Splunk&#8217;s capabilities right now in tabular-style analytics seem to be limited to a command-line report builder, plus a GUI wizard that generates the command line. A few users have asked for support of third-party business intelligence tools, but Splunk hasn&#8217;t provided that yet. Nor can I find much evidence of ODBC/JDBC drivers for Splunk. But then, I have trouble understanding how Splunk could provide flexible and robust reporting unless it tokenized and indexed specific fields more aggressively than I think it now does.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/18/technical-introduction-to-splunk/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
	</channel>
</rss>

