<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS2 -- DataBase Management System Services &#187; Web analytics</title>
	<atom:link href="http://www.dbms2.com/category/applications/web-analytics/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Thu, 18 Mar 2010 05:19:19 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Three broad categories of data</title>
		<link>http://www.dbms2.com/2010/01/17/three-broad-categories-of-data/</link>
		<comments>http://www.dbms2.com/2010/01/17/three-broad-categories-of-data/#comments</comments>
		<pubDate>Sun, 17 Jan 2010 15:31:24 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1421</guid>
		<description><![CDATA[People often try to draw a distinction between:

Traditional data of the sort 	that&#8217;s stored in relational databases, aka “structured.”
Everything else, aka 	“unstructured” or “semi-structured” or “complex.”

There are plenty of problems with these formulations, not the least of which is that the supposedly “unstructured” data is the kind that actually tends to have interesting internal structures. [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">People often try to draw a distinction between:</p>
<ul>
<li>Traditional data of the sort 	that&#8217;s stored in relational databases, aka “structured.”</li>
<li>Everything else, aka 	“unstructured” or “semi-structured” or “complex.”</li>
</ul>
<p style="margin-bottom: 0in;">There are plenty of problems with these formulations, not the least of which is that the supposedly “unstructured” data is the kind that actually tends to have interesting internal structures. But of the many reasons why these distinctions don&#8217;t tend to work very well, I think the most important one is that:</p>
<p><strong>Databases shouldn&#8217;t be divided into just two categories. </strong><span style="font-weight: normal;"> Even as a rough-cut approximation, </span><strong>they should be divided into three,</strong><span style="font-weight: normal;"> namely:</span></p>
<ul>
<li><strong>Human/Tabular</strong> data &#8211; i.e., human-generated data that fits well into relational tables or arrays</li>
<li><strong>Human/Nontabular</strong> data &#8212; i.e., all other data generated by humans</li>
<li><strong>Machine-Generated</strong> data</li>
</ul>
<p style="margin-bottom: 0in;">Even that trichotomy is grossly oversimplified, for reasons such as:</p>
<ul>
<li>These categories overlap.</li>
<li>There are kinds of data that get 	into fuzzy border zones.</li>
<li>Not all data in each category has 	all the same properties.</li>
</ul>
<p style="margin-bottom: 0in;">But at least as a starting point, I think this basic categorization has some value.<span id="more-1421"></span></p>
<p style="margin-bottom: 0in;">By <strong>human-generated data that fits well into relational tables or arrays,</strong> what I really mean is: <strong>the input from most conventional kinds of transactions</strong> – purchase/sale, inventory/manufacturing, employment status change, etc. This is the core data managed by OLTP relational DBMS everywhere. It is also the core data in analytic relational or MOLAP databases. The vast majority of what we think or know about “database management” applies primarily to data of this kind, in large part because of two fundamental properties of this information:</p>
<ul>
<li>It is meaningful to contemplate 	this data as being 100% accurate and complete (even if that goal is 	difficult to achieve in the real world).</li>
<li>This data is precise – i.e., one 	can check predicates against it and (give or take regrettable data 	imperfections) get inarguable yes/no answers.</li>
</ul>
<p style="margin-bottom: 0in;">For most enterprises, this is the most important data they have. It was created as a result of expensive business activities. It deals directly with money, employees, physical goods, and the rest of the things that make an enterprise go. It can be fruitfully analyzed in ever more ways, which is why it should never be thrown out or even entirely relegated to tape, now that data warehouse software, hardware, and storage have become so cheap. (“Disk is the new tape.”) And because of the importance of both preserving and accessing it, it should often be stored in multiple copies – OLTP, data warehouse, data mart, in-memory analytics, near-line quasi-archive, MOLAP cubes (if you must) and so on, plus of course replicas for high throughput and availability.</p>
<p style="margin-bottom: 0in;">But <strong>humans generate many other kinds of data as well,</strong> especially in a form directly suitable for <strong>communication</strong> – text (in many formats), documents (text or otherwise), pictures, videos, etc. <a href="../2005/12/09/relational-dbms-versus-text-data/">Traditional relational databases are a poor home for this kind of data</a> because:</p>
<ul>
<li>This data often deals with 	opinions or aesthetic judgments – there is little concept of 	perfect accuracy.</li>
<li>Similarly, there is little concept 	of perfect completeness.</li>
<li>There&#8217;s also little concept of 	perfectly, unarguably accurate query results – different people 	will have different opinions as to what comprises good results for a 	search.</li>
<li>Queries don&#8217;t lend themselves to 	binary answers; rather, documents can have differing degrees of 	relevancy.</li>
</ul>
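<p style="margin-bottom: 0in;">To make the contrast with tabular data concrete, here is a minimal sketch. The <code>Sale</code> record, the sample documents, and the term-overlap scoring are all illustrative assumptions, not how any particular DBMS or search engine works. A predicate against a transaction record returns a plain yes or no, while a text query can only rank documents by degree of relevance.</p>

```python
from dataclasses import dataclass


@dataclass
class Sale:
    # Hypothetical transaction record, standing in for a relational row.
    customer: str
    amount: float


def exceeds(sale: Sale, threshold: float) -> bool:
    """Tabular-style predicate: the answer is an inarguable yes or no."""
    return sale.amount > threshold


def relevance(document: str, query: str) -> float:
    """Toy relevance score: the fraction of query terms found in the document."""
    doc_terms = set(document.lower().split())
    query_terms = query.lower().split()
    if not query_terms:
        return 0.0
    hits = sum(1 for term in query_terms if term in doc_terms)
    return hits / len(query_terms)


sales = [Sale("acme", 1200.0), Sale("globex", 80.0)]
big_sales = [s.customer for s in sales if exceeds(s, 1000.0)]

docs = [
    "the new warehouse appliance is fast but expensive",
    "our warehouse is cheap and fast",
    "lunch menu for friday",
]
query = "fast cheap warehouse"
# Documents are ordered by score rather than filtered into match / no match.
ranked = sorted(docs, key=lambda d: relevance(d, query), reverse=True)
```

<p style="margin-bottom: 0in;">Even in this toy version, a different scoring function would give a different and equally defensible ordering, which is the point of the bullets above.</p>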
<p style="margin-bottom: 0in;">Systems for managing this kind of data are much less advanced than relational database managers. Nobody knows how to get all the information out of a text document, or query all of it if they could, and the story is even worse for non-text examples. The systems that give the best query results aren&#8217;t necessarily the same ones that have the best database administration features. Basically, this area is still a mess, and it&#8217;s a mess that consumes a huge fraction of all the data storage products sold today.</p>
<p style="margin-bottom: 0in;">But give or take questions of storage efficiency and deduplication, if humans created that kind of data, they put a lot of effort into it, so it&#8217;s worth keeping. Besides, compliance regulations commonly mandate that we do so – except, perhaps, when they mandate that we throw it away.</p>
<p style="margin-bottom: 0in;"><strong>Machine-generated data</strong> is a whole other can of worms. Paradigmatic examples of what I mean by “machine-generated data” include:</p>
<ul>
<li>Computer, network, and other 	equipment logs</li>
<li>Satellite and similar telemetry 	(whether for espionage or science)</li>
<li>Location data such as RFID chip 	readings, GPS system output, etc.</li>
<li>Temperature and other 	environmental sensor readings</li>
<li>Sensor readings from factories, 	pipelines, etc.</li>
<li>Output from many kinds of medical 	device, in hospitals and (increasingly) homes alike</li>
</ul>
<p style="margin-bottom: 0in;">Unlike human-generated data, whose growth is constrained by macro factors such as population and total level of economic activity, <strong>machine-generated data will continue to grow as fast as Moore&#8217;s Law lets it. </strong><span style="font-weight: normal;">That fact has two profound consequences:</span></p>
<ul>
<li><strong>It is unrealistic to hope ever 	to keep most or all machine-generated data,</strong><span style="font-weight: normal;"> whereas I think that&#8217;s exactly what should and will happen with human-generated data</span></li>
<li><span style="font-weight: normal;">Before 	long, </span><strong>most data (by volume) will be machine-generated</strong></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-weight: normal;">And so it is not really an exaggeration to say that <strong>machine-generated data is the future of data management.</strong></span></p>
<p style="margin-bottom: 0in;"><span style="font-weight: normal;">I&#8217;d like to close this long post by immediately pointing out some of the flaws in this simple trichotomy. One obvious gray area lies in<strong> hybrid human/machine-generated data,</strong> three big examples of which are:</span></p>
<ul>
<li><span style="font-weight: normal;">Web 	clickstreams</span></li>
<li><span style="font-weight: normal;">Call 	detail records (CDR)</span></li>
<li><span style="font-weight: normal;">Stock 	trades</span></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-weight: normal;">In all three cases, we are quickly getting to the point where this data is preserved in its entirety (even if the network event data associated with the web logs is reduced before storage). And in each case it fits pretty well into RDBMS, although Hadoop has a role to play as well. So pretending it&#8217;s purely human-generated probably isn&#8217;t all that misleading.<br />
</span>
</p>
<p style="margin-bottom: 0in;"><span style="font-weight: normal;">Another gray area lies in text that gets linguistically processed – i.e. via <a href="http://www.texttechnologies.com/2007/12/23/text-mining-myths-realities/" onclick="javascript:pageTracker._trackPageview('/www.texttechnologies.com');">text-mining</a> tools – with the output placed into a relational database. I don&#8217;t immediately see a workaround for that flaw in my labeling scheme.  So let&#8217;s just say no taxonomy is perfect.*</span></p>
<p style="margin-bottom: 0in;"><em><span style="font-weight: normal;">*Come to think of it, that&#8217;s one of the problems holding back text-mining technology. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </span></em></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;"><span style="font-weight: normal;">And of course some of the <a href="../2009/12/12/legit-nosql-key-value-store/">NoSQL</a> folks would note that I was oversimplifying when I tied my first category specifically to relational DBMS. So would the folks at <a href="../2010/01/15/intersystems-cache-highlights/">Intersystems</a>.</span></span></p>
<p style="margin-bottom: 0in; font-style: normal; font-weight: normal;">But the biggest oversimplification stems from this:</p>
<p style="margin-bottom: 0in;"><span style="font-weight: normal;">As Mike Stonebraker* and I argued a couple of years ago, I really <a href="../2008/04/10/my-own-data-management-software-taxonomy/">think that database management technologies should be divided into 10+ categories.</a> </span></p>
<p style="margin-bottom: 0in;"><em><span style="font-weight: normal;">*Note: The links to Stonebraker&#8217;s own posts will be broken until Vertica&#8217;s webmaster gets his/her act together. But you can find them under other URLs via web search.</span></em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/01/17/three-broad-categories-of-data/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>A framework for thinking about data warehouse growth</title>
		<link>http://www.dbms2.com/2009/12/07/data-warehouse-volume-growth/</link>
		<comments>http://www.dbms2.com/2009/12/07/data-warehouse-volume-growth/#comments</comments>
		<pubDate>Mon, 07 Dec 2009 13:50:47 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Application areas]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Storage]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Text]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1278</guid>
		<description><![CDATA[There are only three ways that the amount of data stored in data warehouses can grow:

The same kinds of data are 	stored as before, with more being added over time.
The same kinds of data are stored 	as before, but in more detail.
New kinds of data are 	stored.

The first of those three ways doesn&#8217;t lead to [...]]]></description>
			<content:encoded><![CDATA[<p>There are only three ways that the amount of data stored in data warehouses can grow:</p>
<ul>
<li><strong>The same kinds of data are 	stored as before, </strong>with more being added over time.</li>
<li>The same kinds of data are stored 	as before, but in <strong>more detail.</strong></li>
<li><strong>New kinds</strong> of data are 	stored.</li>
</ul>
<p style="margin-bottom: 0in;"><span id="more-1278"></span>The first of those three ways doesn&#8217;t lead to dramatic growth. If a data warehouse goes up from 5 years of data to 6, then its overall size will grow by a little over 20%. (How little depends on what the underlying business growth is – i.e., on how many more business events you have next year than you had 3 years ago.) That&#8217;s almost certainly going to be handled well by whatever technology manages your data warehouse today, given that:</p>
<ul>
<li>Chips are still subject to 	something resembling Moore&#8217;s Law.</li>
<li>Disk capacity is still subject to 	Kryder&#8217;s Law, which is like Moore&#8217;s Law but with yet faster growth 	rates.</li>
<li>DBMS software gets more performant 	over time.</li>
</ul>
<p style="margin-bottom: 0in;">So <strong>the cost of managing your same-as-before data will go down every year,</strong> even as the volume of that data grows.</p>
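<p style="margin-bottom: 0in;">The 5-years-to-6 arithmetic can be sketched as follows. This assumes yearly data volume grows geometrically with the business; the function name and the 10% growth rate are illustrative assumptions, not figures from any real warehouse.</p>

```python
def warehouse_growth(years_kept: int, business_growth: float) -> float:
    """Fractional size increase from adding one more year of data.

    Assumes yearly volume grows geometrically at `business_growth`
    (0.10 means 10% more business events each year).
    """
    # Relative volume of each year, oldest first, plus the year being added.
    yearly = [(1.0 + business_growth) ** y for y in range(years_kept + 1)]
    existing = sum(yearly[:years_kept])
    return yearly[years_kept] / existing


flat = warehouse_growth(5, 0.0)      # no business growth: exactly 20%
growing = warehouse_growth(5, 0.10)  # assumed 10% yearly growth: about 26%
```

<p style="margin-bottom: 0in;">With no business growth the sixth year adds exactly one fifth; under the assumed 10% growth rate it adds roughly 26%, which is still modest next to yearly hardware improvements.</p>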
<p style="margin-bottom: 0in;">True, <a href="../2005/11/13/breaking-the-disk-speed-barrier/">disk rotation speeds have only increased 12.5 times since the Eisenhower Administration</a>. But <a href="../2009/10/25/teradata-hardware-strategy-and-tactics/">solid-state drives (SSDs) are getting practical for data warehousing</a> fast, so even that bottleneck eventually will get swept away. And since what we&#8217;re discussing is, basically, the first and hence presumably highest-value data to be warehoused, it&#8217;s apt to wind up on SSDs before some other kinds of data warrant that treatment.  So it&#8217;s the two other factors that drive the greatest data warehouse growth.</p>
<p style="margin-bottom: 0in;">As costs go down, the wisdom of keeping <strong>detailed data</strong> goes up. I&#8217;d go so far as to say that <strong>every piece of data generated by a human being should be preserved and kept online,</strong> legal and privacy considerations permitting.* Most forms of capital-, labor-, and/or location-based competitive advantage are being commoditized and/or globalized away. But information remains a unique corporate asset.  Don&#8217;t discard it lightly.</p>
<p style="margin-bottom: 0in;"><em>*Unless there&#8217;s an explicit law mandating data destruction, legal considerations </em>should <em>permit. The idea “Let&#8217;s destroy something of irreplaceable value today, against the possibility we might be brought to judgment tomorrow” is both morally and pragmatically weird. Privacy, however, may be a different matter.</em></p>
<p style="margin-bottom: 0in; font-style: normal;">What that means in practice is that “disk is the new tape.” No-apologies performance can be had on data warehouse systems for <a href="http://www.dbms2.com/2009/07/30/the-netezza-price-point/" >$20,000/terabyte</a> or less – perhaps even <a href="http://www.dbms2.com/2009/10/19/greenplum-free-single-node-edition/" >a lot less</a>. Tolerable performance may cost 3-4X less than that. I think a lot of the growth in data warehouse volumes is of exactly this kind.</p>
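<p style="margin-bottom: 0in; font-style: normal;">As a rough worked example of those price points, here is a sketch that assumes a hypothetical 100-terabyte warehouse and treats the $20,000/terabyte figure as a flat per-terabyte rate, which real pricing will not be.</p>

```python
COST_PER_TB_FAST = 20_000  # dollars per terabyte, the upper figure quoted above


def system_cost(terabytes: float, cost_per_tb: float) -> float:
    """Total cost under a flat per-terabyte rate (a simplifying assumption)."""
    return terabytes * cost_per_tb


size_tb = 100  # hypothetical warehouse size

fast = system_cost(size_tb, COST_PER_TB_FAST)
# "Tolerable" performance at 3-4X less per terabyte.
tolerable_high = system_cost(size_tb, COST_PER_TB_FAST / 3)
tolerable_low = system_cost(size_tb, COST_PER_TB_FAST / 4)
```

<p style="margin-bottom: 0in; font-style: normal;">Under those assumptions the no-apologies system comes to $2 million, and the tolerable one to somewhere between $500,000 and about $667,000.</p>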
<p style="margin-bottom: 0in; font-style: normal;">Ultimately, however, the greatest growth in data warehouse volumes will come from <strong>new kinds of data,</strong> especially data that is partly or wholly <strong>machine-generated.</strong><span> Moore&#8217;s Law applied to sensor chips tells us that data creation will grow just as fast as the data storage capacity. And thus </span><strong>we will be throwing away most machine-generated data forever.</strong><span> But what we keep will grow – well, it probably will grow at Moore&#8217;s/Kryder&#8217;s Law speeds.</span></p>
<p style="margin-bottom: 0in; font-style: normal;"><span>That&#8217;s not to say new kinds of data are all high-volume/machine-generated. Back in 2005, I wrote </span><span><a href="http://www.computerworld.com/s/article/103054/More_Data_Makes_Your_Business_Grow?taxonomyId=9&amp;pageNumber=2" onclick="javascript:pageTracker._trackPageview('/www.computerworld.com');">two</a> <a href="http://blogs.computerworld.com/node/512" onclick="javascript:pageTracker._trackPageview('/blogs.computerworld.com');">pieces</a></span><span> for </span><em><span>Computerworld</span></em><span> advocating aggressive pursuit of new data sources, and the examples I mentioned were:</span></p>
<ul>
<li><span>Loyalty cards, especially 	in gaming</span></li>
<li>Location-based analytics</li>
<li>Extra customer feedback (e.g., 	opinion surveys)</li>
<li>Price/offer testing</li>
<li>Text mining 	in general</li>
<li>Medical 	records</li>
</ul>
<p style="margin-bottom: 0in;">Today I&#8217;d add (among others):</p>
<ul>
<li>RFID</li>
<li>The raw 	output from medical test devices</li>
<li>Sensors up and down the energy supply chain</li>
</ul>
<p style="margin-bottom: 0in; font-style: normal;">But some of those older, low-data-volume ideas still head my list of low-hanging analytic fruit.</p>
<p style="margin-bottom: 0in; font-style: normal;"><span>One more complication – these buckets I&#8217;m outlining are less than precise. For example:</span></p>
<ul>
<li><span>Telecom 	CDRs (Call Detail Records) are machine-generated from a seed of 	human activity. They have long been stored, but now are being kept 	in much more detail. This is why telecommunications is one of the 	top markets for data warehouse technology.</span></li>
<li><span>Stock 	trade data used to be based on human decisions. Now most of it is 	just machines buying and selling from each other. Either way, 	increasingly many investment institutions want to keep 	100-terabyte-scale databases of complete historical trade detail. 	And that is why financial services is another huge market for data 	warehouse technology.</span></li>
<li><span>Not long ago, web and network event logs didn&#8217;t even exist, or were tiny where they did. Now they fill the largest known commercial databases, at firms such as </span><span><a href="http://www.dbms2.com/2009/10/01/yahoos-decapetabyte-data-warehousinghadoop/" >Yahoo</a>, <a href="http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/" >eBay</a>, and <a href="http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/" >Facebook</a>.</span><span> Even so, more is thrown away than kept, especially on the network event side, which is a multiple of the size of the pure clickstream data.</span></li>
<li><span>We 	don&#8217;t know exactly what all data intelligence agencies collect from 	telemetry, from monitoring commercial telecommunication traffic, and 	so on. But they&#8217;re surely throwing the vast majority away, even as 	the small part they keep is </span><span><a href="http://www.dbms2.com/2009/09/30/facts-and-rumors/" >petabyte-scale</a>.</span></li>
</ul>
<p style="margin-bottom: 0in; font-style: normal;">But none of that interferes with my main points, which are:</p>
<ul>
<li><strong>Databases 	will continue to grow very quickly.</strong></li>
<li>One big driver 	is <strong>the increasing detail in which data is kept online.</strong></li>
<li>An even bigger 	driver will be <strong>the unending ability of machines to generate ever 	greater streams of at-least-somewhat interesting data.</strong></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/12/07/data-warehouse-volume-growth/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Webinar on MapReduce for complex analytics (Thursday, December 3, 10 am and 2 pm Eastern)</title>
		<link>http://www.dbms2.com/2009/12/02/mapreduce-for-complex-analytics-webina/</link>
		<comments>http://www.dbms2.com/2009/12/02/mapreduce-for-complex-analytics-webina/#comments</comments>
		<pubDate>Wed, 02 Dec 2009 20:57:50 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Data integration and middleware]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1267</guid>
		<description><![CDATA[The second in my two-webinar series for Aster Data will occur tomorrow, twice (both live), at 10 am and 2 pm Eastern time. The other presenters will be Jonathan Goldman, who was Principal Scientist at LinkedIn but now has joined Aster himself, and Steve Wooledge of Aster (playing host). Key links are:

Registration for tomorrow&#8217;s webinars
Replay [...]]]></description>
			<content:encoded><![CDATA[<p>The second in my two-webinar series for Aster Data will occur tomorrow, twice (both live), at 10 am and 2 pm Eastern time. The other presenters will be Jonathan Goldman, who was Principal Scientist at LinkedIn but now has joined Aster himself, and Steve Wooledge of Aster (playing host). Key links are:</p>
<ul>
<li>Registration for <a href="http://www.asterdata.com/wc_091203_masteringmapreduce/" onclick="javascript:pageTracker._trackPageview('/www.asterdata.com');">tomorrow&#8217;s webinars</a></li>
<li>Replay of the <a href="http://www.asterdata.com/masteringmapreduce2/" onclick="javascript:pageTracker._trackPageview('/www.asterdata.com');"> first webinar</a></li>
<li>My slides from the <a href="http://www.dbms2.com/2009/10/15/mapreduce-webinar-slides/" >first webinar</a></li>
</ul>
<p>The main subjects of the webinar will be:</p>
<ul>
<li>Some review of material from the first webinar (all three presenters)</li>
<li>Discussion of how MapReduce can help with three kinds of analytics:
<ul>
<li>Pattern matching (Jonathan will give detail)</li>
<li>Number-crunching (I&#8217;ll cover that, and it will be short)</li>
<li>Graph analytics (I haven&#8217;t written the slides yet, but my starting point will be some of the <a href="http://www.dbms2.com/2009/08/21/social-network-analysis-aka-relationship-analytics/" >relationship analytics</a> ideas we discussed in August)</li>
</ul>
</li>
</ul>
<p>Arguably, aspects of data transformation fit into each of those three categories, which may help explain why data transformation has been so prominent among the early applications of MapReduce.</p>
<p>As you can see from Aster&#8217;s title for the webinar (which they picked while I was on vacation), at least their portion will be focused on customer analytics, e.g. web analytics.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/12/02/mapreduce-for-complex-analytics-webina/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Boston Big Data Summit keynote outline</title>
		<link>http://www.dbms2.com/2009/11/23/boston-big-data-summit-keynote-outline/</link>
		<comments>http://www.dbms2.com/2009/11/23/boston-big-data-summit-keynote-outline/#comments</comments>
		<pubDate>Mon, 23 Nov 2009 06:25:50 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Archiving and information preservation]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[DBMS product categories]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Humor]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Market share]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Presentations]]></category>
		<category><![CDATA[Pricing]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Storage]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Theory and architecture]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1227</guid>
		<description><![CDATA[Last month, Bob Zurek asked me to give a talk on “Big Data”, where “big” is anything from a few terabytes on up, then moderate a panel on cloud computing. We agreed that I could talk just from notes, without slides. So, since I have them typed up, I&#8217;m posting them below.

The top two points [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Last month, Bob Zurek asked me to give a talk on <a href="http://www.dbms2.com/2009/10/09/presentations-upcoming/" >“Big Data”, where “big” is anything from a few terabytes on up</a>, then moderate a panel on cloud computing. We agreed that I could talk just from notes, without slides. So, since I have them typed up, I&#8217;m posting them below.</p>
<p><span id="more-1227"></span></p>
<p style="margin-bottom: 0in;">The top two points from Q&amp;A probably were:</p>
<ul>
<li><strong>Big Data and the cloud actually 	have relatively little to do with each other,</strong> <a href="http://www.dbms2.com/2009/10/30/aster-data-application-server-ncluster/" >a few exceptions</a> notwithstanding, especially if the data is in a shared-nothing DBMS 	(as opposed to, say, a MapReduce-oriented file cluster). Two 	principal reasons are:
<ul>
<li>Redistributing data from node to 	node is a little slow, undermining some of the elasticity benefits 	of the cloud.</li>
<li><a href="http://www.dbms2.com/2009/05/29/sneakernet-to-the-cloud/" >Getting data into the cloud in the 	first place is a lot slow</a>.</li>
</ul>
</li>
<li><strong>The NoSQL movement is a lot like 	the Ron Paul campaign</strong> &#8212; it consists of people who are dissatisfied 	with the status quo, whose dissatisfaction has a lot to do with 	insufficient liberty and/or excessive expenditure, and who otherwise 	don&#8217;t have a whole lot in common with each other.</li>
</ul>
<p style="margin-bottom: 0in;">Anyhow, here are my notes for the talk, edited in just a couple of places for readability or linkage.</p>
<p style="margin-bottom: 0in;"><strong>Quick introduction</strong></p>
<ul>
<li>Big Data vs. cloud</li>
<li>How big is Big Data?</li>
<li>At the low end of that range, 	there&#8217;s little you can&#8217;t do with conventional technology if you 	have:
<ul>
<li>An unlimited budget for hardware</li>
<li>An unlimited budget for software</li>
<li>An unlimited budget for people, 	especially Oracle DBAs</li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;"><strong>Big Data in OLTP</strong></p>
<ul>
<li>Hard-core OLTP
<ul>
<li>Focus of DBMS technology for a long time</li>
<li>Big budgets because each 	transaction has significant value</li>
<li>Tough to get users to change 	technologies</li>
</ul>
</li>
<li>Lighter-weight OLTP
<ul>
<li>Classic example = web companies
<ul>
<li>Big ones &#8212;  retail-oriented ones 	(eBay, Amazon) partially excepted &#8212; <a href="http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/" >rolled their own technology 	stacks</a></li>
<li>Reluctant to give money to anybody
<ul>
<li>Open source, etc.</li>
</ul>
</li>
</ul>
</li>
<li>Difficulty finding market
<ul>
<li>Product vs. feature
<ul>
<li>Clustering/HA/DR/whatever</li>
<li>Ditto cloud enablement</li>
</ul>
</li>
<li>True products haven&#8217;t found much 	traction yet</li>
</ul>
</li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;"><strong>Analytic Big Data use cases</strong></p>
<ul>
<li>Kinds of data for analytics
<ul>
<li>More of same != big</li>
<li>More detail and/or new kinds
<ul>
<li>Complete data sets</li>
<li>Transactions</li>
<li>Call details</li>
<li>Tick/trade history</li>
<li>Web clickstreams</li>
<li>Network event logs</li>
<li>Other machine-generated data</li>
<li>CAM bottom line
<ul>
<li>Anything human-generated should 	and will be retained in its entirety</li>
<li>Quantities of machine-generated 	data retained should and will grow roughly in line w/ computing cost 	reductions (Moore&#8217;s Law, etc.)</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
<li>Analytic uses of Big Data
<ul>
<li>Analytics is mainly about three 	things
<ul>
<li>Problem detection</li>
<li>Customer relationship improvement
<ul>
<li>(Those overlap when the customer 	relationship is bad)</li>
</ul>
</li>
<li>Financial statements on steroids</li>
</ul>
</li>
</ul>
<ul>
<li>Main kinds of analytics
<ul>
<li>What BI vendors traditionally sell
<ul>
<li>General reporting and dashboards</li>
<li>Ad-hoc query (now driven from 	those reports and dashboards)</li>
<li>Planning (allegedly integrated 	with BI)</li>
</ul>
</li>
<li>Research
<ul>
<li>Ad hoc relational query (worth 	mentioning twice because it drives so much of the market)</li>
<li>Data mining</li>
<li>Most web search and web mining</li>
</ul>
</li>
<li>Operational/near-real-time</li>
<li>Archiving/compliance</li>
</ul>
</li>
<li>What gets Big?
<ul>
<li>Mainly research and archiving</li>
<li>But when reporting or operational 	get Big, you have really interesting computing problems</li>
</ul>
</li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;"><strong>Technology issues and trends</strong></p>
<ul>
<li>Moore&#8217;s Law
<ul>
<li>CPUs &#8212; All about cores, hence 	parallelism is key</li>
<li>RAM</li>
<li>SSDs – hence replace disks</li>
<li>Sensors – hence generate lots 	more data</li>
</ul>
</li>
<li>Kryder&#8217;s Law
<ul>
<li>But <a href="http://www.dbms2.com/2005/11/13/breaking-the-disk-speed-barrier/" >rotational speeds up only 	12.5X since Eisenhower Administration</a></li>
<li>Hence solid-state memory (or RAM) 	will soon take over</li>
</ul>
</li>
<li>In the meantime, I/O bottlenecks have had to be beaten
<ul>
<li>Hence sequential scans</li>
<li>Hence <a href="http://www.dbms2.com/2007/03/26/index-light-mpp-data-warehouse-appliances/" >index-light</a> architectures</li>
<li>Hence columnar</li>
</ul>
</li>
<li>DBMS “overhead”
<ul>
<li>Raw license and maintenance fees – 	software increasing fraction of total</li>
<li>OLTP vestiges – locking and all 	that</li>
<li>DBAs
<ul>
<li>People costs = huge fraction of 	total</li>
<li>Index-lightness addresses</li>
<li>So does appliance</li>
</ul>
</li>
<li>Many people don&#8217;t really know how to 	write SQL</li>
</ul>
</li>
<li>Configuration
<ul>
<li>Appliance/tightly-balanced
<ul>
<li>Netezza</li>
<li>Teradata earlier</li>
<li>Greenplum/Sun</li>
<li>Oracle</li>
<li>IBM</li>
<li>Microsoft/Madison</li>
</ul>
</li>
<li>Commodity/do what you want
<ul>
<li>Vertica</li>
<li>Greenplum now</li>
<li>Infobright, Aster and others</li>
<li>MapReduce-oriented file systems</li>
</ul>
</li>
<li><a href="http://www.dbms2.com/2009/10/25/data-warehouse-balanced-hardware-configuration/" >Extreme rigidity is silly</a>
<ul>
<li><a href="http://www.dbms2.com/2009/10/25/teradata-hardware-strategy-and-tactics/" >Teradata, Oracle have both 	signaled moving to more modularity</a></li>
<li>Big driver of that = heterogeneous 	storage
<ul>
<li>Cheap disk</li>
<li>Expensive disk</li>
<li>Solid-state</li>
<li>RAM</li>
</ul>
</li>
</ul>
<ul>
<li>CPU/storage ratio is even more of a 	driver</li>
</ul>
</li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;"><strong>Theoretically defensible ways to segment the market</strong></p>
<ul>
<li><a href="http://www.dbms2.com/2009/09/10/analytic-speed-latency/" >Latency requirements</a>
<ul>
<li>High availability and low latency 	go together</li>
</ul>
</li>
<li>Query types
<ul>
<li>Simultaneous users for same</li>
</ul>
</li>
<li>Database size</li>
<li>Budget</li>
</ul>
<p style="margin-bottom: 0in;"><strong>Actual segments right now</strong></p>
<ul>
<li><a href="http://www.dbms2.com/2009/08/24/teradatas-active-enterprise-data-warehouse-story/" >Utter ADW/EDW</a></li>
<li>Data mart
<ul>
<li>Size</li>
<li>Naturally columnar vs. naturally 	row-based</li>
</ul>
</li>
<li>Operational/frontline</li>
<li>Less dramatic/smaller EDW</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/11/23/boston-big-data-summit-keynote-outline/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Three big myths about MapReduce</title>
		<link>http://www.dbms2.com/2009/10/18/three-big-myths-about-mapreduce/</link>
		<comments>http://www.dbms2.com/2009/10/18/three-big-myths-about-mapreduce/#comments</comments>
		<pubDate>Sun, 18 Oct 2009 16:14:37 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Michael Stonebraker]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1135</guid>
		<description><![CDATA[Once again, I find myself writing and talking a lot about MapReduce.  But I suspect that MapReduce-related conversations would go better if we overcame three fairly common MapReduce myths:

MapReduce is something very new
MapReduce involves strict 	adherence to the Map-Reduce programming paradigm
MapReduce is a single technology

So let&#8217;s give it a try.
When Dave DeWitt and Mike [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Once again, I find myself writing and talking a lot about MapReduce.  But I suspect that MapReduce-related conversations would go better if we overcame three fairly common MapReduce myths:</p>
<ul>
<li>MapReduce is something very new</li>
<li>MapReduce involves strict 	adherence to the Map-Reduce programming paradigm</li>
<li>MapReduce is a single technology</li>
</ul>
<p style="margin-bottom: 0in;"><span id="more-1135"></span>So let&#8217;s give it a try.</p>
<p style="margin-bottom: 0in;">When Dave DeWitt and Mike Stonebraker leveled <a href="../2008/01/18/the-great-mapreduce-debate/">their famous blast at MapReduce</a>, many people thought they overstated their case. But one part of their story – one that both Mike and Dave say was most central to their case – was never effectively refuted, namely the claim that these ideas aren&#8217;t particularly new. I haven&#8217;t actually read enough computer science literature to have an independent opinion on that issue. But I&#8217;ll say this – claims from companies such as <a href="../2009/10/18/introduction-to-sensage/">SenSage</a>, <a href="../2009/10/06/oracle-mapreduce/">Oracle</a>, or <a href="../2009/10/18/technical-introduction-to-splunk/">Splunk</a> that “We&#8217;ve been doing MapReduce all along” seem pretty credible to me.</p>
<p style="margin-bottom: 0in;">True, what those companies were doing may not have looked exactly like the instant-classic MapReduce programming paradigm. But the same is true of many things almost everybody would agree count as MapReduce.  In particular, it is often not the case that you alternate Map and Reduce steps, each of whose outputs is a set of simple &lt;Key, Value&gt; pairs, with data redistributed based on Key at every step.</p>
<p style="margin-bottom: 0in;">Here are some examples of what I mean, drawn from <a href="http://www.asterdata.com/blog/index.php/2009/10/15/mastering-mapreduce/" onclick="javascript:pageTracker._trackPageview('/www.asterdata.com');">my recent MapReduce webinar</a>.</p>
<ul>
<li>If you do text indexing in 	MapReduce, your goal is to wind up with a text index. So at some 	point you Reduce to a pair &lt;WordName, {all the (DocumentID, 	offset) pairs for the whole corpus, suitably ordered}&gt;.  That&#8217;s a 	heckuva compound “Value”.</li>
<li>The goal of data mining is usually to estimate a rather small number of parameters based on a large overall data set, often – depending on algorithm – in the form of a single vector. When you do that in MapReduce, you partition data among nodes and calculate something on each node that is structured more or less like your final vector. So when it comes time for the Reduce, you just ship all of your vectors – one per node – to a single Reduce node, and do the appropriate math. Redistribution based on Key would be quite pointless.</li>
<li>When you sessionize clickstream 	logs in MapReduce, you may have just as many output records as input 	records. However, they now are reformatted, and might have a 	SessionID appended. In those cases, Reduce isn&#8217;t doing much by the 	way of reduction.</li>
<li>And as happens in some <a href="../2009/08/04/verticas-version-of-mapreduce-integration/">Vertica-Hadoop</a> use cases around mortgage trading, sometimes MapReduce can even make data sets vastly larger.</li>
</ul>
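<p style="margin-bottom: 0in;">To make the first two examples concrete, here is a minimal single-process sketch in Python. It is not taken from any vendor&#8217;s implementation; every function name is hypothetical, and the in-memory grouping merely stands in for a real shuffle. The point is the shape of the data: the text index Reduce emits one compound Value per word, and the data mining case ships one partial result per node to a single Reduce with no redistribution by Key.</p>

```python
from collections import defaultdict

def map_postings(doc_id, text):
    """Map step: emit (word, (doc_id, offset)) for each word in one document."""
    for offset, word in enumerate(text.lower().split()):
        yield word, (doc_id, offset)

def reduce_postings(word, postings):
    """Reduce step: the Value is the whole ordered posting list for the word."""
    return word, sorted(postings)

def build_index(corpus):
    """Run the map, group by key (the shuffle), then reduce, all in one process."""
    grouped = defaultdict(list)
    for doc_id, text in corpus.items():
        for word, posting in map_postings(doc_id, text):
            grouped[word].append(posting)
    return dict(reduce_postings(w, p) for w, p in grouped.items())

def node_partial(rows):
    """Per-node step for the data mining case: a partial (sum vector, count)."""
    sums = [sum(column) for column in zip(*rows)]
    return sums, len(rows)

def combine_partials(partials):
    """Single Reduce node: add up one partial per node and finish the math."""
    partials = list(partials)
    total = sum(count for _, count in partials)
    sums = [sum(column) for column in zip(*(s for s, _ in partials))]
    return [value / total for value in sums]
```

<p style="margin-bottom: 0in;">Here the &#8220;math&#8221; is just a mean vector, chosen as an assumption because it is the simplest estimate that can be assembled from per-node sums and counts; a real algorithm would put its own statistics in the partials.</p>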
<p style="margin-bottom: 0in; font-style: normal;">By no means do I think this is a weakness of the MapReduce programming paradigm. Rather, I think it&#8217;s a MapReduce strength. But it&#8217;s not quite the way MapReduce has been promoted and explained to the IT public.</p>
<p style="margin-bottom: 0in; font-style: normal;">Finally: MapReduce, as commonly conceived, spans two different – albeit closely related – technology domains:</p>
<ul>
<li>Parallel 	programming</li>
<li>Distributed 	data management</li>
</ul>
<p style="margin-bottom: 0in; font-style: normal;">For example, I imagine Greenplum&#8217;s and Vertica&#8217;s MapReduce/SQL combined syntaxes are very similar to each other. But Vertica&#8217;s data management implementation of MapReduce, which relies on Hadoop, is very different from Greenplum&#8217;s, which is tied into the Greenplum DBMS. Similarly, non-DBMS MapReduce implementations are commonly associated with distributed file systems – notably HDFS (Hadoop Distributed File System) or Google&#8217;s internal GFS (Google File System). In those systems, the parallel language execution part should be aware of how the distributed file management part works – but perhaps that awareness can be pretty lightweight.</p>
<p style="margin-bottom: 0in; font-style: normal;">Right now, this is a distinction pretty much without a difference. If you choose an implementation of MapReduce &#8212; like pure Hadoop (say in the Cloudera distribution) or Hadoop-Vertica or Aster Data&#8217;s SQL/MapReduce – you&#8217;re basically picking an entire technology stack. But those stacks are going to do a whole lot of changing and maturing in the near future – and as they do, it&#8217;s likely that projects will interact or even combine in all sorts of interesting ways.</p>
<p style="margin-bottom: 0in; font-style: normal;"><strong>Bottom line: There are a lot of different ways to exploit MapReduce-related technology.</strong></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/18/three-big-myths-about-mapreduce/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Technical introduction to Splunk</title>
		<link>http://www.dbms2.com/2009/10/18/technical-introduction-to-splunk/</link>
		<comments>http://www.dbms2.com/2009/10/18/technical-introduction-to-splunk/#comments</comments>
		<pubDate>Sun, 18 Oct 2009 16:01:06 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Structured documents]]></category>
		<category><![CDATA[Text]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1124</guid>
		<description><![CDATA[As noted in my other introductory post, Splunk sells software called Splunk, which is used for log analysis. These can be logs of various kinds, but for the purpose of understanding Splunk technology, it&#8217;s probably OK to assume they&#8217;re clickstream/network event logs. In addition, Splunk seems to have some aspirations of having its software used [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">As noted in <a href="http://www.dbms2.com/2009/10/18/general-introduction-to-splunk/" >my other introductory post</a>, Splunk sells software called Splunk, which is used for log analysis. These can be logs of various kinds, but for the purpose of understanding Splunk technology, it&#8217;s probably OK to assume they&#8217;re clickstream/network event logs. In addition, Splunk seems to have some aspirations of having its software used for general schema-free analytics, but that&#8217;s in early days at best.</p>
<p style="margin-bottom: 0in;">Splunk&#8217;s core technology indexes text and XML files or streams, especially log files. Technical highlights of that part include:<span id="more-1124"></span></p>
<ul>
<li>Splunk software both reads logs 	and indexes them. The same code runs both on the nodes that do the 	indexing and on machines that simply emit logs. However, in the 	latter case indexing is turned off. Thus, Splunk does not portray 	its software as “agentless.” However, it asserts that its 	agent-like software runs without “material” overhead.</li>
<li>The fundamental thing that Splunk 	looks at is an increment to a log – i.e., whatever has been added 	to the log since Splunk last looked at it.</li>
<li>Splunk tries to figure out what 	the individual entries are in a section of log it looks at.  In 	particular:
<ul>
<li>Time stamps are a big clue in this 	“inferencing” process, but they are not the be-all and end-all.</li>
<li>Nor are line boundaries, if logs 	are naturally broken up into lines. (Splunk threw that latter 	comment in as a shot at SenSage.)</li>
</ul>
</li>
<li>I get the impression that most 	Splunk entity extraction is done at search time, not at indexing 	time. Splunk says that, if a &lt;name, value&gt; pair is clearly 	marked, its software does a good job of recognizing same. Beyond 	that, fields seem to be specified by users when they define 	searches.</li>
<li>Splunk has a simple ILM 	(Information Lifecycle management) story based on time. I didn&#8217;t 	probe for details.</li>
</ul>
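<p style="margin-bottom: 0in;">As a rough illustration of the &#8220;increment&#8221; and entry-inference ideas above, here is a hedged Python sketch. It is emphatically not Splunk&#8217;s code: the function names, the single ISO-like timestamp pattern, and the byte-offset bookkeeping are all assumptions made for the example. It shows why line boundaries alone are not enough, since a line without a leading time stamp is folded into the entry before it.</p>

```python
import re

# Hypothetical timestamp shape (ISO-like); real logs vary widely.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")

def read_increment(path, last_offset):
    """Return whatever was appended since last_offset, plus the new offset."""
    with open(path, "rb") as log:
        log.seek(last_offset)
        data = log.read()
    return data.decode("utf-8", errors="replace"), last_offset + len(data)

def split_entries(increment):
    """Group lines into entries; a line that opens with a timestamp starts one."""
    entries = []
    for line in increment.splitlines():
        if TIMESTAMP.match(line) or not entries:
            entries.append(line)
        else:
            # Continuation line (e.g. a stack trace) joins the previous entry.
            entries[-1] += "\n" + line
    return entries
```

<p style="margin-bottom: 0in;">A caller would keep the returned offset between polls and pass each increment to the entry splitter. A production system would presumably also need to handle log rotation and many time stamp formats, which this sketch ignores.</p>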
<p style="margin-bottom: 0in;">Given its text search engine, Splunk does – well, it does text searches. And it stores searches, so they can be used for alerting or reporting. Indeed, Splunk persists and presumably updates results to stored searches, in a rough analog to materialized views.</p>
<p style="margin-bottom: 0in;">Apparently, Splunk&#8217;s indexing is typically done via MapReduce jobs. I don&#8217;t know whether any actual Splunk searches are also done via MapReduce; surely they aren&#8217;t all, given the discussion of a near-real-time alerting engine and so on. Splunk fondly believes its MapReduce is an order of magnitude faster than SQL (I didn&#8217;t ask which SQL engines Splunk has in mind when they say this), and 5-10X faster than Hadoop. One efficiency trick is to look ahead and do Reduces in place where possible. This seems to be done automatically in the execution plan, à la Aster&#8217;s SQL-MapReduce, rather than having to be hand-coded. Splunk says its software can “easily” index 100-200 gigabytes of data per day on a commodity 8-core server, while maintaining an active search load, and 300-400 gigabytes are doable.</p>
<p style="margin-bottom: 0in;">Splunk&#8217;s capabilities right now in tabular-style analytics seem to be limited to a command-line report builder, plus a GUI wizard that generates the command line. A few users have asked for support of third-party business intelligence tools, but Splunk hasn&#8217;t provided that yet. Nor can I find much evidence of ODBC/JDBC drivers for Splunk. But then, I have trouble understanding how Splunk could provide flexible and robust reporting unless it tokenized and indexed specific fields more aggressively than I think it now does.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/18/technical-introduction-to-splunk/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>General introduction to Splunk</title>
		<link>http://www.dbms2.com/2009/10/18/general-introduction-to-splunk/</link>
		<comments>http://www.dbms2.com/2009/10/18/general-introduction-to-splunk/#comments</comments>
		<pubDate>Sun, 18 Oct 2009 15:59:56 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Fox and MySpace]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Text]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1119</guid>
		<description><![CDATA[I dropped by log analysis software vendor Splunk a few weeks ago for a chat with Marketing VP Steve Sommer (whom some of you may know from Cognos and/or Informix), Product Management VP Christina Noren, and above all co-founder/CTO Erik Swan. Splunk turns out to be a pretty interesting company, from both business and technical standpoints. [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">I dropped by log analysis software vendor Splunk a few weeks ago for a chat with Marketing VP Steve Sommer (whom some of you may know from Cognos and/or Informix), Product Management VP Christina Noren, and above all co-founder/CTO Erik Swan. Splunk turns out to be a pretty interesting company, from both business and technical standpoints. For one thing, Splunk seems highly regarded by most people I mention it to.</p>
<p style="margin-bottom: 0in;">Splunk&#8217;s technical stories include:</p>
<ul>
<li>Text search over log files.</li>
<li>Business intelligence over text 	search. (That part sounds a lot like <a href="http://www.texttechnologies.com/2007/12/12/attivio-tries-to-do-it-all/" onclick="javascript:pageTracker._trackPageview('/www.texttechnologies.com');">Attivio</a>.)</li>
<li>MapReduce with schema flexibility 	and smart multi-stage execution plans. (That part sounds a lot like 	Aster Data.)</li>
</ul>
<p style="margin-bottom: 0in;">More on those in <a href="http://www.dbms2.com/2009/10/18/technical-introduction-to-splunk/" >a separate post</a>.</p>
<p style="margin-bottom: 0in;">Less technical Splunk highlights include:<span id="more-1119"></span></p>
<ul>
<li>Splunk has ~1200 paying customers, 	and is adding a couple hundred more per quarter.</li>
<li>Splunk has ~160 people.</li>
<li>~80% of Splunk sales are in North 	America.</li>
<li>Typical Splunk sales prices are in 	the $10-50K range, with an average around $25K, or maybe that 	average is a bit over $30K. Some Splunk deals are six- or even 	seven-figure.</li>
<li>Splunk is “quite profitable.”</li>
<li>Splunk&#8217;s eponymous product is 	priced according to how much data is indexed per day. If you index 	half a gigabyte of logs per day or less, Splunk is completely free. 	So, while Splunk is closed-source, there&#8217;s something of an 	open-source-like Splunk adoption model.</li>
<li>Splunk has been selling product 	for a couple of years. I gather Splunk 4 was recently released.</li>
<li>Splunk&#8217;s biggest industry segments 	are, not too surprisingly,
<ul>
<li>Telco</li>
<li>Financial services</li>
<li>Government</li>
<li>“Online”</li>
</ul>
</li>
<li>Splunk&#8217;s paying customers seem to 	use it mainly for:
<ul>
<li>Web logs and associated network 	event logs (this seems to be the biggest area)</li>
<li>Security and perhaps other general 	IT log analysis</li>
<li>Physical security logs (mainly in 	the government)</li>
<li>Anti-fraud (I&#8217;m not sure how that 	works)</li>
</ul>
</li>
<li>One would think Splunk would be 	used to manage a lot of intelligence telemetry, but that wasn&#8217;t 	particularly hinted at.</li>
<li>In general, the core problem 	Splunk is used for is log analysis for trouble-shooting purposes.</li>
<li>Splunk&#8217;s nonpaying users are more 	diverse; examples mentioned included windmill operations and protein 	research.</li>
<li>Splunk&#8217;s customers include Aster 	Data flagship accounts MySpace and LinkedIn. I bet many other top 	web companies are Splunk customers as well.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/18/general-introduction-to-splunk/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Infobright notes</title>
		<link>http://www.dbms2.com/2009/10/14/infobright-notes/</link>
		<comments>http://www.dbms2.com/2009/10/14/infobright-notes/#comments</comments>
		<pubDate>Wed, 14 Oct 2009 19:32:36 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Data mart outsourcing]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Infobright]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Kickfire]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Market share]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1091</guid>
		<description><![CDATA[I had lunch w/ Bob Zurek and Susan Davis of Infobright today. This wasn&#8217;t primarily a briefing, but a few takeaways are:

Infobright now has &#62;100 paying customers.
Typical database size is from the low 100s of gigabytes to the low single-digit number of terabytes.
Agile development is at or approaching two-week release cycles.
Like Kickfire, Infobright  has a [...]]]></description>
			<content:encoded><![CDATA[<p>I had lunch w/ Bob Zurek and Susan Davis of Infobright today. This wasn&#8217;t primarily a briefing, but a few takeaways are:</p>
<ul>
<li>Infobright now has &gt;100 paying customers.</li>
<li>Typical database size is from the low 100s of gigabytes to the low single-digit number of terabytes.</li>
<li>Agile development is at or approaching two-week release cycles.</li>
<li>Like Kickfire, Infobright  has a multi-year deal with MySQL that insulates it against many potential Oracle/MySQL shenanigans.</li>
<li>From an industry perspective, Infobright&#8217;s customer base sounds a lot like other vendors&#8217;:
<ul>
<li>Data mart outsourcing/online analytics</li>
<li>Log files for websites</li>
<li>Telecommunications</li>
<li>Financial services</li>
<li>OEM, especially in the markets cited above</li>
<li>&#8220;Hey, we&#8217;re beginning to see the occasional energy deal&#8221;</li>
<li>A few random others</li>
</ul>
</li>
<li>Infobright is seeing some household-name customers, who surely have big-name analytic DBMS products, but who also have a policy that open source is the default choice, and if open source can get the job done then the favorite closed-source choices aren&#8217;t used.</li>
<li>Infobright has the usual open-source community story &#8212; lots of involvement and engagement in the forums, but contributions are limited mainly to connectivity, utility scripts, etc. (Maybe some national language translation too; I&#8217;m not sure.)</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/14/infobright-notes/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>How 30+ enterprises are using Hadoop</title>
		<link>http://www.dbms2.com/2009/10/10/enterprises-using-hadoo/</link>
		<comments>http://www.dbms2.com/2009/10/10/enterprises-using-hadoo/#comments</comments>
		<pubDate>Sat, 10 Oct 2009 10:19:29 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Application areas]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Data types]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Structured documents]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Text]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1073</guid>
		<description><![CDATA[MapReduce is definitely gaining traction, especially but by no means only in the form of Hadoop. In the aftermath of Hadoop World, Jeff Hammerbacher of Cloudera walked me quickly through 25 customers he pulled from Cloudera&#8217;s files. Facts and metrics ranged widely, of course:

Some are in heavy production with 	Hadoop, and closely engaged with Cloudera. [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">MapReduce is definitely gaining traction, especially but by no means only in the form of Hadoop. In the aftermath of <a href="http://www.dbms2.com/2009/10/01/mapreduce-tidbits/" >Hadoop World</a>, Jeff Hammerbacher of Cloudera walked me quickly through 25 customers he pulled from Cloudera&#8217;s files. Facts and metrics ranged widely, of course:</p>
<ul>
<li>Some are in heavy production with 	Hadoop, and closely engaged with Cloudera. Others are active Hadoop 	users but are very secretive. Yet others signed up for initial 	Hadoop training last week.</li>
<li>Some have Hadoop clusters in the 	thousands of nodes. Many have Hadoop clusters in the 50-100 node 	range. Others are just prototyping Hadoop use. And one seems to be 	&#8220;OEMing&#8221; a small Hadoop cluster in each piece of equipment 	sold.</li>
<li>Many export data from Hadoop to a 	relational DBMS; many others just leave it in HDFS (Hadoop 	Distributed File System), e.g. with <a href="http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/" >Hive</a> as the query 	language, or in exactly one case Jaql.</li>
<li>Some are household names, in web 	businesses or otherwise. Others seem to be pretty obscure.</li>
<li>Industries include financial 	services, telecom (Asia only, and quite new), bioinformatics (and 	other research), intelligence, and lots of web and/or 	advertising/media.</li>
<li>Application areas mentioned &#8212; and 	these overlap in some cases &#8212; include:
<ul>
<li>Log and/or clickstream analysis of 	various kinds</li>
<li>Marketing analytics</li>
<li>Machine learning and/or 	sophisticated data mining</li>
<li>Image processing</li>
<li>Processing of XML messages</li>
<li>Web crawling and/or text 	processing</li>
<li>General archiving, including of 	relational/tabular data, e.g. for compliance</li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;"><span id="more-1073"></span>We went over this list so quickly that we didn&#8217;t go into much detail on any one user. But one example that stood out was of an ad serving firm that had an &#8220;aggregation pipeline&#8221; consisting of 70-80 MapReduce jobs.</p>
<p style="margin-bottom: 0in;">I also talked yesterday again w/ Omer Trajman of Vertica, who surprised me by indicating a high single-digit number of Vertica&#8217;s customers were in production with Hadoop &#8212; i.e., over 10% of Vertica&#8217;s production customers.  (Vertica recently made its 100th sale, and of course not all those buyers are in production yet.) <a href="http://www.dbms2.com/2009/08/04/verticas-version-of-mapreduce-integration/" >Vertica/Hadoop</a> usage seems to have started in Vertica&#8217;s financial services stronghold &#8212; specifically in financial trading &#8212; with web analytics and the like coming on afterwards. Based on current prototyping efforts, Omer expects bioinformatics to be the third production market for Vertica/Hadoop, with telecommunications coming in fourth.</p>
<p style="margin-bottom: 0in;">Unsurprisingly, the general Vertica/Hadoop usage model seems to be:</p>
<ul>
<li>Do something to the data in Hadoop</li>
<li>Dump it into Vertica to be queried</li>
</ul>
<p style="margin-bottom: 0in;">What I did find surprising is that the data often isn&#8217;t reduced by this analysis, but rather exploded in size.  E.g., a complete store of mortgage trading data might be a few terabytes in size, but Hadoop-based post-processing can increase that by 1 or 2 orders of magnitude. (Analogies to the importance and magnitude of <em>&#8220;cooked&#8221; data</em> in scientific data processing come to mind.)</p>
<p style="margin-bottom: 0in;">And finally, I talked to Aster a few days ago about the usage of its nCluster/Hadoop connector. Aster characterized Aster/Hadoop users&#8217; Hadoop usage as being of the batch/ETL variety, which is the classic use case one concedes to Hadoop even if one believes that MapReduce should commonly be done right in the DBMS.</p>
<p style="margin-bottom: 0in;"><em><strong>Related link</strong></em></p>
<ul>
<li><a href="../2008/08/26/known-applications-of-mapreduce/">An 	August, 2008 round-up of MapReduce applications</a>.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/10/enterprises-using-hadoo/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Yahoo wants to do decapetabyte-scale data warehousing in Hadoop</title>
		<link>http://www.dbms2.com/2009/10/01/yahoos-decapetabyte-data-warehousinghadoop/</link>
		<comments>http://www.dbms2.com/2009/10/01/yahoos-decapetabyte-data-warehousinghadoop/#comments</comments>
		<pubDate>Thu, 01 Oct 2009 07:05:06 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=974</guid>
		<description><![CDATA[My old client Mark Tsimelzon moved over to Yahoo after Coral8 was acquired, and I caught up with him last month. He turns out to be running development for a significant portion of Yahoo&#8217;s Hadoop effort &#8212; everything other than HDFS (Hadoop Distributed File System). Yahoo evidently plans to, within a year or so, get [...]]]></description>
			<content:encoded><![CDATA[<p>My old client <a href="http://www.dbms2.com/2007/08/10/the-essence-of-cep-according-to-coral8" >Mark Tsimelzon</a> moved over to Yahoo after Coral8 was acquired, and I caught up with him last month. He turns out to be running development for a significant portion of Yahoo&#8217;s Hadoop effort &#8212; everything other than HDFS (Hadoop Distributed File System). Yahoo evidently plans to, within a year or so, get Hadoop to the point that it is managing 10s of petabytes of data for Yahoo, with reasonable data warehousing functionality.</p>
<p style="margin-bottom: 0in;">Highlights of our visit included:</p>
<ul>
<li>There are dozens of people at 	Yahoo doing Hadoop development that will wind up getting open 	sourced. (Full-time or close to it.) In particular, everything 	Mark&#8217;s team does goes to open source.</li>
<li>Yahoo is moving as much of its 	analytics to Hadoop as possible. Much of this is being moved away 	from <a href="http://www.dbms2.com/2009/09/19/oracle-database-siz/" >Oracle</a> and from Yahoo&#8217;s own <a href="http://www.dbms2.com/2009/07/06/yahoo-is-up-to-10-petabytes-now/" >Everest</a>.</li>
<li>A column store 	is being put on top of HDFS, based on Yahoo technology. Columns will 	be striped across nodes. Perhaps that&#8217;s why the effort is called 	Project Zebra.</li>
<li>Mark believes 	that in a year Hadoop will be much further along in meeting 	traditional data warehousing requirements, in areas such as:
<ul>
<li>Metadata</li>
<li>SLAs/high 	availability/other workload management</li>
<li>Data retention 	policies</li>
<li>Security/privacy*</li>
</ul>
</li>
<li>Yahoo views 	the time-to-market benefits of Hadoop as being more important than 	TCO.</li>
</ul>
<p style="margin-bottom: 0in; font-style: normal;"><em><span id="more-974"></span>*I also spoke with a couple of Mark&#8217;s Yahoo colleagues, on his introduction, who are being less helpful than he is about clarifying what I am or am not allowed to say for publication. But I will say that I was heartened by the degree of concern they showed for doing the right thing with regard to privacy. I was not as heartened by the concrete ideas &#8212; or lack thereof &#8212; for making it happen. But frankly, I don&#8217;t think it&#8217;s a solvable technical problem. Rather, it should be <a href="http://www.monashreport.com/2006/06/06/freedom-even-without-data-privacy/" onclick="javascript:pageTracker._trackPageview('/www.monashreport.com');">a huge priority on the legal/political front</a>.</em></p>
<p style="margin-bottom: 0in;">We also talked some about Pig, Yahoo&#8217;s non-SQL DML (Data Manipulation Language) for Hadoop, which is however getting a SQL interface. And we talked about Pig vs. <a href="http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/" >Hive</a>. But I recently heard a rumor all that is in flux, so I won&#8217;t write it up now.</p>
<p style="margin-bottom: 0in;">Mark sent along a couple of interesting slide presentations by a colleague. After some back and forth as to whether I could post them, he suggested I post <a href="http://developer.yahoo.net/blogs/theater/archives/2009/09/welcome_hadoop_summit.html" onclick="javascript:pageTracker._trackPageview('/developer.yahoo.net');">these</a> <a href="http://developer.yahoo.net/blogs/theater/archives/2009/06/hadoopsummit_shugar.html" onclick="javascript:pageTracker._trackPageview('/developer.yahoo.net');">links</a> to similar material instead.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/01/yahoos-decapetabyte-data-warehousinghadoop/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>
