<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS2 -- DataBase Management System Services &#187; Google</title>
	<atom:link href="http://www.dbms2.com/category/products-and-vendors/google-mapreduce-bigtable/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Fri, 12 Mar 2010 23:51:42 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Some NoSQL links</title>
		<link>http://www.dbms2.com/2010/03/12/some-nosql-links/</link>
		<comments>http://www.dbms2.com/2010/03/12/some-nosql-links/#comments</comments>
		<pubDate>Fri, 12 Mar 2010 23:51:42 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Amazon and its cloud]]></category>
		<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Continuent]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Tokutek]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1692</guid>
		<description><![CDATA[I plan to post a few things soon about MongoDB, Cassandra, and NoSQL in general. So I&#8217;m poking around a bit reading stuff on the subjects. Here are some links I found.

A little over a year ago, Julian Browne put up a great post on Eric Brewer&#8217;s CAP conjecture/theorem, which provides much of the impetus [...]]]></description>
			<content:encoded><![CDATA[<p>I plan to post a few things soon about MongoDB, Cassandra, and NoSQL in general. So I&#8217;m poking around a bit reading stuff on the subjects. Here are some links I found.</p>
<ul>
<li>A little over a year ago, Julian Browne put up a great post on <a href="http://www.julianbrowne.com/article/viewer/brewers-cap-theorem" onclick="javascript:pageTracker._trackPageview('/www.julianbrowne.com');">Eric Brewer&#8217;s CAP conjecture/theorem</a>, which provides much of the impetus to relax the traditional requirement for atomicity/consistency.</li>
<li>Even more directly inspirational to NoSQL technology development were two seminal papers: Google&#8217;s on <a href="http://labs.google.com/papers/bigtable.html" onclick="javascript:pageTracker._trackPageview('/labs.google.com');">BigTable</a> and Amazon&#8217;s on <a href="http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf" onclick="javascript:pageTracker._trackPageview('/s3.amazonaws.com');">Dynamo</a>. (That said, I&#8217;m having trouble getting myself to actually read them from start to finish, especially since they&#8217;ve been superseded by subsequent technology development.)</li>
<li>10gen (the MongoDB guys) hosted a NoSQL conference yesterday. Much blogging has ensued. The best post I&#8217;ve seen so far was by <a href="http://blog.marcua.net/post/442594842/notes-from-nosql-live-boston-2010" onclick="javascript:pageTracker._trackPageview('/blog.marcua.net');">Adam Marcus</a>. I find the graph database notes near the bottom particularly interesting.</li>
<li>Mark Callaghan hit back against the <a href="http://mysqlha.blogspot.com/2010/03/plays-well-with-others.html" onclick="javascript:pageTracker._trackPageview('/mysqlha.blogspot.com');">NoSQL movement</a>, and in particular against the &#8216;<a href="http://www.dbms2.com/2010/03/02/cassandra-nosql-scalable-oltp/">MySQL/memcached is passe</a>&#8217; meme. On the other hand, he also bemoaned many failings of MySQL. On the third hand, he praised or at least expressed hope for a variety of MySQL-related technologies, including <a href="http://www.dbms2.com/2009/04/16/introduction-to-tokutek/">Tokutek&#8217;s TokuDB</a> and <a href="http://www.dbms2.com/2009/09/03/continuent-on-clustering/">Continuent&#8217;s Tungsten</a>.</li>
<li>In connection with that debate, Mark Rendle offered a <a href="http://blog.markrendle.net/2010/03/do-you-need-relational-database.html" onclick="javascript:pageTracker._trackPageview('/blog.markrendle.net');">funny rant</a>, mainly pro-NoSQL, in the style of a Socratic dialogue.</li>
<li>John Quinn of Digg recently described <a href="http://www.stumbleupon.com/su/5099Ti/about.digg.com/node/564" onclick="javascript:pageTracker._trackPageview('/www.stumbleupon.com');">Digg&#8217;s move from MySQL to Cassandra</a>, and outlined a lot of features Digg was adding to Cassandra, all of which it is open-sourcing.</li>
<li>The NoSQL guys maintain their own long <a href="http://nosql-database.org/links.html" onclick="javascript:pageTracker._trackPageview('/nosql-database.org');">list of NoSQL-related links</a>.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/03/12/some-nosql-links/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>More patent nonsense &#8212; Google MapReduce</title>
		<link>http://www.dbms2.com/2010/02/11/google-mapreduce-patent/</link>
		<comments>http://www.dbms2.com/2010/02/11/google-mapreduce-patent/#comments</comments>
		<pubDate>Thu, 11 Feb 2010 19:29:57 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Google]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Parallelization]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1565</guid>
		<description><![CDATA[Google recently received a patent for MapReduce. The first and most general claim is (formatting and emphasis mine):
A system for large-scale processing of data, comprising:

a plurality of processes executing on a plurality of interconnected processors;
the plurality of processes including a master process, for coordinating a data processing job for processing a set of input data, [...]]]></description>
			<content:encoded><![CDATA[<p>Google recently received a <a href="http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&amp;Sect2=HITOFF&amp;d=PALL&amp;p=1&amp;u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&amp;r=1&amp;f=G&amp;l=50&amp;s1=7,650,331.PN.&amp;OS=PN/7,650,331&amp;RS=PN/7,650,331" onclick="javascript:pageTracker._trackPageview('/patft.uspto.gov');">patent</a> for MapReduce. The first and most general claim is (formatting and emphasis mine):<span id="more-1565"></span></p>
<blockquote><p>A system for large-scale processing of data, comprising:</p>
<ul>
<li>a plurality of processes executing on a plurality of interconnected processors;</li>
<li>the plurality of processes including a master process, for coordinating a data processing job for processing a set of input data, and worker processes;</li>
<li>the master process, in response to a request to perform the data processing job, assigning input data blocks of the set of input data to respective ones of the worker processes;</li>
<li>each of a first plurality of the worker processes <strong>including an application-independent map module</strong> for retrieving a respective input data block assigned to the worker process by the master process and <strong>applying an application-specific map operation</strong> to the respective input data block to produce intermediate data values, wherein at least a subset of the intermediate data values each comprises a <strong>key/value pair,</strong> and wherein at least two of the first plurality of the worker processes operate simultaneously so as to perform the application-specific map operation in <strong>parallel</strong> on distinct, respective input data blocks;</li>
<li>a partition operator for processing the produced intermediate data values to produce a plurality of intermediate data sets, wherein each respective intermediate data set includes <strong>all key/value pairs for a distinct set of respective keys,</strong> and wherein at least one of the respective intermediate data sets includes respective ones of the key/value pairs produced by a plurality of the first plurality of the worker processes;</li>
<li>and each of a second plurality of the worker processes including <strong>an application-independent reduce module for retrieving data,</strong> the retrieved data comprising at least a subset of the key/value pairs from a respective intermediate data set of the plurality of intermediate data sets and applying <strong>an application-specific reduce operation</strong> to the retrieved data to produce final output data corresponding to the distinct set of respective keys in the respective intermediate data set of the plurality of intermediate data sets, and wherein at least two of the second plurality of the worker processes operate simultaneously so as to perform the application-specific reduce operation in <strong>parallel</strong> on multiple respective subsets of the produced intermediate data values.</li>
</ul>
</blockquote>
<p><em>The way a patent works is that you make a big claim and, just in case it&#8217;s later invalidated, you also make more specialized sub-claims. What&#8217;s more, in a software patent, you claim everything twice, once as a &#8220;system&#8221; and once as a &#8220;method.&#8221;</em></p>
<p>When a patent takes that long to issue and has a core claim that wordy, one can assume there was much back and forth with the PTO (Patent and Trademark Office) to whittle it down to something they felt they could approve. At a guess, I&#8217;d conjecture that the supposedly unique parts of the claim are concentrated in the areas I bolded above, and that the PTO doesn&#8217;t think the claim would be patentable unless most or all of them were included.</p>
<p>So should the claim have been approved even so? Let&#8217;s consider prior art. <a href="../../../../../2009/10/06/oracle-mapreduce/">Oracle has long been able to parallelize à la MapReduce</a>. I don&#8217;t see anything in the claim that isn&#8217;t preceded by what Oracle did, except maybe the emphasis on key/value pairs. (And the same statement applies to the other 15 claims in the patent, at least on a quick skim.) I forget the details of SenSage&#8217;s quasi-MapReduce, which also preceded the Google patent filing, but I imagine something similar would be true about it.</p>
<p>There is no doubt that Google popularized the ideas of MapReduce &#8212; which turns out to have been a worthy public service. In one great example of that popularization, <a href="http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf" onclick="javascript:pageTracker._trackPageview('/www.cs.stanford.edu');">the seminal paper on parallel data mining</a> is almost laughable in how it <a href="../../../../../2009/10/15/mapreduce-webinar-slides/">deviates from MapReduce key/value pair formalism</a> &#8212; but it still seems to have been inspired by Google&#8217;s MapReduce. But that&#8217;s a different matter; popularization != invention, even though there&#8217;s a certain connection between the two in patent law. Actually, Google also often does get credit for having &#8220;invented&#8221; MapReduce, including regrettably in the marketing materials of clients I can&#8217;t talk out of saying that and which now might be looking into the barrel of the Google patent (hello Aster); but again, saying something doesn&#8217;t make it enforceable in court.</p>
<p>So what it all boils down to is:</p>
<p><strong>Should Google&#8217;s patent on the idea of parallelizing the handling of sets of application-visible key/value pairs be regarded as valid?</strong></p>
<p>The United States PTO, which is paid to think about these things, has evidently decided Yes. I disagree. In simplest terms, my reason is that key/value pairs have been around for decades, and so:</p>
<p><strong>Anything which was known or obvious without special reference to key/value pairs doesn&#8217;t suddenly become non-obvious when key/value pairs are mixed in.</strong></p>
<p>If Google ever tries to enforce its MapReduce patent, I&#8217;m available as an expert witness for the other side.</p>
<p><strong><em>Related links</em></strong></p>
<ul>
<li><a href="http://gigaom.com/2010/01/19/why-hadoop-users-shouldnt-fear-googles-new-mapreduce-patent/" onclick="javascript:pageTracker._trackPageview('/gigaom.com');">GigaOm</a> and <a href="http://arstechnica.com/open-source/news/2010/01/googles-mapreduce-patent-what-does-it-mean-for-hadoop.ars" onclick="javascript:pageTracker._trackPageview('/arstechnica.com');">Ars Technica</a> on the Google MapReduce patent</li>
<li>Another <a href="http://www.dbms2.com/2010/01/15/vertica-sybase-ipatent-litigation/" >silly software patent</a> issue</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/02/11/google-mapreduce-patent/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Clearing up MapReduce confusion, yet again</title>
		<link>http://www.dbms2.com/2009/12/30/clearing-up-mapreduce-confusion-yet-again/</link>
		<comments>http://www.dbms2.com/2009/12/30/clearing-up-mapreduce-confusion-yet-again/#comments</comments>
		<pubDate>Wed, 30 Dec 2009 10:50:53 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[SenSage]]></category>
		<category><![CDATA[Splunk]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1371</guid>
		<description><![CDATA[I&#8217;m frustrated by a constant need &#8212; or at least urge   &#8212; to correct myths and errors about MapReduce. Let&#8217;s try one more time:

MapReduce was named and popularized &#8212; but not invented &#8212; by Google.
&#8220;MapReduce&#8221; variously refers to:

A programming paradigm
Execution engines that implement the programming paradigm
Distributed file systems that work with the execution [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m frustrated by a constant need &#8212; or at least urge <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  &#8212; to correct <a href="http://www.dbms2.com/2009/10/18/three-big-myths-about-mapreduce/" >myths and errors about MapReduce</a>. Let&#8217;s try one more time:<span id="more-1371"></span></p>
<ul>
<li>MapReduce was named and popularized &#8212; but not invented &#8212; by Google.</li>
<li>&#8220;MapReduce&#8221; variously refers to:
<ul>
<li>A programming paradigm</li>
<li>Execution engines that implement the programming paradigm</li>
<li>Distributed file systems that work with the execution engines</li>
</ul>
</li>
<li>In particular, Hadoop is a MapReduce execution engine that includes or is closely associated with HDFS (Hadoop Distributed File System).</li>
<li>MapReduce and analytic DBMS can interact in a number of different ways, including:
<ul>
<li>Tight integration between a DBMS and exposed MapReduce functionality, e.g. <a href="http://www.dbms2.com/2009/10/15/mapreduce-webinar-slides/" >Aster Data&#8217;s SQL/MapReduce</a> or Greenplum.</li>
<li>Integrated MapReduce &#8220;under the covers&#8221;, e.g. SenSage or <a href="http://www.dbms2.com/2009/10/06/oracle-mapreduce/" >Oracle</a>. This may or may not follow all the rules Google laid out for MapReduce, but it&#8217;s at least similar in spirit.</li>
<li>Looser coupling between DBMS and a MapReduce system, e.g. <a href="http://www.dbms2.com/2009/08/04/verticas-version-of-mapreduce-integration/" >Vertica/Hadoop</a>, in which MapReduce may or may not run on a different cluster than the DBMS.</li>
<li>Not at all, except perhaps insofar as a quasi-DBMS such as <a href="http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/" >Hive</a> is implemented over a MapReduce system such as Hadoop/HDFS.</li>
</ul>
</li>
<li>As predicted by <a href="http://www.strategicmessaging.com/monashs-first-law-of-commercial-semantics-explained/2009/01/09/" onclick="javascript:pageTracker._trackPageview('/www.strategicmessaging.com');">Monash&#8217;s First Law of Commercial Semantics</a>, different vendors have individual variants on those themes. For example, as per <a href="http://www.splunk.com/product" onclick="javascript:pageTracker._trackPageview('/www.splunk.com');">a registration-required white paper</a>, Splunk is moving to publicly expose a not-quite-complete form of MapReduce.</li>
<li>MapReduce implementations such as Hadoop are sometimes regarded as part of the <a href="http://www.dbms2.com/2009/12/12/legit-nosql-key-value-store/" >NoSQL</a> &#8220;movement&#8221;. When they are, many generalities about NoSQL &#8212; such as that it doesn&#8217;t deal with analytics &#8212; are falsified.</li>
<li>So far as I can tell, mainstream enterprise (as opposed to web, scientific, investment, etc.) data mining folks may be looking at MapReduce for data mining, but they haven&#8217;t done much to adopt it yet. Probably that&#8217;s because the outfits who have the greatest need are the same ones that have the largest sunk investments in more traditional ways of doing data mining.</li>
<li>Cloudera != Hadoop. On the other hand, if you want to use Hadoop, it makes a lot of sense to do business with Cloudera.</li>
<li>Non-DBMS MapReduce != Hadoop. On the other hand, Hadoop is the default choice for non-DBMS MapReduce.</li>
<li>MapReduce != Hadoop, period. DBMS-based MapReduce is also a legitimate technical strategy.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/12/30/clearing-up-mapreduce-confusion-yet-again/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Three big myths about MapReduce</title>
		<link>http://www.dbms2.com/2009/10/18/three-big-myths-about-mapreduce/</link>
		<comments>http://www.dbms2.com/2009/10/18/three-big-myths-about-mapreduce/#comments</comments>
		<pubDate>Sun, 18 Oct 2009 16:14:37 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Michael Stonebraker]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1135</guid>
		<description><![CDATA[Once again, I find myself writing and talking a lot about MapReduce.  But I suspect that MapReduce-related conversations would go better if we overcame three fairly common MapReduce myths:

MapReduce is something very new
MapReduce involves strict adherence to the Map-Reduce programming paradigm
MapReduce is a single technology

So let&#8217;s give it a try.
When Dave DeWitt and Mike [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Once again, I find myself writing and talking a lot about MapReduce.  But I suspect that MapReduce-related conversations would go better if we overcame three fairly common MapReduce myths:</p>
<ul>
<li>MapReduce is something very new</li>
<li>MapReduce involves strict adherence to the Map-Reduce programming paradigm</li>
<li>MapReduce is a single technology</li>
</ul>
<p style="margin-bottom: 0in;"><span id="more-1135"></span>So let&#8217;s give it a try.</p>
<p style="margin-bottom: 0in;">When Dave DeWitt and Mike Stonebraker leveled <a href="../2008/01/18/the-great-mapreduce-debate/">their famous blast at MapReduce</a>, many people thought they overstated their case. But one part of their story – one that both Mike and Dave say was most central to their case – was never effectively refuted, namely the claim that these ideas aren&#8217;t particularly new. I haven&#8217;t actually read enough computer science literature to have an independent opinion on that issue. But I&#8217;ll say this – claims from companies such as <a href="../2009/10/18/introduction-to-sensage/">SenSage</a>, <a href="../2009/10/06/oracle-mapreduce/">Oracle</a>, or <a href="../2009/10/18/technical-introduction-to-splunk/">Splunk</a> that “We&#8217;ve been doing MapReduce all along” seem pretty credible to me.</p>
<p style="margin-bottom: 0in;">True, what those companies were doing may not have looked exactly like the instant-classic MapReduce programming paradigm. But the same is true of many things almost everybody would agree count as MapReduce.  In particular, it is often not the case that you alternate Map and Reduce steps, each of whose outputs is a set of simple &lt;Key, Value&gt; pairs, with data redistributed based on Key at every step.</p>
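<p style="margin-bottom: 0in;">For reference, the textbook pattern that paragraph describes (Map emits simple key/value pairs, the pairs are redistributed by key, Reduce collapses each key) can be sketched in a few lines of single-process Python. This is only an illustration under stated assumptions: the function names, the toy partitioner, and the word-count task are all hypothetical, and none of it is taken from Google's, Hadoop's, or any vendor's implementation.</p>

```python
from collections import defaultdict

def map_word_count(document):
    """Map step: emit one (word, 1) pair per word in the input block."""
    return [(word.lower(), 1) for word in document.split()]

def partition(pairs, num_reducers):
    """Redistribute pairs so that all values for a key land in one bucket."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        # Toy deterministic partitioner (an assumption made for this sketch).
        index = sum(key.encode("utf-8")) % num_reducers
        buckets[index][key].append(value)
    return buckets

def reduce_word_count(key, values):
    """Reduce step: collapse all values for one key into a single pair."""
    return key, sum(values)

def run_job(documents, num_reducers=2):
    """Run map, partition, and reduce in sequence inside one process."""
    intermediate = []
    for document in documents:  # each call could run on its own worker
        intermediate.extend(map_word_count(document))
    output = {}
    for bucket in partition(intermediate, num_reducers):  # one bucket per reducer
        for key, values in bucket.items():
            word, count = reduce_word_count(key, values)
            output[word] = count
    return output

counts = run_job(["map reduce map", "reduce shuffle map"])
```

<p style="margin-bottom: 0in;">Everything here runs in one process; in a real engine the Map calls and the per-bucket Reduce calls would be spread across workers.</p>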
<p style="margin-bottom: 0in;">Here are some examples of what I mean, drawn from <a href="http://www.asterdata.com/blog/index.php/2009/10/15/mastering-mapreduce/" onclick="javascript:pageTracker._trackPageview('/www.asterdata.com');">my recent MapReduce webinar</a>.</p>
<ul>
<li>If you do text indexing in MapReduce, your goal is to wind up with a text index. So at some point you Reduce to a pair &lt;WordName, {all the (DocumentID, offset) pairs for the whole corpus, suitably ordered}&gt;.  That&#8217;s a heckuva compound “Value”.</li>
<li>The goal of data mining is usually to estimate a rather small number of parameters based on a large overall data set, often – depending on algorithm – in the form of a single vector. When you do that in MapReduce, you partition data among nodes and calculate something on each node that is structured more or less like your final vector. So when it comes time for the reduce, you just ship all of your vectors – one per node – to a single Reduce node, and do the appropriate math. Redistribution based on Key would be quite pointless.</li>
<li>When you sessionize clickstream logs in MapReduce, you may have just as many output records as input records. However, they now are reformatted, and might have a SessionID appended. In those cases, Reduce isn&#8217;t doing much by the way of reduction.</li>
<li>And as happens in some <a href="../2009/08/04/verticas-version-of-mapreduce-integration/">Vertica-Hadoop</a> use cases around mortgage trading, sometimes MapReduce can even make data sets vastly larger.</li>
</ul>
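<p style="margin-bottom: 0in;">The first two examples in that list can be made concrete with a rough Python sketch. It assumes word positions stand in for offsets and a per-column mean stands in for the data mining parameters being estimated; every function name is hypothetical, and this illustrates the shape of the data rather than any product's actual code.</p>

```python
from collections import defaultdict

def map_postings(doc_id, text):
    """Map: emit (word, (doc_id, offset)) for every word occurrence."""
    return [(word.lower(), (doc_id, offset))
            for offset, word in enumerate(text.split())]

def reduce_postings(pairs):
    """Reduce: one compound value per word, the sorted list of its postings."""
    index = defaultdict(list)
    for word, posting in pairs:
        index[word].append(posting)
    return {word: sorted(postings) for word, postings in index.items()}

def partial_sums(rows):
    """Per-node step: build a vector shaped like the final answer."""
    totals = [0.0] * len(rows[0])
    for row in rows:
        totals = [t + x for t, x in zip(totals, row)]
    return totals, len(rows)

def combine_means(partials):
    """Single Reduce node: add up one vector per node, then do the math."""
    grand = [0.0] * len(partials[0][0])
    count = 0
    for totals, n in partials:
        grand = [g + t for g, t in zip(grand, totals)]
        count += n
    return [g / count for g in grand]

# Text indexing: the Value for each word is a whole posting list.
corpus = {1: "map reduce map", 2: "reduce index"}
pairs = [p for doc_id, text in corpus.items() for p in map_postings(doc_id, text)]
index = reduce_postings(pairs)

# Data mining: one partial vector per node, no redistribution by key.
node_a = partial_sums([[1.0, 2.0], [3.0, 4.0]])
node_b = partial_sums([[5.0, 6.0]])
means = combine_means([node_a, node_b])
```

<p style="margin-bottom: 0in;">In the indexing half, the Reduce output for each word is a whole sorted posting list rather than a simple value. In the data mining half, nothing is keyed at all: each node hands over one small vector and a row count.</p>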
<p style="margin-bottom: 0in; font-style: normal;">By no means do I think this is a weakness of the MapReduce programming paradigm. Rather, I think it&#8217;s a MapReduce strength. But it&#8217;s not quite the way MapReduce has been promoted and explained to the IT public.</p>
<p style="margin-bottom: 0in; font-style: normal;">Finally: MapReduce, as commonly conceived, spans two different – albeit closely related – technology domains:</p>
<ul>
<li>Parallel programming</li>
<li>Distributed data management</li>
</ul>
<p style="margin-bottom: 0in; font-style: normal;">For example, I imagine Greenplum&#8217;s and Vertica&#8217;s MapReduce/SQL combined syntaxes are very similar to each other&#8217;s. But Vertica&#8217;s data management implementation of MapReduce, which relies on Hadoop, is very different from Greenplum&#8217;s, which is tied into the Greenplum DBMS. Similarly, non-DBMS MapReduce implementations are commonly associated with distributed file systems – notably HDFS (Hadoop Distributed File System) or Google&#8217;s internal GFS (Google File System). In those systems, the parallel language execution part should be aware of how the distributed file management part works – but perhaps that awareness can be pretty lightweight.</p>
<p style="margin-bottom: 0in; font-style: normal;">Right now, this is a distinction pretty much without a difference. If you choose an implementation of MapReduce &#8212; like pure Hadoop (say in the Cloudera distribution) or Hadoop-Vertica or Aster Data&#8217;s SQL/MapReduce – you&#8217;re basically picking an entire technology stack. But those stacks are going to do a whole lot of changing and maturing in the near future – and as they do, it&#8217;s likely that projects will interact or even combine in all sorts of interesting ways.</p>
<p style="margin-bottom: 0in; font-style: normal;"><strong>Bottom line: There are a lot of different ways to exploit MapReduce-related technology.</strong></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/18/three-big-myths-about-mapreduce/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Google Fusion Tables</title>
		<link>http://www.dbms2.com/2009/06/15/google-fusion-tables/</link>
		<comments>http://www.dbms2.com/2009/06/15/google-fusion-tables/#comments</comments>
		<pubDate>Mon, 15 Jun 2009 11:10:50 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=813</guid>
		<description><![CDATA[Google has announced an experimental cloud-based data management system called Fusion Tables. A press article and Slashdot thread ensued, based on some bizarre-sounding analyst quotes that I will not attempt to parse.
What Fusion Tables really seems to be is a spreadsheet without the formulae. That is, it&#8217;s a place to dump data in a grid [...]]]></description>
			<content:encoded><![CDATA[<p>Google has announced an <a href="http://googleresearch.blogspot.com/2009/06/google-fusion-tables.html" onclick="javascript:pageTracker._trackPageview('/googleresearch.blogspot.com');">experimental cloud-based data management system called Fusion Tables</a>. A <a href="http://www.itworld.com/saas/69183/watch-out-oracle-google-tests-cloud-based-database" onclick="javascript:pageTracker._trackPageview('/www.itworld.com');">press article</a> and <a href="http://developers.slashdot.org/article.pl?sid=09/06/12/1658206" onclick="javascript:pageTracker._trackPageview('/developers.slashdot.org');">Slashdot thread</a> ensued, based on some bizarre-sounding analyst quotes that I will not attempt to parse.</p>
<p>What Fusion Tables really seems to be is a spreadsheet without the formulae. That is, it&#8217;s a place to dump data in a grid of cells, comment on it, version it, and do elementary data manipulation.  This could, I guess, be useful as an alternative to traditional RDBMS &#8212; assuming, of course, that you want to have a row-by-row debate about 100 megs of data.</p>
<p>Seriously, while Google Fusion Tables bears some vague resemblance to what I&#8217;m thinking about for the future of both <a href="http://www.dbms2.com/2009/05/30/reinventing-business-intelligence/" >business intelligence</a> and <a href="http://www.dbms2.com/2009/06/08/the-future-of-data-marts/" >data marts</a>, it sounds as if it has a long way to go before it&#8217;s something most enterprises should spend time looking at.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/06/15/google-fusion-tables/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Reinventing business intelligence</title>
		<link>http://www.dbms2.com/2009/05/30/reinventing-business-intelligence/</link>
		<comments>http://www.dbms2.com/2009/05/30/reinventing-business-intelligence/#comments</comments>
		<pubDate>Sat, 30 May 2009 12:38:25 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Memory-centric data management]]></category>
		<category><![CDATA[Microsoft and SQL*Server]]></category>
		<category><![CDATA[SAP AG]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=794</guid>
		<description><![CDATA[I&#8217;ve felt for quite a while that business intelligence tools are due for a revolution. But I&#8217;ve found the subject daunting to write about because &#8212; well, because it&#8217;s so multifaceted and big.  So to break that logjam, here are some thoughts on the reinvention of business intelligence technology, with no pretense of being [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">I&#8217;ve felt for quite a while that business intelligence tools are due for a revolution. But I&#8217;ve found the subject daunting to write about because &#8212; well, because it&#8217;s so multifaceted and big.  So to break that logjam, here are some thoughts on the reinvention of business intelligence technology, with no pretense of being in any way comprehensive.</p>
<p style="margin-bottom: 0in;"><strong>Natural language and classic science fiction</strong></p>
<p style="margin-bottom: 0in;">Actually, there&#8217;s a pretty well-known example of BI near-perfection &#8212; <strong>the </strong><em><strong>Star Trek</strong></em><strong> computers,</strong> usually voiced by the late Majel Barrett Roddenberry. They didn&#8217;t have a big role in the recent movie, which was so fast-paced nobody had time to analyze very much, but were a big part of the <em>Star Trek</em> universe overall. <em>Star Trek&#8217;s</em> computers integrated analytics, operations, and authentication, all with a great natural language/voice interface and visual displays. That example is at the heart of <a href="http://www.texttechnologies.com/2009/05/30/men-are-from-earth-computers-are-from-vulcan/" onclick="javascript:pageTracker._trackPageview('/www.texttechnologies.com');">a 1998 article on natural language recognition I just re-posted</a>.</p>
<p style="margin-bottom: 0in;">As for reality: For decades, dating back at least to Artificial Intelligence Corporation&#8217;s Intellect, there have been offerings that provided<strong> &#8220;natural language&#8221; command, control, and query</strong> against otherwise fairly ordinary analytic tools. Such efforts have generally fizzled, for reasons outlined at the link above. Wolfram Alpha is the latest try; fortunately for its prospects, natural language is really only a small part of the Wolfram Alpha story.</p>
<p style="margin-bottom: 0in;">A second theme has more recently emerged &#8212; <strong>using text indexing to get at data more flexibly than a relational schema would normally allow,</strong> either by searching on data values themselves (stressed by <em>Attivio</em>) or by searching on the definitions of pre-built reports (the Google OneBox story). SAP&#8217;s Explorer is the latest such view, but I find <a href="http://www.intelligententerprise.com/blog/archives/2009/05/explorer_seems.html#comments" onclick="javascript:pageTracker._trackPageview('/www.intelligententerprise.com');">Doug Henschen&#8217;s skepticism about SAP Explorer</a> more persuasive than <a href="http://www.intelligententerprise.com/blog/archives/2009/05/explorer_splash.html#comments" onclick="javascript:pageTracker._trackPageview('/www.intelligententerprise.com');">Cindi Howson&#8217;s cautiously favorable view</a>.  Partly that&#8217;s because I know SAP (and Business Objects); partly it&#8217;s because of difficulties such as those I already noted.</p>
<p style="margin-bottom: 0in;"><strong>Flexibility and data exploration</strong></p>
<p style="margin-bottom: 0in;">It&#8217;s a truism that each generation of dashboard-like technology fails because it&#8217;s too inflexible. Users are shown the information that will provide them with the most insight.  They appreciate it at first. But eventually it&#8217;s old hat, and when they want to do something new, the baked-in data model doesn&#8217;t support it.</p>
<p style="margin-bottom: 0in;">The latest attempts to overcome this problem lie in two overlapping trends &#8212; <strong>cool data exploration/visualization tools</strong> and <strong>in-memory analytics.</strong> <span id="more-794"></span><span style="font-style: normal;">Tableau and Spotfire</span> are known more for the former; hot BI vendor <a href="../2008/08/04/qliktech-qlikview-update/">QlikTech</a> is known for both. And many vendors &#8212; established or otherwise &#8212; are going to <a href="../2009/04/22/clearing-some-of-my-buffer/">in-memory OLAP</a>.</p>
<p style="margin-bottom: 0in;"><strong><span style="font-style: normal;">Collaboration and communication</span></strong></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">The reason I&#8217;m finally buckling down and posting on this subject is the announcement of <a href="http://www.texttechnologies.com/2009/05/29/google-wave-finally-a-microsoft-killer/" onclick="javascript:pageTracker._trackPageview('/www.texttechnologies.com');">Google Wave, which I think foreshadows a revolution in communication and collaboration technology</a>. Google Wave augurs two primary advances. First, it shows how to make email, instant messaging, microblogging, and so on much more useful. Second, Google Wave could evolve in a way that &#8212; finally &#8212;  makes it truly practical for end-users to set up ad-hoc mini-portals that combine arbitrary URL-possessing resources, exposed to arbitrary workgroups of people.</span></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">If and when both of those promises are fulfilled, it will become vastly easier for people to reason together about analytic questions.  That may take a little while, as Google Wave obviously wasn&#8217;t designed with business intelligence in mind. But whether from Google or from a frightened Microsoft redoubling its SharePoint efforts, there&#8217;s hope that we&#8217;ll see a leap forward in general collaboration technology. And since BI vendors are doing a generally decent job of exposing queries, charts and so on as portlets, it seems likely that business intelligence will benefit from the collaboration arms race.</span></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">That&#8217;s important. I first heard that reporting is as important for communication as for analytics from Pilot Software a quarter-century or so ago, and it&#8217;s just as true now as it was then. In its first incarnations, collaborative BI probably will be a little too dumb for my tastes, focusing more on mindless reporting and same-old KPIs than on deeper analysis. Still, it&#8217;s a move in a good direction.</span></p>
<p style="margin-bottom: 0in;"><strong><span style="font-style: normal;">Other directions</span></strong></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">As I said at the beginning, I find it too daunting to try to cover all facets of this subject in one post. So I&#8217;ll leave out, at a minimum:</span></p>
<ul>
<li><span style="font-style: normal;"><a href="http://www.dbms2.com/2009/02/25/even-more-final-version-of-my-tdwi-slide-deck/" >Data warehousing performance and TCO</a>, which I of course write about extensively</span></li>
<li><span style="font-style: normal;"><a href="http://www.dbms2.com/2009/05/21/notes-on-cep-performance/" >Complex event/stream processing</a>, which I&#8217;ve written quite a bit about too</span></li>
<li><span style="font-style: normal;">Data mining and predictive analytics</span></li>
<li><span style="font-style: normal;">Operational BI</span></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">plus some hobby horses you probably don&#8217;t want to hear about anyway until I work out a better way of articulating my opinions.</span></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">But by all means please comment on what I&#8217;ve left out just as vigorously as on what I&#8217;ve included.  This post is just the first of many to come.</span></p>
<p style="margin-bottom: 0in;">
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/05/30/reinventing-business-intelligence/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>High-end MySQL use</title>
		<link>http://www.dbms2.com/2008/11/21/high-end-mysql-use/</link>
		<comments>http://www.dbms2.com/2008/11/21/high-end-mysql-use/#comments</comments>
		<pubDate>Fri, 21 Nov 2008 19:55:31 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Google]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Parallelization]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=630</guid>
		<description><![CDATA[To a large extent, MySQL lives in two different alternate universes from most other DBMS. One is for low-end, simple database applications.  For example, of all the DBMS I write about, MySQL is the one I actually use in my own business &#8212; because MySQL sits underneath WordPress, and WordPress is what runs my [...]]]></description>
			<content:encoded><![CDATA[<p>To a large extent, MySQL lives in two different alternate universes from most other DBMS. One is for low-end, simple database applications.  For example, of all the DBMS I write about, MySQL is the one I actually use in my own business &#8212; because MySQL sits underneath WordPress, and WordPress is what runs my blogs.  My largest database (the one for <em>DBMS2</em>) contains 12 megabytes of data in 11 tables, none of which has yet reached 5000 rows in size.<span id="more-630"></span></p>
<p>The other alternate universe for MySQL is Very Large Internet Companies. Often, via a technique called <em>sharding,</em> a very simple schema is massively partitioned, with different partitions on different machines. Consider, for example, the following quote from the <em><a href="http://mysqlha.blogspot.com/2008/11/no-more-crashes-for-my-mysql.html" onclick="javascript:pageTracker._trackPageview('/mysqlha.blogspot.com');">High Availability MySQL</a></em> blog:</p>
<blockquote><p>It almost never crashes for me now, but you might not be able to make the same claim. Well, if you use MyISAM rather than InnoDB you might be able to make the claim, but in that case you really need it to not crash as MyISAM might not recover to a transaction consistent state. The fix is not in the official release. This is almost a repeat of a <a href="http://mysqlha.blogspot.com/2008/10/crashes-from-innodb-long-semaphore_25.html" onclick="javascript:pageTracker._trackPageview('/mysqlha.blogspot.com');">recent post</a>, but one thing has changed. We deployed the fix for <a href="http://bugs.mysql.com/bug.php?id=32149" onclick="javascript:pageTracker._trackPageview('/bugs.mysql.com');">bug 32149</a> and eliminated the cause of the majority of crashes from software bugs. We still get crashes on a daily basis, but hardware is the cause.</p></blockquote>
<p>On a first read, that seems ridiculous &#8212; crashes are daily, they used to be worse, and this is called &#8220;High Availability&#8221;??  But let&#8217;s look at the first sentence of the author&#8217;s <a href="http://en.oreilly.com/mysql2008/public/schedule/speaker/88" onclick="javascript:pageTracker._trackPageview('/en.oreilly.com');">bio</a>:</p>
<blockquote><p>Mark is the lead for the MySQL Engineering team at Google that supports a large MySQL deployment.</p></blockquote>
<p>Suddenly the whole thing makes a lot more sense. Google&#8217;s basic approach to computing is to run farms of hundreds or thousands of servers, each of which may have an individual Mean Time Between Failures measured in months. Across that many machines, a hardware failure somewhere on any given day is only to be expected.</p>
<p>While we&#8217;re at it, I got that bio from an April 2008 talk. The talk&#8217;s abstract is revealing as well:</p>
<blockquote><p>The InnoDB storage engine is an amazing work of engineering and art. But it has a flaw. It was designed for servers with few <span class="caps">CPU</span> cores and few disks. Design decisions made then limit the scalability of InnoDB on today’s commodity servers, which happen to have many disks and <span class="caps">CPU</span> cores. InnoDB and the MySQL community are working hard on fixing these problems. I will describe the performance problems in the official release and fixes for the problems that have been developed by InnoDB, Google and the MySQL community.</p></blockquote>
<p>If you want a highly industrial-strength DBMS, and you don&#8217;t want to co-develop it yourself, MySQL probably isn&#8217;t for you.  But that doesn&#8217;t mean it isn&#8217;t a fine product for other uses, or other users.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2008/11/21/high-end-mysql-use/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Google has thousands of internal data formats, mostly simple ones</title>
		<link>http://www.dbms2.com/2008/07/08/google-has-thousands-of-internal-data-formats-mostly-simple-ones/</link>
		<comments>http://www.dbms2.com/2008/07/08/google-has-thousands-of-internal-data-formats-mostly-simple-ones/#comments</comments>
		<pubDate>Tue, 08 Jul 2008 18:27:06 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data integration and middleware]]></category>
		<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=459</guid>
		<description><![CDATA[In connection with the release of Protocol Buffers, Kenton Varda of Google wrote:
At Google, our mission is organizing all of the world&#8217;s information. We use literally thousands of different data formats to represent networked messages between servers, index records in repositories, geospatial datasets, and more. Most of these formats are structured, not flat. This raises [...]]]></description>
			<content:encoded><![CDATA[<p>In connection with the release of <a href="http://www.networkworld.com/community/node/29671" onclick="javascript:pageTracker._trackPageview('/www.networkworld.com');">Protocol Buffers</a>, Kenton Varda of Google <a href="http://google-opensource.blogspot.com/2008/07/protocol-buffers-googles-data.html" onclick="javascript:pageTracker._trackPageview('/google-opensource.blogspot.com');">wrote</a>:<span id="more-459"></span></p>
<blockquote><p>At Google, our mission is organizing all of the world&#8217;s information. We use literally thousands of different data formats to represent networked messages between servers, index records in repositories, geospatial datasets, and more. Most of these formats are structured, not flat. This raises an important question: How do we encode it all?</p></blockquote>
<p>That sounds like a lot. On the other hand, if &#8220;data format&#8221; is just a synonym for &#8220;table structure,&#8221; &#8220;file structure,&#8221; and/or &#8220;schema,&#8221; it sounds more plausible. Varda goes on to say:</p>
<blockquote><p>a simple lists-and-records model &#8230; solves the majority of problems</p></blockquote>
<p>Come to think of it, that sounds very consistent with the idea that <a href="http://www.dbms2.com/2008/01/18/the-great-mapreduce-debate/" >MapReduce</a> solves a large fraction of Google&#8217;s data management issues.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2008/07/08/google-has-thousands-of-internal-data-formats-mostly-simple-ones/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>More Google reliability woes</title>
		<link>http://www.dbms2.com/2008/01/20/more-google-reliability-woes/</link>
		<comments>http://www.dbms2.com/2008/01/20/more-google-reliability-woes/#comments</comments>
		<pubDate>Sun, 20 Jan 2008 08:03:19 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Google]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/2008/01/20/more-google-reliability-woes/</guid>
		<description><![CDATA[Google&#8217;s reliability issues are ever worse.  As I previously pointed out, this is evidence against the notion that MapReduce is a replacement for established DBMS.
]]></description>
			<content:encoded><![CDATA[<p>Google&#8217;s reliability issues are <a href="http://blog.searchenginewatch.com/blog/080117-150547" onclick="javascript:pageTracker._trackPageview('/blog.searchenginewatch.com');">ever worse</a>.  As I previously pointed out, this is evidence <a href="http://www.dbms2.com/2008/01/19/mapreduce-variable-schema-analytics/" >against</a> the notion that <a href="http://www.dbms2.com/2008/01/18/the-great-mapreduce-debate/" >MapReduce</a> is a replacement for established DBMS.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2008/01/20/more-google-reliability-woes/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Oracle/Google/Apple merger – wow!  Just &#8212; wow.</title>
		<link>http://www.dbms2.com/2007/04/01/oracle-google-apple-merger-possibilities/</link>
		<comments>http://www.dbms2.com/2007/04/01/oracle-google-apple-merger-possibilities/#comments</comments>
		<pubDate>Sun, 01 Apr 2007 11:02:02 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Google]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Apple]]></category>
		<category><![CDATA[April Fool's Day]]></category>
		<category><![CDATA[merger]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/2007/04/01/oracle-google-apple-merger-possibilities/</guid>
		<description><![CDATA[If rumors are to be believed, Oracle, Google, and Apple are close to agreeing on a mega-blockbuster three-way merger.  Just the personality combinations are amazing, starting with close friends Jobs and Ellison &#8212; perhaps the two greatest entrepreneurs of Silicon Valley, and both with impeccable taste – and the traditionally sloppy, generation-younger Page and [...]]]></description>
			<content:encoded><![CDATA[<p class="MsoNormal">If rumors are to be believed, Oracle, Google, and Apple are close to agreeing on a mega-blockbuster three-way merger.  Just the personality combinations are amazing, starting with close friends Jobs and Ellison &#8212; perhaps the two greatest entrepreneurs of Silicon Valley, and both with impeccable taste – and the traditionally sloppy, generation-younger Page and Brin.  But let’s jump straight to some of the possible business and technology ramifications.</p>
<p class="MsoNormal"><strong>The Macintosh could become a serious Windows competitor.</strong> The Mac is quietly making an enterprise comeback anyway.  Business intelligence, dashboards, and the like are constantly in the throes of UI re-invention.  (I have some articles in the works about why the industry never seems to get them right, but in the meantime here is my <a href="http://www.monashreport.com/2006/04/08/ui-musings/" onclick="javascript:pageTracker._trackPageview('/www.monashreport.com');">UI overview article</a> from last year.)</p>
<p class="MsoNormal"><strong>Whole new generations of personal/pervasive computing devices could evolve</strong>.  Apple obviously is a huge personal-electronic-device player with the iPod and upcoming iPhone.  Google has looked into cell phones as well. Designing cool devices will not be a problem.  The issue is making them integrate really well with enterprise systems.  I favor speech interfaces, myself.</p>
<p class="MsoNormal"><strong>Enterprise information management could be transformed. </strong>Oracle is batting about 0-for-the-decade in search.   Google is selling a lot of not-terribly-useful low-end enterprise search boxes.  There’s room for both to do a lot better.  Ex-Oracle executive Dennis Moore has some <a href="http://www.texttechnologies.com/2007/02/28/sap%e2%80%99s-%e2%80%9csearch%e2%80%9d-strategy-isn%e2%80%99t-about-search/" onclick="javascript:pageTracker._trackPageview('/www.texttechnologies.com');">good ideas</a> in that regard.</p>
<p class="MsoNormal"><em><strong>Related link</strong></em></p>
<ul>
<li>Scoble has details on <a href="http://scobleizer.com/2007/03/31/apple-collaborating-with-amazon-google-and-cingular-on-new-ireader/trackback/http:/scobleizer.com/2007/03/31/apple-collaborating-with-amazon-google-and-cingular-on-new-ireader/" onclick="javascript:pageTracker._trackPageview('/scobleizer.com');">part of the story</a>.</li>
</ul>
<p class="MsoNormal"><em>There’s one catch, however:  On April 1, rumors generally should not be taken too seriously.</em></p>
<p class="MsoNormal">
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2007/04/01/oracle-google-apple-merger-possibilities/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
