<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; Google</title>
	<atom:link href="http://www.dbms2.com/category/products-and-vendors/google-mapreduce-bigtable/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Wed, 08 Feb 2012 22:51:11 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>Big data terminology and positioning</title>
		<link>http://www.dbms2.com/2012/01/08/big-data-terminology-and-positioning/</link>
		<comments>http://www.dbms2.com/2012/01/08/big-data-terminology-and-positioning/#comments</comments>
		<pubDate>Mon, 09 Jan 2012 01:35:57 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5768</guid>
		<description><![CDATA[Recently, I observed that Big Data terminology is seriously broken. It is reasonable to reduce the subject to two quasi-dimensions: Bigness &#8212; Volume, Velocity, size Structure &#8212; Variety, Variability, Complexity given that High-velocity &#8220;big data&#8221; problems are usually high-volume as well.* Variety, variability, and complexity all relate to the simply-structured/poly-structured distinction. But the conflation should [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, I observed that <a href="../../../../../2011/09/11/big-data-has-jumped-the-shark/">Big Data terminology is seriously broken</a>. It is reasonable to reduce the subject to two quasi-dimensions:</p>
<ul>
<li><strong>Bigness</strong> &#8212; Volume, Velocity, Size</li>
<li><strong>Structure</strong> &#8212; Variety, Variability, Complexity</li>
</ul>
<p>given that</p>
<ul>
<li>High-velocity &#8220;big data&#8221; problems are usually high-volume as well.*</li>
<li>Variety, variability, and complexity all relate to the <a href="../../../../../2011/05/17/poly-structured-database/">simply-structured/poly-structured</a> distinction.</li>
</ul>
<p>But the conflation should stop there.</p>
<p><em>*Low-volume/high-velocity problems are commonly referred to as <a href="../2011/08/25/renaming-cep-or-not/">&#8220;event processing&#8221; and/or &#8220;streaming&#8221;</a>.</em></p>
<p>When people claim that bigness and structure are the same issue, they oversimplify into mush. So I think we need four pieces of terminology, reflective of a 2&#215;2 matrix of possibilities. For want of better alternatives, my suggestions are:</p>
<ul>
<li><strong>Relational big data</strong> is data of high volume that fits well into a relational DBMS.</li>
<li><strong>Multi-structured big data</strong> is data of high volume that doesn&#8217;t fit well into a relational DBMS. <em>Alternative: Poly-structured big data.</em></li>
<li><strong>Conventional relational data</strong> is data of not-so-high volume that fits well into a relational DBMS. <em>Alternatives: Ordinary/normal/smaller relational data.</em></li>
<li><strong>Smaller poly-structured data</strong> is data for which <a href="../../../../../2011/07/31/dynamic-fixed-schema-databases/">dynamic schema</a> capabilities are important, but which doesn&#8217;t rise to &#8220;big data&#8221; volume.</li>
</ul>
<p><span id="more-5768"></span>Notes on all this include:</p>
<ul>
<li>&#8220;Relational big data&#8221; is commonly what you need a scalable analytic relational DBMS for. But there are non-analytic use cases as well.</li>
<li>The paradigmatic example of &#8220;multi-structured big data&#8221; is log files. Thus, multi-structured big data is commonly what you need a <a href="../../../../../2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a> for.</li>
<li>One might want to equate non-analytic relational big data technology to &#8220;NewSQL&#8221;. However, I&#8217;m struggling to think of a database size range in which the entire NewSQL industry can match Oracle&#8217;s market share alone.</li>
<li>One might want to equate non-analytic multi-structured big data technology to &#8220;NoSQL&#8221;. However:
<ul>
<li>&#8220;NoSQL&#8221; is also used to encompass not-so-big-data use cases, such as prototyping in MongoDB.</li>
<li><a href="../../../../../2011/10/02/defining-nosql/">&#8220;NoSQL&#8221; has non-ACID/low(er)-data-integrity connotations</a> that aren&#8217;t appropriate for all non-relational systems.</li>
</ul>
</li>
<li>Up to a point, you can analyze relational big data in a conventional relational DBMS, but an analytic RDBMS will usually win on TCO (Total Cost of Ownership). In particular, reasonable thresholds for moving an analytic database off Oracle might be:
<ul>
<li>1-2 terabytes if you&#8217;ve never bought anything past Oracle Standard Edition.</li>
<li>5-10 terabytes if you&#8217;re already paying for Oracle Enterprise Edition.</li>
<li>A lot higher than that if you actually find Oracle Exadata to be cost-effective.</li>
</ul>
</li>
<li>Depending on how big one acknowledges as &#8220;big&#8221;, the market share leader in &#8220;big bit bucket&#8221; use cases is either Splunk or Hadoop.</li>
<li>If we look at multi-structured big data management overall, MarkLogic joins the list of market share contenders, as do various NoSQL alternatives.</li>
<li>It is wrong to say that the large web companies invented &#8220;big data&#8221; technology. But it is more reasonable to say they invented much of &#8220;multi-structured big data&#8221; management. In particular (and this is just a partial list), Google, Amazon, Yahoo, Facebook, et al. can reasonably be credited with Hadoop, Cassandra, HBase and various predecessors to same.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/01/08/big-data-terminology-and-positioning/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Text data management, Part 1: Confusion</title>
		<link>http://www.dbms2.com/2011/10/10/text-data-management-confusion/</link>
		<comments>http://www.dbms2.com/2011/10/10/text-data-management-confusion/#comments</comments>
		<pubDate>Tue, 11 Oct 2011 01:58:03 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Archiving and information preservation]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5421</guid>
		<description><![CDATA[This is Part 1 of a three post series. The posts cover: Confusion about text data management. Choices for text data management (general and short-request). Choices for text data management (analytic). There&#8217;s much confusion about the management of text data, among technology users, vendors, and investors alike. Reasons seem to include: The terminology around text [...]]]></description>
			<content:encoded><![CDATA[<p><em>This is Part 1 of a three post series. The posts cover:</em></p>
<ol>
<li> <em><a href="http://www.dbms2.com/2011/10/10/text-data-management-confusion/">Confusion about text data management</a>.</em></li>
<li><em><a href="http://www.dbms2.com/2011/10/10/text-data-management-part-2-general-and-short-request/">Choices for text data management (general and short-request)</a>.</em></li>
<li><em><a href="http://www.dbms2.com/2011/10/10/text-data-management-part-3-analytic-and-progressively-enhanced/">Choices for text data management (analytic)</a>.</em></li>
</ol>
<p>There&#8217;s much confusion about the management of text data, among technology users, vendors, and investors alike. Reasons seem to include:</p>
<ul>
<li>The terminology around text data is inaccurate.</li>
<li>Data volume estimates for text are misleading.</li>
<li>Multiple different technologies are in the mix, including:
<ul>
<li>Enterprise text search.</li>
<li>Text analytics &#8212; <a href="http://www.texttechnologies.com/category/software-as-a-service-saas/category/text-mining/">text mining</a>, sentiment analysis, etc.</li>
<li>Document stores &#8212; e.g. document-oriented NoSQL, or MarkLogic.</li>
<li>Log management and parsing &#8212; e.g. Splunk.</li>
<li>Text archiving &#8212; e.g., various specialty email archiving products I couldn&#8217;t even name.</li>
<li>Public web search &#8212; Google et al.</li>
</ul>
</li>
<li>Text search vendors have disappointed, especially technically.</li>
<li>Text analytics vendors have disappointed, especially financially.</li>
<li>Other analytic technology vendors ignore <a href="http://www.texttechnologies.com/2010/12/01/state-of-the-art-text-analytics-mining-applications/">what the text analytic vendors actually have accomplished</a>, and reinvent inferior wheels rather than OEM the state of the art.</li>
</ul>
<p>Above all: <a href="http://www.dbms2.com/2011/10/10/text-data-management-part-2-general-and-short-request/">The use cases for text data vary greatly</a>, just as the use cases for simply-structured databases do.</p>
<p>There are probably fewer people now than there were six years ago who need to be told that <a href="http://www.dbms2.com/2005/12/09/relational-dbms-versus-text-data/">text and relational database management are very different things</a>. Other misconceptions, however, appear to be on the rise. Specific points that are commonly overlooked include: <span id="more-5421"></span></p>
<ul>
<li><strong> The terms &#8220;unstructured&#8221; or &#8220;semi-structured&#8221; data are inherently misleading</strong>. That&#8217;s why <a href="../../../../../2011/05/17/poly-structured-database/">I favor &#8220;multi-structured&#8221; or &#8220;poly-structured&#8221; instead</a>. (&#8220;Multi-structured&#8221; seems to be winning; e.g., it&#8217;s been adopted by Teradata and Teradata/Aster.)</li>
<li>The &#8220;social media&#8221; text data any one enterprise brings in house isn&#8217;t all that much. For example, <a href="../../../../../2011/04/14/attensity-update/">Attensity serves many different enterprises&#8217; social media needs from a single 20-terabyte data store</a>, and reports that no single enterprise has required as much as 1 terabyte of text yet. <strong>Text data may consume a lot of storage </strong>on spinning disks somewhere,<strong> but it&#8217;s not that big a factor in future DBMS industry growth.</strong> (That 20 terabyte figure does seem low.)</li>
<li><strong>Structured databases are typically worth a lot more per bit than other kinds.</strong> The most valuable electronic data, per-bit, is probably records of significant economic transactions &#8212; purchases, sales, money transfers, etc. The least valuable may be sensor log files, whose contents consist mainly of &#8220;Nothing going on here; ping you again in a minute.&#8221; Email logs, web interaction data and many other kinds fall somewhere in between. Highly valuable documents &#8212; such as signed contracts &#8212; generally persist in paper as well as electronic forms. <strong>Investors commonly overlook this point.</strong></li>
<li><strong>The enterprise text search industry is screwed up.</strong>
<ul>
<li>FAST was a goofy company before it was acquired for far too much money by Microsoft.</li>
<li>Autonomy was a goofy company before it was acquired for far too much money by HP.</li>
<li>Google&#8217;s enterprise efforts are quiet.</li>
<li>The integration of text search and relational DBMS &#8212; e.g. at Oracle &#8212; has languished, with poor performance and evident lack of management attention.</li>
<li>Smaller text search vendors don&#8217;t seem to be getting a lot of traction &#8212; e.g., <a href="http://www.texttechnologies.com/category/vendors/coveo/">Coveo</a> has a decent reputation, but when&#8217;s the last time you heard much about them? What has Attivio actually accomplished?</li>
</ul>
</li>
<li><strong>Text analytics is a small business</strong>. Add up the revenue for Attensity, Clarabridge, Lexalytics, Temis, and all the others, and you might poke above $100 million, especially now that Attensity had a three-way merger. Then again, you might not.</li>
<li>Even so, <strong>the text analytics vendors have developed sophisticated technology.</strong> In particular, you can use it to get a pretty good idea as to what people are writing about you, individually or as groups.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/10/10/text-data-management-confusion/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Notes and links October 3 2010</title>
		<link>http://www.dbms2.com/2010/10/03/notes-and-links-october-3-2010/</link>
		<comments>http://www.dbms2.com/2010/10/03/notes-and-links-october-3-2010/#comments</comments>
		<pubDate>Mon, 04 Oct 2010 01:10:41 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[GIS and geospatial]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[HP and Neoview]]></category>
		<category><![CDATA[Humor]]></category>
		<category><![CDATA[Kickfire]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=3103</guid>
		<description><![CDATA[Some notes, follow-up, and links before I head out to California:  HP hired a software guy, Leo Apotheker, as CEO, and a software guy with a liking for high-end services, Ray Lane, as chairman. Now a Leo Apotheker conference call suggests HP will increase its emphasis on software, and maybe high-end services as well. No [...]]]></description>
			<content:encoded><![CDATA[<p>Some notes, follow-up, and links before I head out to California:  <span id="more-3103"></span></p>
<ul>
<li>HP hired a software guy, Leo Apotheker, as CEO, and a software guy with a liking for high-end services, <a href="http://www.dbms2.com/2010/09/30/ray-lane-at-hp/">Ray Lane</a>, as chairman. Now a <a href="http://news.cnet.com/8301-31021_3-20018241-260.html">Leo Apotheker conference call</a> suggests HP will increase its emphasis on software, and maybe high-end services as well. No surprise. The article suggests, however,  that HP at this point has no clear strategy along these lines. That&#8217;s no surprise either.
<ul>
<li>And then there&#8217;s <a href="http://techcrunch.com/2010/10/01/oh-thank-god-oracle-has-a-new-rivalry/">Sarah Lacy&#8217;s take</a>, of which the interesting part reads &#8220;Separately, Andreessen has said that he thinks enterprise software is  ripe for disruption and his firm is going to fund a new generation of  Oracle-killers.&#8221;</li>
<li>I added more on <a href="http://www.softwarememories.com/2010/10/03/ray-lane-and-the-integration-of-software-and-consulting-at-oracle/">Ray Lane&#8217;s tenure at Oracle</a> over on <em><a href="http://www.softwarememories.com">Software Memories</a>.</em></li>
</ul>
</li>
<li>Netezza had a falling out with its original supplier of geospatial technology, Intelligent Integration Systems (IISi), and a lawsuit ensued over alleged copying. Now IISi has upped the stakes, essentially alleging that <a href="http://news.cnet.com/8301-27080_3-20017809-245.html">Netezza&#8217;s new geospatial software doesn&#8217;t work</a>, and that hence the CIA (evidently a Netezza user) is killing the wrong people via drone strikes. Netezza has wisely selected from its short list of acceptable responses, including versions of:
<ul>
<li>&#8220;All our classified customers are happy, and if we told you anything more than that, that would kind of defeat the purpose of being classified, wouldn&#8217;t it?&#8221;</li>
<li>&#8220;Copy, schmopy. A polygon is a polygon, and has been since Euclid.&#8221;</li>
<li>&#8220;We don&#8217;t have no steenking bugs.&#8221;</li>
</ul>
</li>
<li><a href="http://www.theregister.co.uk/2010/09/30/ocz_hdsl/">OCZ</a>, whoever they are, are trying to offer solid-state drives with PCIe-like bandwidth, which makes sense in that most observers except <a href="http://www.dbms2.com/2009/10/25/teradata-hardware-strategy-and-tactics/">Teradata</a> think the SAS interface isn&#8217;t fast enough for solid-state.</li>
<li>Speaking of Teradata, I&#8217;d been wondering somewhat as to why they just <a href="http://www.dbms2.com/2010/08/12/teradata-future-product-strategy/">shut down Kickfire&#8217;s product line after acquiring its assets</a>. Well, somebody who tested a Kickfire box told me that &#8212; <a href="http://www.dbms2.com/2008/04/18/kickfire-kicks-off/">great TPC-H results notwithstanding</a> &#8212; it turned out not to be nearly as fast as one might think, on real-life data sets that didn&#8217;t fit entirely into RAM. Hard though such a thing may be to imagine, it turns out that Kickfire&#8217;s TPC-H results were yet less significant than I thought they were.</li>
<li>I haven&#8217;t been looking at <em><a href="http://highscalability.com/">High Scalability</a></em> nearly as  much as I should, and that&#8217;s an understatement. It&#8217;s an outstanding  blog.</li>
<li>A couple of Google execs offered some <a href="http://www.mediapost.com/publications/?fa=Articles.showArticle&amp;art_aid=136685&amp;nid=119185">predictions   about the future of online advertising</a>, which might be of  interest  to anybody selling analytic (or text analytic) technology to  the  online/digital media market.</li>
<li>The BBC shows us <a href="http://www.bbc.co.uk/blogs/researchanddevelopment/2010/09/what-makes-zeitgeist-tick.shtml">what a single 133-character tweet plus its metadata look like in JSON</a>. (All 1582 characters.)</li>
<li><em>Huffington Post&#8217;s</em> CEO made some comments about <a href="http://paidcontent.org/article/419-huffposts-hippeau-social-informants-are-the-new-influencers/">influencers</a> which are additive to what I&#8217;ve been saying about <a href="http://www.strategicmessaging.com/influencers-long-tail-watts-godin/2008/02/02/">influencers</a> over on <em><a href="http://www.strategicmessaging.com">Strategic Messaging</a>.</em> (If you don&#8217;t read that &#8212; well, it&#8217;s my blog about marketing.)</li>
<li>Speaking of my other blogs, I&#8217;m not bothering to put up a separate post like this over on <em><a href="http://www.texttechologies.com">Text Technologies</a>, </em>where the posts I have put up recently tend to be (at least by my standards) relatively link-heavy anyway, but I have a couple more to share even so:
<ul>
<li>Paul Carr&#8217;s <a href="http://techcrunch.com/2010/10/01/eh-oh-well/">7 rules for TechCrunch/AOL employees</a> are really funny.</li>
<li>Some major search engine marketing experts are sounding <a href="http://sphinn.com/story/159876">defeatist about web spam</a>.</li>
</ul>
</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/10/03/notes-and-links-october-3-2010/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Nested data structures keep coming up, especially for log files</title>
		<link>http://www.dbms2.com/2010/07/31/nested-data-structures-keep-coming-up-especially-for-log-files/</link>
		<comments>http://www.dbms2.com/2010/07/31/nested-data-structures-keep-coming-up-especially-for-log-files/#comments</comments>
		<pubDate>Sat, 31 Jul 2010 10:42:06 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Theory and architecture]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2723</guid>
		<description><![CDATA[Nested data structures have come up several times now, almost always in the context of log files. Google has published about a project called Dremel. Per Tasso Argyros, one of Dremel&#8217;s key concepts is nested data structures. Those arrays that the XLDB/SciDB folks keep talking about are meant to be nested data structures. Scientific data [...]]]></description>
			<content:encoded><![CDATA[<p>Nested data structures have come up several times now, almost always in the context of log files.</p>
<ul>
<li>Google has published about a project called <a href="http://www.asterdata.com/blog/index.php/2010/07/19/google%E2%80%99s-dremel-%E2%80%93-or-can-mapreduce-itself-handle-fast-interactive-querying/">Dremel</a>. Per Tasso Argyros, one of Dremel&#8217;s key concepts is nested data structures.</li>
<li>Those <a href="http://www.dbms2.com/2009/10/03/issues-in-scientific-data-management/">arrays</a> that the XLDB/SciDB folks keep talking about are meant to be nested data structures. Scientific data is of course log-oriented. <a href="http://www.dbms2.com/2010/05/22/scidb-and-scientific-database-management/">eBay was very interested in that project too</a>.</li>
<li>Facebook&#8217;s log files have a big nested data structure flavor.</li>
</ul>
<p>I don&#8217;t have a grasp yet on what exactly is happening here, but it&#8217;s something.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/07/31/nested-data-structures-keep-coming-up-especially-for-log-files/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Cassandra technical overview</title>
		<link>http://www.dbms2.com/2010/07/06/cassandra-technical-overview/</link>
		<comments>http://www.dbms2.com/2010/07/06/cassandra-technical-overview/#comments</comments>
		<pubDate>Tue, 06 Jul 2010 09:10:39 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Amazon and its cloud]]></category>
		<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[DataStax]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Parallelization]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2473</guid>
		<description><![CDATA[Back in March, I talked with Jonathan Ellis of Rackspace, who runs the Apache Cassandra project. I started drafting a blog post then, but never put it up. Then Jonathan cofounded Riptano, a company to commercialize Cassandra, and so I talked with him again in May. Well, I&#8217;m finally finding time to clear my Cassandra/Riptano [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Back in March, I talked with Jonathan Ellis of Rackspace, who runs the Apache Cassandra project. I started drafting a blog post then, but never put it up. Then Jonathan cofounded Riptano, a company to commercialize Cassandra, and so I talked with him again in May. Well, I&#8217;m finally finding time to clear my Cassandra/Riptano backlog. I&#8217;ll cover the more technical parts below, and the more business- or usage-oriented ones in <a href="http://www.dbms2.com/2010/07/06/riptano-and-cassandra-adoption/">a companion Cassandra/Riptano post</a>.</p>
<p style="margin-bottom: 0in;">Jonathan&#8217;s core claims for Cassandra include:</p>
<ul>
<li>Cassandra is shared-nothing.</li>
<li>Cassandra has good approaches to 	replication and partitioning, right out of the box.</li>
<li>In particular, Cassandra is good 	for use cases that distribute a database around the world and want 	to access it at “local” latencies. (Indeed, Jonathan asserts 	that non-local replication is a significant non-big-data Cassandra 	use case.)</li>
<li>Cassandra&#8217;s scale-out is 	application-transparent, unlike sharded MySQL&#8217;s.</li>
<li> Cassandra is fast at both appends 	and range queries, which would be hard to accomplish in a pure 	key-value store.</li>
</ul>
<p style="margin-bottom: 0in;">In general, Jonathan positions Cassandra as being best-suited to handle a small number of operations at high volume, throughput, and speed. The rest of what you do, as far as he&#8217;s concerned, may well belong in a more traditional SQL DBMS.  <span id="more-2473"></span></p>
<p style="margin-bottom: 0in;">Further highlights of our talks included, as best I understood them:</p>
<ul>
<li>Cassandra is based in part on both Google&#8217;s <strong>BigTable</strong> paper of 2006 and Amazon&#8217;s <strong>Dynamo</strong> paper of 2007.
<ul>
<li>The core of what Cassandra takes 	from BigTable is based on <strong>log-structured merge trees,</strong> which 	actually entered the computer science literature in 1996.</li>
<li>Cassandra&#8217;s approach to horizontal scaling, replication, failover, etc. seems to be based on Dynamo.</li>
</ul>
</li>
<li>There seems to be <strong>a logical 	concept of “row”</strong> in Cassandra, or it&#8217;s at least meaningful 	to use the SQL/relational concept of a “row” when talking about 	Cassandra data. However, Cassandra is closer to being a <strong>column-based 	data store</strong> than a row-based one. (Not the same thing, but 	closer.)</li>
<li>Even so, it only takes a single 	seek to return a whole Cassandra “row”.</li>
<li>Cassandra 	writes data quite differently from the way a classical OLTP DBMS 	would.
<ul>
<li><strong>Cassandra writes just the data elements</strong> – i.e., fields – <strong>that are actually being inserted or changed,</strong> not whole rows.</li>
<li>One benefit is that Cassandra data 	is very <strong>sparse.</strong> NULLs aren&#8217;t stored in any way, and hence in 	particular take up no space.</li>
<li>Another benefit – and one of the 	core concepts of Cassandra – is that <strong>you can implicitly assume 	different schemas for different rows of the same “table.”</strong> In 	particular, you can add data for columns that you didn&#8217;t envision 	when you first started storing “rows” of the same “table.”</li>
<li><strong>Writes are collected into 	sorted “memtables,” which from time to time are sent to disk.</strong> Once data gets to disk, it&#8217;s <strong>immutable,</strong> except for occasional 	merge/reorganization/garbage collection.
<ul>
<li>Jonathan claims, plausibly, that this makes write throughput very fast (because the I/O is fundamentally sequential in nature).</li>
<li>The default as to how long data typically stays in memory before it gets persisted to disk is “whichever comes first of {64 MB written, 300k updates, 1 hour}”.</li>
</ul>
</li>
<li>Cassandra has <strong>durability</strong> – 	guaranteed non-loss of data – assuming fsync is turned on. fsync 	seems to create a 15% or so overhead.</li>
<li>However, Cassandra has <strong>no 	concept of a “transaction.”</strong></li>
<li>As one would 	expect, data can be read even before it has been persisted to disk.</li>
</ul>
</li>
<li>According to 	Jonathan, Cassandra can do about 14,000 writes or 7,000 reads per 	second, on a quad-core server.
<ul>
<li>Those figures scale pretty 	linearly with the number of servers. (There&#8217;s some overhead for 	network latency.)</li>
<li>Those figures assume a five-column 	row.</li>
<li>Cassandra&#8217;s write-performance 	figures are only “mildly sensitive” to the width of the row. 	E.g., doubling row width only gives a 15-20% throughput hit, due to 	some fixed per-row overhead. That said, I imagine going 100X in row 	width would create a major slowdown, although perhaps while 	measuring width more in bytes than in column count.</li>
<li>Cassandra&#8217;s <a href="http://racklabs.com/%7Ebwilliam/cassandra/04vs05vs06.png">performance</a> has been growing nicely in each point release. Jonathan thinks this general trend will continue.</li>
</ul>
</li>
<li>Jonathan thinks Cassandra is 	pretty good at keeping your data safe.
<ul>
<li>Each node has a commit log.</li>
<li>When a node goes down, its writes 	are buffered until it comes back up.</li>
</ul>
</li>
<li>You can run Hadoop MapReduce 	straight against Cassandra files.</li>
<li>A Cassandra node might hold 	anything from 10s of gigabytes to multiple terabytes of data. You 	might want to go with the low end if you want to have lots of cache 	hits.</li>
<li>Solid-state storage would speed up 	Cassandra reads, not writes, and is not widely used with Cassandra 	yet.</li>
<li>Jonathan says Cassandra is really 	good at handling time series data, by which I suspect he means log 	files. <a href="https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/">Cloudkick</a> is a user of this capability.</li>
</ul>
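<p>To make the write path above concrete, here is a toy sketch in Python. It is my own illustration under simplifying assumptions, not Cassandra code: the class name <code>TinyColumnStore</code>, the <code>flush_after</code> threshold, and the in-memory stand-in for disk are all hypothetical. It shows field-level writes, sparse rows with differing columns, a memtable that is sorted and flushed into an immutable run, and reads that see data before it has been flushed.</p>

```python
class TinyColumnStore:
    """Toy model of a memtable-plus-immutable-runs write path (illustration only)."""

    def __init__(self, flush_after=3):
        self.flush_after = flush_after  # hypothetical threshold: distinct fields per memtable
        self.memtable = {}              # (row_key, column) -> value, held in memory
        self.segments = []              # immutable sorted runs standing in for disk, oldest first

    def write(self, row_key, column, value):
        # Record only the field being inserted or changed, never a whole row.
        self.memtable[(row_key, column)] = value
        if len(self.memtable) >= self.flush_after:
            self.flush()

    def flush(self):
        # Sort once and append the run; a tuple is used so the run cannot be modified later.
        if self.memtable:
            self.segments.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def read_row(self, row_key):
        # Replay oldest to newest so that later writes override earlier ones.
        row = {}
        for segment in self.segments:
            for (key, column), value in segment:
                if key == row_key:
                    row[column] = value
        for (key, column), value in self.memtable.items():
            if key == row_key:
                row[column] = value
        return row


store = TinyColumnStore(flush_after=3)
store.write("user1", "name", "Ann")
store.write("user1", "city", "Boston")
store.write("user2", "name", "Bob")     # third field triggers a flush; user2 has no "city" at all
store.write("user1", "city", "Austin")  # newer value, still only in the memtable
print(store.read_row("user1"))
print(store.read_row("user2"))
```

<p>The sketch leaves out the commit log, compaction, and everything about distribution. The one design point it tries to capture is that the stored runs are never updated in place, so a read has to merge them with whatever is still in memory.</p>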
<p style="margin-bottom: 0in;">I certainly didn&#8217;t grasp everything about Cassandra replication and partitioning strategies. That wasn&#8217;t the focus of our talks, and anyway I got the impression they are so flexible that there&#8217;s little that can firmly be said about them. But I did get the impressions:</p>
<ul>
<li>You set your consistency rules in 	the Cassandra API, not on a per-table basis. (I think this means 	that a lack of administrative tools is supposedly a feature, not a 	drawback.)</li>
<li>As a practical matter, Cassandra 	users commonly take one of two approaches to consistency:
<ul>
<li><a href="http://www.dbms2.com/2010/05/01/ryw-read-your-writes-consistency/">RYW consistency</a>, most 	commonly with N = 3 and R = W = 2.</li>
<li>Geographically dispersed eventual 	consistency.</li>
</ul>
</li>
<li>Cassandra data is most commonly 	distributed via consistent hashing, but other options are 	“pluggable.”</li>
<li>If you add a node, the busiest node automagically decides to ship some data over, reducing its load. Of course, this only works if you get the new node on before the old node is so maxed out it doesn&#8217;t have time to do the shipping.</li>
</ul>
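<p>The N = 3, R = W = 2 setting mentioned above works because R + W is greater than N, so any set of replicas that acknowledged a write must share at least one node with any set of replicas that answers a read. A brute-force check of that overlap can be sketched in Python as follows. The function name <code>quorums_always_overlap</code> is hypothetical, and this only illustrates the counting argument, not how Cassandra actually selects replicas.</p>

```python
from itertools import combinations


def quorums_always_overlap(n, r, w):
    """Return True if every w-node write set shares a node with every r-node read set."""
    replicas = range(n)
    return all(
        set(write_set).intersection(read_set)  # an empty intersection counts as False
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


print(quorums_always_overlap(3, 2, 2))  # R + W = 4, which exceeds N = 3
print(quorums_always_overlap(3, 1, 1))  # R + W = 2, so a read can miss the latest write
```

<p>With three replicas, the first call reports that every read is guaranteed to touch a replica holding the latest write, while the second shows that single-replica reads and writes give no such guarantee.</p>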
<p style="margin-bottom: 0in;">When we talked in March, the next release of Cassandra was going to be 0.7. Cassandra 0.7 was going to be a performance/scalability release, for example fixing the flaw that garbage collection read rows into memory one at a time. After that, Cassandra 0.8 was to be a feature release, with one planned feature being more automatic index management and/or materialized-view-like capability, so as to reduce the burden on Cassandra developers of schema management.</p>
<p style="margin-bottom: 0in;"><em><strong>Related links</strong></em></p>
<ul>
<li>My March <a href="../2010/03/12/some-nosql-links/">NoSQL links post</a> included the Google and Amazon papers</li>
<li>The <a href="https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/">March 	2, 2010 Cloudkick post</a> also linked above goes into a lot of 	detail, including what they think is great about Cassandra and what 	they think is still missing</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/07/06/cassandra-technical-overview/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Various quick notes</title>
		<link>http://www.dbms2.com/2010/05/23/various-quick-notes/</link>
		<comments>http://www.dbms2.com/2010/05/23/various-quick-notes/#comments</comments>
		<pubDate>Sun, 23 May 2010 08:38:51 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[GIS and geospatial]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[SAP AG]]></category>
		<category><![CDATA[SAS Institute]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2173</guid>
		<description><![CDATA[As you might imagine, there are a lot of blog posts I&#8217;d like to write I never seem to get around to, or things I&#8217;d like to comment on that I don&#8217;t want to bother ever writing a full post about. In some cases I just tweet a comment or link and leave it at [...]]]></description>
			<content:encoded><![CDATA[<p>As you might imagine, there are a lot of blog posts I&#8217;d like to write that I never seem to get around to, or things I&#8217;d like to comment on that I don&#8217;t want to bother ever writing a full post about. In some cases I just <a href="http://twitter.com/CurtMonash">tweet</a> a comment or link and leave it at that.</p>
<p>And it&#8217;s not going to get any better. Next week = the oft-postponed elder care trip. Then I&#8217;m back for a short week. Then I&#8217;m off on my quarterly visit to the SF area. Soon thereafter I&#8217;ll have a lot to do in connection with <a href="http://www.netezza.com/userconference/speakers.html">Enzee Universe</a>. And at that point another month will have gone by.</p>
<p>Anyhow:<span id="more-2173"></span></p>
<ul>
<li>Back in January, Oracle finally briefed me on <a href="http://www.dbms2.com/2010/01/22/oracle-database-hardware-strategy/">Exadata 2</a>. I also requested and got permission to post what I regarded as pretty interesting slides, then never got around to doing so. Well, <a href="http://www.monash.com/uploads/Exadata-slides-January-2010.pdf">here they are</a>. (Pay no attention to the word &#8220;Confidential&#8221;.)</li>
<li>Two people I have a lot of respect for, <a href="http://intelligent-enterprise.informationweek.com/blog/archives/2010/05/sap_and_inmemor.html">Cindi Howson</a> and <a href="http://intelligent-enterprise.informationweek.com/blog/archives/2010/05/quick_takes_on.html">Doug Henschen</a>, seem bullish on SAP&#8217;s in-memory NewDB efforts. But for a variety of execution reasons, I&#8217;m skeptical that this will matter for anything except SAP&#8217;s analytics suite. I.e., I don&#8217;t think anybody much except SAP will write OLTP apps to it, and without OLTP apps being written to it, I don&#8217;t think it&#8217;s much more than Business Objects&#8217; answer to QlikView.</li>
<li>I just learned that <a href="http://www.thestreet.com/story/10640248/1/tech-rights-give-companies-upper-hand.html">Netezza&#8217;s previous geospatial technology didn&#8217;t get ported to TwinFin</a>. However, <a href="http://www.netezza.com/releases/2010/release021710.htm">Netezza obviously found a geospatial alternative</a>.</li>
</ul>
<p>I&#8217;m beginning to make a habit of asking vendors for a postable version of their slide decks. <a href="http://www.dbms2.com/2010/05/23/sybase-iq-15/">Sybase IQ</a> is another example.</p>
<ul>
<li>Google is doing something called <a href="http://googlecode.blogspot.com/2010/05/bigquery-and-prediction-api-get-more.html">BigQuery</a> that is &#8220;SQL-like&#8221; for big data analytics. I don&#8217;t know anything about it.</li>
<li>I also don&#8217;t know anything about <a href="http://www-01.ibm.com/software/ebusiness/jstart/bigsheets/">IBM BigSheets</a> yet. It sounds something like <a href="http://www.dbms2.com/2010/04/16/introduction-to-datameer/">Datameer</a>, but that could be way off the mark.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/05/23/various-quick-notes/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>ITA Software and Needlebase</title>
		<link>http://www.dbms2.com/2010/04/21/ita-software-needlebase-google/</link>
		<comments>http://www.dbms2.com/2010/04/21/ita-software-needlebase-google/#comments</comments>
		<pubDate>Wed, 21 Apr 2010 16:54:56 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data integration and middleware]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Oracle]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1949</guid>
		<description><![CDATA[Rumors are flying that Google may acquire ITA Software. I know nothing of their validity, but I have known about ITA Software for a while. Random notes include: ITA Software builds huge OLTP systems that it runs itself on behalf of airlines. Very, very unusually, ITA Software builds these huge OLTP systems in LISP. ITA [...]]]></description>
			<content:encoded><![CDATA[<p>Rumors are flying that <a href="http://www.bloomberg.com/apps/news?pid=newsarchive&amp;sid=aJXdCOdgJmw4">Google may acquire ITA Software</a>. I know nothing of their validity, but I have known about ITA Software for a while. Random notes include:</p>
<ul>
<li>ITA Software builds huge OLTP systems that it runs itself on behalf of airlines.</li>
<li>Very, very unusually, ITA Software builds these <a href="http://www.networkworld.com/community/node/29552">huge OLTP systems in LISP</a>.</li>
<li><a href="http://www.dbms2.com/2008/01/24/mysql-database/">ITA Software is an Oracle shop</a> (see Dan Weinreb&#8217;s comment).</li>
<li><a href="http://www.dbms2.com/2008/01/31/ellen-rubin-is-leaving-netezza/">ITA Software is run by a techie</a> (again, see Dan Weinreb&#8217;s comment).</li>
<li>ITA Software has an interesting screen-scraping/web ETL project called Needlebase.</li>
</ul>
<p>ITA&#8217;s software does both price/reservation lookup/checking and reservation-making. I&#8217;ve had trouble keeping it straight, but I think the lookup is ITA&#8217;s actual business, and the reservation-making is ITA&#8217;s Next Big Thing. This is one of the ultimate federated-transaction-processing applications, because it involves coordinating huge OLTP systems run, in some cases, by companies that are bitter competitors with each other. Network latencies have to allow for intercontinental travel of the data itself.</p>
<p><em>Indeed, airline reservation systems are pretty much the OLTP ultimate in themselves. As the story goes, transaction monitors were pretty much invented for airline reservation systems in the 1960s.</em></p>
<p>A really small project for ITA Software is Needlebase. I stopped by ITA to look at Needlebase in January, and it is a very smart and hence interesting screen-scraping system. The idea is that people publish database information to the web, and you may want to look at their web pages and recover the database records they are based on. Applications of this to the airline industry, which has hundreds of thousands of price changes per day &#8212; and I may be too low by one or two orders of magnitude when I say that &#8212; should be fairly obvious. ITA Software has aspirations of applying Needlebase to other sectors as well, or more precisely having users who do so. Last I looked, ITA hadn&#8217;t put significant resources behind stimulating Needlebase adoption &#8212; but Google might well change that.</p>
<p><em>Edit: I just re-found <a href="http://danweinreb.org/blog/the-failure-of-lisp-a-reply-to-brandon-werner">an old characterization of (some of) what ITA Software does</a> by &#8212; who else? &#8212; Dan Weinreb:</em></p>
<blockquote><p>I am working on our new product, an airline reservation system.  It’s an online transaction-processing system that must be up 99.99% of the time, maintaining maximum response time (e.g. on www.aircanada.com).  It’s a very, very complicated system.  The presentation layer is written in Java using conventional techniques.  The business rule layer is written in Common Lisp; about 500,000 lines of code (plus another 100,000 or so of open source libraries).  The database layer is Oracle RAC.  We operate our own data centers, some here in Massachusetts and a disaster-recovery site in Canada (separate power grid).</p></blockquote>
<p><em><strong>Related links</strong></em></p>
<ul>
<li><a href="http://www.itasoftware.com">ITA Software</a> and <a href="http://www.needlebase.com">Needlebase</a> websites</li>
<li><a href="http://www.dbms2.com/2008/03/07/lisp-humor/">More about LISP</a> <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/04/21/ita-software-needlebase-google/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Some NoSQL links</title>
		<link>http://www.dbms2.com/2010/03/12/some-nosql-links/</link>
		<comments>http://www.dbms2.com/2010/03/12/some-nosql-links/#comments</comments>
		<pubDate>Fri, 12 Mar 2010 23:51:42 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Amazon and its cloud]]></category>
		<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Continuent]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Tokutek]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1692</guid>
		<description><![CDATA[I plan to post a few things soon about MongoDB, Cassandra, and NoSQL in general. So I&#8217;m poking around a bit reading stuff on the subjects. Here are some links I found. A little over a year ago, Julian Browne put up a great post on Eric Brewer&#8217;s CAP conjecture/theorem, which provides much of the [...]]]></description>
			<content:encoded><![CDATA[<p>I plan to post a few things soon about MongoDB, Cassandra, and NoSQL in general. So I&#8217;m poking around a bit reading stuff on the subjects. Here are some links I found.<span id="more-1692"></span></p>
<ul>
<li>A little over a year ago, Julian Browne put up a great post on <a href="http://www.julianbrowne.com/article/viewer/brewers-cap-theorem">Eric Brewer&#8217;s CAP conjecture/theorem</a>, which provides much of the impetus to relax the traditional requirement for atomicity/consistency.</li>
<li>Even more directly inspirational to NoSQL technology development were two seminal papers: Google&#8217;s on <a href="http://labs.google.com/papers/bigtable.html">BigTable</a> and Amazon&#8217;s on <a href="http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf">Dynamo</a>. (That said, I&#8217;m having trouble getting myself to actually read them from start to finish, especially since they&#8217;ve been superseded by subsequent technology development.)</li>
<li>10gen (the MongoDB guys) hosted a NoSQL conference yesterday. Much blogging has ensued. The best post I&#8217;ve seen so far was by <a href="http://blog.marcua.net/post/442594842/notes-from-nosql-live-boston-2010">Adam Marcus</a>. I find the graph database notes near the bottom particularly interesting.</li>
<li>Mark Callaghan hit back against the <a href="http://mysqlha.blogspot.com/2010/03/plays-well-with-others.html">NoSQL <span style="text-decoration: line-through;">movement</span> hype</a>, and in particular against the &#8216;<a href="http://www.dbms2.com/2010/03/02/cassandra-nosql-scalable-oltp/">MySQL/memcached is passe</a>&#8217; meme. On the other hand, he also bemoaned many failings of MySQL. On the third hand, he praised or at least expressed hope for a variety of MySQL-related technologies, including <a href="http://www.dbms2.com/2009/04/16/introduction-to-tokutek/">Tokutek&#8217;s TokuDB</a> and <a href="http://www.dbms2.com/2009/09/03/continuent-on-clustering/">Continuent&#8217;s Tungsten</a>.</li>
<li>In connection with that debate, Mark Rendle offered a <a href="http://blog.markrendle.net/2010/03/do-you-need-relational-database.html">funny rant</a>, mainly pro-NoSQL, in the style of a Socratic dialogue.</li>
<li>John Quinn of Digg recently described <a href="http://www.stumbleupon.com/su/5099Ti/about.digg.com/node/564">Digg&#8217;s move from MySQL to Cassandra</a>, and outlined a lot of features Digg was adding to Cassandra, all of which it is open-sourcing.</li>
<li>The NoSQL guys maintain their own long <a href="http://nosql-database.org/links.html">list of NoSQL-related links</a>.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/03/12/some-nosql-links/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>More patent nonsense &#8212; Google MapReduce</title>
		<link>http://www.dbms2.com/2010/02/11/google-mapreduce-patent/</link>
		<comments>http://www.dbms2.com/2010/02/11/google-mapreduce-patent/#comments</comments>
		<pubDate>Thu, 11 Feb 2010 19:29:57 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Google]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Parallelization]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1565</guid>
		<description><![CDATA[Google recently received a patent for MapReduce. The first and most general claim is (formatting and emphasis mine): A system for large-scale processing of data, comprising: a plurality of processes executing on a plurality of interconnected processors; the plurality of processes including a master process, for coordinating a data processing job for processing a set [...]]]></description>
			<content:encoded><![CDATA[<p>Google recently received a <a href="http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&amp;Sect2=HITOFF&amp;d=PALL&amp;p=1&amp;u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&amp;r=1&amp;f=G&amp;l=50&amp;s1=7,650,331.PN.&amp;OS=PN/7,650,331&amp;RS=PN/7,650,331">patent</a> for MapReduce. The first and most general claim is (formatting and emphasis mine):<span id="more-1565"></span></p>
<blockquote><p>A system for large-scale processing of data, comprising:</p>
<ul>
<li>a plurality of processes executing on a plurality of interconnected processors;</li>
<li>the plurality of processes including a master process, for coordinating a data processing job for processing a set of input data, and worker processes;</li>
<li>the master process, in response to a request to perform the data processing job, assigning input data blocks of the set of input data to respective ones of the worker processes;</li>
<li>each of a first plurality of the worker processes <strong>including an application-independent map module</strong> for retrieving a respective input data block assigned to the worker process by the master process and <strong>applying an application-specific map operation</strong> to the respective input data block to produce intermediate data values, wherein at least a subset of the intermediate data values each comprises a <strong>key/value pair,</strong> and wherein at least two of the first plurality of the worker processes operate simultaneously so as to perform the application-specific map operation in <strong>parallel</strong> on distinct, respective input data blocks;</li>
<li>a partition operator for processing the produced intermediate data values to produce a plurality of intermediate data sets, wherein each respective intermediate data set includes <strong>all key/value pairs for a distinct set of respective keys,</strong> and wherein at least one of the respective intermediate data sets includes respective ones of the key/value pairs produced by a plurality of the first plurality of the worker processes;</li>
<li>and each of a second plurality of the worker processes including <strong>an application-independent reduce module for retrieving data,</strong> the retrieved data comprising at least a subset of the key/value pairs from a respective intermediate data set of the plurality of intermediate data sets and applying <strong>an application-specific reduce operation</strong> to the retrieved data to produce final output data corresponding to the distinct set of respective keys in the respective intermediate data set of the plurality of intermediate data sets, and wherein at least two of the second plurality of the worker processes operate simultaneously so as to perform the application-specific reduce operation in <strong>parallel</strong> on multiple respective subsets of the produced intermediate data values.</li>
</ul>
</blockquote>
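<p>For readers who find the claim language hard to parse, here is a minimal single-machine Python sketch of the pipeline shape it describes: parallel map over input blocks, a partition step that groups all pairs for a key together, and parallel reduce. It is an illustration under my own assumptions, not Google&#8217;s implementation and not a legal reading of the claim. The names <code>run_job</code>, <code>map_words</code> and <code>reduce_counts</code> are hypothetical, threads stand in for worker processes, there is no master process, and word counting is just a convenient example.</p>

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import zlib


def map_words(block):
    """Application-specific map: emit one (word, 1) pair per word."""
    return [(word, 1) for word in block.split()]


def reduce_counts(key, values):
    """Application-specific reduce: total the values for one key."""
    return key, sum(values)


def run_job(blocks, map_fn, reduce_fn, partitions=2):
    """Application-independent driver: map, partition by key, then reduce."""
    # Map phase: workers process distinct input blocks in parallel.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_fn, blocks))

    # Partition: every pair for a given key lands in the same intermediate set,
    # even when the pairs were produced by different map workers.
    parts = [defaultdict(list) for _ in range(partitions)]
    for pairs in mapped:
        for key, value in pairs:
            index = zlib.crc32(key.encode("utf-8")) % partitions
            parts[index][key].append(value)

    # Reduce phase: one task per intermediate set, again in parallel.
    def reduce_part(part):
        return [reduce_fn(key, values) for key, values in part.items()]

    with ThreadPoolExecutor() as pool:
        reduced = list(pool.map(reduce_part, parts))
    return dict(pair for part in reduced for pair in part)
```

<p>Calling <code>run_job(["a b a", "b c"], map_words, reduce_counts)</code> returns the counts a: 2, b: 2, c: 1. Only <code>map_words</code> and <code>reduce_counts</code> know anything about the application; the driver is generic, which is the distinction the bolded phrases in the claim turn on.</p>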
<p><em>The way a patent works is that you make a big claim and, just in case it&#8217;s later invalidated, you also make more specialized sub-claims. What&#8217;s more, in a software patent, you claim everything twice, once as a &#8220;system&#8221; and once as a &#8220;method.&#8221;</em></p>
<p>When a patent takes that long to issue and has a core claim that wordy, one can assume there was much back and forth with the PTO (Patent and Trademark Office) to whittle it down to something they felt they could approve. At a guess, I&#8217;d conjecture that the supposedly unique parts of the claim are concentrated in the areas I bolded above, and that the PTO doesn&#8217;t think the claim would be patentable unless most or all of them were included.</p>
<p>So should the claim have been approved even so? Let&#8217;s consider prior art. <a href="../../../../../2009/10/06/oracle-mapreduce/">Oracle has long been able to parallelize à la MapReduce</a>. I don&#8217;t see anything in the claim that isn&#8217;t preceded by what Oracle did, except maybe the emphasis on key/value pairs. (And the same statement applies to the other 15 claims in the patent, at least on a quick skim.) I forget the details of SenSage&#8217;s quasi-MapReduce, which also preceded the Google patent filing, but I imagine something similar would be true about it.</p>
<p>There is no doubt that Google popularized the ideas of MapReduce &#8212; which turns out to have been a worthy public service. In one great example of that popularization, <a href="http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf">the seminal paper on parallel data mining</a> is almost laughable in how it <a href="../../../../../2009/10/15/mapreduce-webinar-slides/">deviates from MapReduce key/value pair formalism</a> &#8212; but it still seems to have been inspired by Google&#8217;s MapReduce. But that&#8217;s a different matter; popularization != invention, even though there&#8217;s a certain connection between the two in patent law. Actually, Google also often does get credit for having &#8220;invented&#8221; MapReduce, including regrettably in the marketing materials of clients I can&#8217;t talk out of saying that and which now might be looking into the barrel of the Google patent (hello Aster); but again, saying something doesn&#8217;t make it enforceable in court.</p>
<p>So what it all boils down to is:</p>
<p><strong>Should Google&#8217;s patent on the idea of parallelizing the handling of sets of application-visible key/value pairs be regarded as valid?</strong></p>
<p>The United States PTO, which is paid to think about these things, has evidently decided Yes. I disagree. In simplest terms, my reason is that key/value pairs have been around for decades, and so:</p>
<p><strong>Anything which was known or obvious without special reference to key/value pairs doesn&#8217;t suddenly become non-obvious when key/value pairs are mixed in.</strong></p>
<p>If Google ever tries to enforce its MapReduce patent, I&#8217;m available as an expert witness for the other side.</p>
<p><strong><em>Related links</em></strong></p>
<ul>
<li><a href="http://gigaom.com/2010/01/19/why-hadoop-users-shouldnt-fear-googles-new-mapreduce-patent/">GigaOm</a> and <a href="http://arstechnica.com/open-source/news/2010/01/googles-mapreduce-patent-what-does-it-mean-for-hadoop.ars">Ars Technica</a> on the Google MapReduce patent</li>
<li>Another <a href="http://www.dbms2.com/2010/01/15/vertica-sybase-ipatent-litigation/">silly software patent</a> issue</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/02/11/google-mapreduce-patent/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
		<item>
		<title>Clearing up MapReduce confusion, yet again</title>
		<link>http://www.dbms2.com/2009/12/30/clearing-up-mapreduce-confusion-yet-again/</link>
		<comments>http://www.dbms2.com/2009/12/30/clearing-up-mapreduce-confusion-yet-again/#comments</comments>
		<pubDate>Wed, 30 Dec 2009 10:50:53 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[SenSage]]></category>
		<category><![CDATA[Splunk]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1371</guid>
		<description><![CDATA[I&#8217;m frustrated by a constant need &#8212; or at least urge &#8212; to correct myths and errors about MapReduce. Let&#8217;s try one more time: MapReduce was named and popularized &#8212; but not invented &#8212; by Google. &#8220;MapReduce&#8221; variously refers to: A programming paradigm Execution engines that implement the programming paradigm Distributed file systems that work [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m frustrated by a constant need &#8212; or at least urge <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  &#8212; to correct <a href="http://www.dbms2.com/2009/10/18/three-big-myths-about-mapreduce/">myths and errors about MapReduce</a>. Let&#8217;s try one more time:<span id="more-1371"></span></p>
<ul>
<li>MapReduce was named and popularized &#8212; but not invented &#8212; by Google.</li>
<li>&#8220;MapReduce&#8221; variously refers to:
<ul>
<li>A programming paradigm</li>
<li>Execution engines that implement the programming paradigm</li>
<li>Distributed file systems that work with the execution engines</li>
</ul>
</li>
<li>In particular, Hadoop is a MapReduce execution engine that includes or is closely associated with HDFS (Hadoop Distributed File System).</li>
<li>MapReduce and analytic DBMS can interact in a number of different ways, including:
<ul>
<li>Tight integration between a DBMS and exposed MapReduce functionality, e.g. <a href="http://www.dbms2.com/2009/10/15/mapreduce-webinar-slides/">Aster Data&#8217;s SQL/MapReduce</a> or Greenplum.</li>
<li>Integrated MapReduce &#8220;under the covers&#8221;, e.g. SenSage or <a href="http://www.dbms2.com/2009/10/06/oracle-mapreduce/">Oracle</a>. This may or may not follow all the rules Google laid out for MapReduce, but it&#8217;s at least similar in spirit.</li>
<li>Looser coupling between DBMS and a MapReduce system, e.g. <a href="http://www.dbms2.com/2009/08/04/verticas-version-of-mapreduce-integration/">Vertica/Hadoop</a>, in which MapReduce may or may not run on a different cluster than the DBMS.</li>
<li>Not at all, except perhaps insofar as a quasi-DBMS such as <a href="http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/">Hive</a> is implemented over a MapReduce system such as Hadoop/HDFS.</li>
</ul>
</li>
<li>As predicted by <a href="http://www.strategicmessaging.com/monashs-first-law-of-commercial-semantics-explained/2009/01/09/">Monash&#8217;s First Law of Commercial Semantics</a>, different vendors have individual variants on those themes. For example, as per <a href="http://www.splunk.com/product">a registration-required white paper</a>, Splunk is moving to publicly expose a not-quite-complete form of MapReduce.</li>
<li>MapReduce implementations such as Hadoop are sometimes regarded as part of the <a href="http://www.dbms2.com/2009/12/12/legit-nosql-key-value-store/">NoSQL</a> &#8220;movement&#8221;. When they are, many generalities about NoSQL &#8212; such as that it doesn&#8217;t deal with analytics &#8212; are falsified.</li>
<li>So far as I can tell, mainstream enterprise (as opposed to web, scientific, investment, etc.) data mining folks may be looking at MapReduce for data mining, but they haven&#8217;t done much to adopt it yet. Probably that&#8217;s because the outfits who have the greatest need are the same ones that have the largest sunk investments in more traditional ways of doing data mining.</li>
<li>Cloudera != Hadoop. On the other hand, if you want to use Hadoop, it makes a lot of sense to do business with Cloudera.</li>
<li>Non-DBMS MapReduce != Hadoop. On the other hand, Hadoop is the default choice for non-DBMS MapReduce.</li>
<li>MapReduce != Hadoop, period. DBMS-based MapReduce is also a legitimate technical strategy.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/12/30/clearing-up-mapreduce-confusion-yet-again/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
	</channel>
</rss>

