<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; Google</title>
	<atom:link href="http://www.dbms2.com/category/products-and-vendors/google-mapreduce-bigtable/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Thu, 02 Sep 2010 09:06:44 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Nested data structures keep coming up, especially for log files</title>
		<link>http://www.dbms2.com/2010/07/31/nested-data-structures-keep-coming-up-especially-for-log-files/</link>
		<comments>http://www.dbms2.com/2010/07/31/nested-data-structures-keep-coming-up-especially-for-log-files/#comments</comments>
		<pubDate>Sat, 31 Jul 2010 10:42:06 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Theory and architecture]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2723</guid>
		<description><![CDATA[Nested data structures have come up several times now, almost always in the context of log files.

Google has published about a project called Dremel. Per Tasso Argyros, one of Dremel&#8217;s key concepts is nested data structures.
Those arrays that the XLDB/SciDB folks keep talking about are meant to be nested data structures. Scientific data is of [...]]]></description>
			<content:encoded><![CDATA[<p>Nested data structures have come up several times now, almost always in the context of log files.</p>
<ul>
<li>Google has published about a project called <a href="http://www.asterdata.com/blog/index.php/2010/07/19/google%E2%80%99s-dremel-%E2%80%93-or-can-mapreduce-itself-handle-fast-interactive-querying/" onclick="javascript:pageTracker._trackPageview('/www.asterdata.com');">Dremel</a>. Per Tasso Argyros, one of Dremel&#8217;s key concepts is nested data structures.</li>
<li>Those <a href="http://www.dbms2.com/2009/10/03/issues-in-scientific-data-management/" >arrays</a> that the XLDB/SciDB folks keep talking about are meant to be nested data structures. Scientific data is of course log-oriented. <a href="http://www.dbms2.com/2010/05/22/scidb-and-scientific-database-management/" >eBay was very interested in that project too</a>.</li>
<li>Facebook&#8217;s log files have a big nested data structure flavor.</li>
</ul>
<p>I don&#8217;t have a grasp yet on what exactly is happening here, but it&#8217;s something.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/07/31/nested-data-structures-keep-coming-up-especially-for-log-files/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Cassandra technical overview</title>
		<link>http://www.dbms2.com/2010/07/06/cassandra-technical-overview/</link>
		<comments>http://www.dbms2.com/2010/07/06/cassandra-technical-overview/#comments</comments>
		<pubDate>Tue, 06 Jul 2010 09:10:39 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Amazon and its cloud]]></category>
		<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Riptano]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2473</guid>
		<description><![CDATA[Back in March, I talked with Jonathan Ellis of Rackspace, who runs the Apache Cassandra project. I started drafting a blog post then, but never put it up. Then Jonathan cofounded Riptano, a company to commercialize Cassandra, and so I talked with him again in May. Well, I&#8217;m finally finding time to clear my Cassandra/Riptano [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Back in March, I talked with Jonathan Ellis of Rackspace, who runs the Apache Cassandra project. I started drafting a blog post then, but never put it up. Then Jonathan cofounded Riptano, a company to commercialize Cassandra, and so I talked with him again in May. Well, I&#8217;m finally finding time to clear my Cassandra/Riptano backlog. I&#8217;ll cover the more technical parts below, and the more business- or usage-oriented ones in <a href="http://www.dbms2.com/2010/07/06/riptano-and-cassandra-adoption/" >a companion Cassandra/Riptano post</a>.</p>
<p style="margin-bottom: 0in;">Jonathan&#8217;s core claims for Cassandra include:</p>
<ul>
<li>Cassandra is shared-nothing.</li>
<li>Cassandra has good approaches to 	replication and partitioning, right out of the box.</li>
<li>In particular, Cassandra is good 	for use cases that distribute a database around the world and want 	to access it at “local” latencies. (Indeed, Jonathan asserts 	that non-local replication is a significant non-big-data Cassandra 	use case.)</li>
<li>Cassandra&#8217;s scale-out is 	application-transparent, unlike sharded MySQL&#8217;s.</li>
<li> Cassandra is fast at both appends 	and range queries, which would be hard to accomplish in a pure 	key-value store.</li>
</ul>
<p style="margin-bottom: 0in;">In general, Jonathan positions Cassandra as being best-suited to handle a small number of operations at high volume, throughput, and speed. The rest of what you do, as far as he&#8217;s concerned, may well belong in a more traditional SQL DBMS.  <span id="more-2473"></span></p>
<p style="margin-bottom: 0in;">Further highlights of our talks included, as best I understood them:</p>
<ul>
<li>Cassandra is based in parts both 	on Google&#8217;s <strong>BigTable</strong> paper of 2006 and Amazon&#8217;s <strong>Dynamo</strong> paper of 2007.
<ul>
<li>The core of what Cassandra takes 	from BigTable is based on <strong>log-structured merge trees,</strong> which 	actually entered the computer science literature in 1996.</li>
<li>Cassandra&#8217;s approach to horizontal scaling, replication, failover, etc. seems to be based on Dynamo.</li>
</ul>
</li>
<li>There seems to be <strong>a logical 	concept of “row”</strong> in Cassandra, or it&#8217;s at least meaningful 	to use the SQL/relational concept of a “row” when talking about 	Cassandra data. However, Cassandra is closer to being a <strong>column-based 	data store</strong> than a row-based one. (Not the same thing, but 	closer.)</li>
<li>Even so, it only takes a single 	seek to return a whole Cassandra “row”.</li>
<li>Cassandra 	writes data quite differently from the way a classical OLTP DBMS 	would.
<ul>
<li><strong>Cassandra writes just the data 	elements</strong><span style="font-weight: normal;"> – i.e., fields – </span><strong>that are actually being inserted or changed,</strong> not whole 	rows.</li>
<li>One benefit is that Cassandra data 	is very <strong>sparse.</strong> NULLs aren&#8217;t stored in any way, and hence in 	particular take up no space.</li>
<li>Another benefit – and one of the 	core concepts of Cassandra – is that <strong>you can implicitly assume 	different schemas for different rows of the same “table.”</strong> In 	particular, you can add data for columns that you didn&#8217;t envision 	when you first started storing “rows” of the same “table.”</li>
<li><strong>Writes are collected into 	sorted “memtables,” which from time to time are sent to disk.</strong> Once data gets to disk, it&#8217;s <strong>immutable,</strong> except for occasional 	merge/reorganization/garbage collection.
<ul>
<li>Jonathan claims, plausibly, that this makes write throughput very fast (because the I/O is fundamentally sequential in nature).</li>
<li>The default as to how long data typically stays in memory before it gets persisted to disk is “whichever comes first of {64 MB written, 300k updates, 1 hour}”.</li>
</ul>
</li>
<li>Cassandra has <strong>durability</strong> – 	guaranteed non-loss of data – assuming fsync is turned on. fsync 	seems to create a 15% or so overhead.</li>
<li>However, Cassandra has <strong>no 	concept of a “transaction.”</strong></li>
<li>As one would 	expect, data can be read even before it has been persisted to disk.</li>
</ul>
</li>
<li>According to 	Jonathan, Cassandra can do about 14,000 writes or 7,000 reads per 	second, on a quad-core server.
<ul>
<li>Those figures scale pretty 	linearly with the number of servers. (There&#8217;s some overhead for 	network latency.)</li>
<li>Those figures assume a five-column 	row.</li>
<li>Cassandra&#8217;s write-performance 	figures are only “mildly sensitive” to the width of the row. 	E.g., doubling row width only gives a 15-20% throughput hit, due to 	some fixed per-row overhead. That said, I imagine going 100X in row 	width would create a major slowdown, although perhaps while 	measuring width more in bytes than in column count.</li>
<li>Cassandra&#8217;s <a href="http://racklabs.com/%7Ebwilliam/cassandra/04vs05vs06.png" onclick="javascript:pageTracker._trackPageview('/racklabs.com');">performance</a> has been growing nicely in each point release. Jonathan thinks this general trend will continue.</li>
</ul>
</li>
<li>Jonathan thinks Cassandra is 	pretty good at keeping your data safe.
<ul>
<li>Each node has a commit log.</li>
<li>When a node goes down, its writes 	are buffered until it comes back up.</li>
</ul>
</li>
<li>You can run Hadoop MapReduce 	straight against Cassandra files.</li>
<li>A Cassandra node might hold 	anything from 10s of gigabytes to multiple terabytes of data. You 	might want to go with the low end if you want to have lots of cache 	hits.</li>
<li>Solid-state storage would speed up 	Cassandra reads, not writes, and is not widely used with Cassandra 	yet.</li>
<li>Jonathan says Cassandra is really 	good at handling time series data, by which I suspect he means log 	files. <a href="https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/" onclick="javascript:pageTracker._trackPageview('/www.cloudkick.com');">Cloudkick</a> is a user of this capability.</li>
</ul>
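<p style="margin-bottom: 0in;">To make the write path above a bit more concrete, here is a minimal Python sketch of the memtable idea: writes touch only the columns supplied, accumulate in memory, and are flushed in sorted order to tables that are never modified afterwards. This is a toy under stated assumptions, not Cassandra&#8217;s actual code; the class name <code>TinyColumnStore</code> and the <code>flush_after_updates</code> threshold are hypothetical, and the real defaults quoted above (64 MB, 300k updates, 1 hour) are not modeled.</p>

```python
class TinyColumnStore:
    """Toy model of a memtable-plus-immutable-tables write path (illustrative only)."""

    def __init__(self, flush_after_updates=3):
        # Hypothetical threshold standing in for the size/count/time limits.
        self.flush_after_updates = flush_after_updates
        self.memtable = {}   # row key -> {column name: value}
        self.updates = 0
        self.sstables = []   # flushed tables, oldest first; never modified

    def write(self, row_key, columns):
        # Only the columns actually supplied are stored, so a "row" can be
        # sparse and different rows can carry different columns.
        self.memtable.setdefault(row_key, {}).update(columns)
        self.updates += 1
        if self.updates >= self.flush_after_updates:
            self.flush()

    def flush(self):
        # Emit the memtable in sorted key order, as one sequential write would.
        # Nested tuples stand in for an immutable on-disk file.
        if not self.memtable:
            return
        table = tuple(
            (key, tuple(sorted(self.memtable[key].items())))
            for key in sorted(self.memtable)
        )
        self.sstables.append(table)
        self.memtable = {}
        self.updates = 0

    def read(self, row_key):
        # Merge oldest to newest so later writes win, then overlay the
        # memtable: data is readable before it has been flushed.
        row = {}
        for table in self.sstables:
            for key, cols in table:
                if key == row_key:
                    row.update(cols)
        row.update(self.memtable.get(row_key, {}))
        return row


store = TinyColumnStore(flush_after_updates=2)
store.write("user1", {"name": "Ann"})
store.write("user2", {"name": "Bob", "city": "Boston"})  # second update triggers a flush
store.write("user1", {"email": "ann@example.com"})       # stays in the memtable
print(store.read("user1"))
```

<p style="margin-bottom: 0in;">The sketch leaves out the commit log, the merge/compaction step, and any indexing; a real read would not scan every flushed table linearly.</p>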
<p style="margin-bottom: 0in;">I certainly didn&#8217;t grasp everything about Cassandra replication and partitioning strategies. That wasn&#8217;t the focus of our talks, and anyway I got the impression they are so flexible that there&#8217;s little that can firmly be said about them. But I did get these impressions:</p>
<ul>
<li>You set your consistency rules in 	the Cassandra API, not on a per-table basis. (I think this means 	that a lack of administrative tools is supposedly a feature, not a 	drawback.)</li>
<li>As a practical matter, Cassandra 	users commonly take one of two approaches to consistency:
<ul>
<li><a href="http://www.dbms2.com/2010/05/01/ryw-read-your-writes-consistency/" >RYW consistency</a>, most 	commonly with N = 3 and R = W = 2.</li>
<li>Geographically dispersed eventual 	consistency.</li>
</ul>
</li>
<li>Cassandra data is most commonly 	distributed via consistent hashing, but other options are 	“pluggable.”</li>
<li>If you add a node, the busiest node automagically decides to ship some data over, reducing its load. Of course, this only works if you get the new node on before the old node is so maxed out it doesn&#8217;t have time to do the shipping.</li>
</ul>
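<p style="margin-bottom: 0in;">Two of the ideas above lend themselves to a short Python sketch: placing data on nodes by consistent hashing, and the arithmetic behind N = 3, R = W = 2 (any read set and any write set of replicas must overlap when R + W > N). The <code>Ring</code> class, <code>ring_position</code>, and <code>quorums_overlap</code> are hypothetical names for illustration, and MD5 is an arbitrary choice of stable hash; none of this is taken from Cassandra&#8217;s code.</p>

```python
import bisect
import hashlib


def ring_position(name):
    # Stable hash so placement is identical on every run. MD5 is used
    # only to spread names around the ring, not for security.
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy consistent-hashing ring with one position per node."""

    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        self.positions = sorted((ring_position(n), n) for n in nodes)

    def add_node(self, node):
        bisect.insort(self.positions, (ring_position(node), node))

    def nodes_for(self, key):
        # Walk clockwise from the key's position and take the next
        # `replicas` nodes, wrapping around the end of the ring.
        start = bisect.bisect(self.positions, (ring_position(key), ""))
        count = min(self.replicas, len(self.positions))
        size = len(self.positions)
        return [self.positions[(start + i) % size][1] for i in range(count)]


def quorums_overlap(n, r, w):
    # With n replicas, a read of r and a write of w must share at least
    # one replica exactly when r + w exceeds n.
    return r + w > n


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replicas=3)
before = {key: ring.nodes_for(key) for key in ("row1", "row2", "row3", "row4")}
ring.add_node("node-e")
after = {key: ring.nodes_for(key) for key in before}
moved = [key for key in before if before[key] != after[key]]
print("keys whose replica set changed:", moved)
print("N=3, R=W=2 overlaps:", quorums_overlap(3, 2, 2))
```

<p style="margin-bottom: 0in;">Only keys whose clockwise walk now passes through the new node change their replica set; everything else stays where it was, which is the point of consistent hashing.</p>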
<p style="margin-bottom: 0in;">When we talked in March, the next release of Cassandra was going to be 0.7. Cassandra 0.7 was going to be a performance/scalability release, for example fixing the flaw that garbage collection read rows into memory one at a time. After that, Cassandra 0.8 was to be a feature release, with one planned feature being more automatic index management and/or materialized-view-like capability, so as to reduce the burden on Cassandra developers of schema management.</p>
<p style="margin-bottom: 0in;"><em><strong>Related links</strong></em></p>
<ul>
<li>My March <a href="../2010/03/12/some-nosql-links/">NoSQL links post</a> included the Google and Amazon papers</li>
<li>The <a href="https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/" onclick="javascript:pageTracker._trackPageview('/www.cloudkick.com');">March 	2, 2010 Cloudkick post</a> also linked above goes into a lot of 	detail, including what they think is great about Cassandra and what 	they think is still missing</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/07/06/cassandra-technical-overview/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Various quick notes</title>
		<link>http://www.dbms2.com/2010/05/23/various-quick-notes/</link>
		<comments>http://www.dbms2.com/2010/05/23/various-quick-notes/#comments</comments>
		<pubDate>Sun, 23 May 2010 08:38:51 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[GIS and geospatial]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[SAP AG]]></category>
		<category><![CDATA[SAS Institute]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2173</guid>
		<description><![CDATA[As you might imagine, there are a lot of blog posts I&#8217;d like to write I never seem to get around to, or things I&#8217;d like to comment on that I don&#8217;t want to bother ever writing a full post about. In some cases I just tweet a comment or link and leave it at [...]]]></description>
			<content:encoded><![CDATA[<p>As you might imagine, there are a lot of blog posts I&#8217;d like to write I never seem to get around to, or things I&#8217;d like to comment on that I don&#8217;t want to bother ever writing a full post about. In some cases I just <a href="http://twitter.com/CurtMonash" onclick="javascript:pageTracker._trackPageview('/twitter.com');">tweet</a> a comment or link and leave it at that.</p>
<p>And it&#8217;s not going to get any better. Next week = the oft-postponed elder care trip. Then I&#8217;m back for a short week. Then I&#8217;m off on my quarterly visit to the SF area. Soon thereafter I&#8217;ll have a lot to do in connection with <a href="http://www.netezza.com/userconference/speakers.html" onclick="javascript:pageTracker._trackPageview('/www.netezza.com');">Enzee Universe</a>. And at that point another month will have gone by.</p>
<p>Anyhow:<span id="more-2173"></span></p>
<ul>
<li>Back in January, Oracle finally briefed me on <a href="http://www.dbms2.com/2010/01/22/oracle-database-hardware-strategy/" >Exadata 2</a>. I also requested and got permission to post what I regarded as pretty interesting slides, then never got around to doing so. Well, <a href="http://www.monash.com/uploads/Exadata-slides-January-2010.pdf" onclick="javascript:pageTracker._trackPageview('/www.monash.com');">here they are</a>. (Pay no attention to the word &#8220;Confidential&#8221;.)</li>
<li>Two people I have a lot of respect for, <a href="http://intelligent-enterprise.informationweek.com/blog/archives/2010/05/sap_and_inmemor.html" onclick="javascript:pageTracker._trackPageview('/intelligent-enterprise.informationweek.com');">Cindi Howson</a> and <a href="http://intelligent-enterprise.informationweek.com/blog/archives/2010/05/quick_takes_on.html" onclick="javascript:pageTracker._trackPageview('/intelligent-enterprise.informationweek.com');">Doug Henschen</a>, seem bullish on SAP&#8217;s in-memory NewDB efforts. But for a variety of execution reasons, I&#8217;m skeptical that this will matter for anything except SAP&#8217;s analytics suite. I.e., I don&#8217;t think anybody much except SAP will write OLTP apps to it, and I don&#8217;t think that, without OLTP apps being written to it, it&#8217;s much more than Business Objects&#8217; answer to QlikView.</li>
<li>I just learned that <a href="http://www.thestreet.com/story/10640248/1/tech-rights-give-companies-upper-hand.html" onclick="javascript:pageTracker._trackPageview('/www.thestreet.com');">Netezza&#8217;s previous geospatial technology didn&#8217;t get ported to TwinFin</a>. However, <a href="http://www.netezza.com/releases/2010/release021710.htm" onclick="javascript:pageTracker._trackPageview('/www.netezza.com');">Netezza obviously found a geospatial alternative</a>.</li>
</ul>
<p>I&#8217;m beginning to make a habit of asking vendors for a postable version of their slide decks. <a href="http://www.dbms2.com/2010/05/23/sybase-iq-15/" >Sybase IQ</a> is another example.</p>
<ul>
<li>Google is doing something called <a href="http://googlecode.blogspot.com/2010/05/bigquery-and-prediction-api-get-more.html" onclick="javascript:pageTracker._trackPageview('/googlecode.blogspot.com');">BigQuery</a> that is &#8220;SQL-like&#8221; for big data analytics. I don&#8217;t know anything about it.</li>
<li>I also don&#8217;t know anything about <a href="http://www-01.ibm.com/software/ebusiness/jstart/bigsheets/" onclick="javascript:pageTracker._trackPageview('/www-01.ibm.com');">IBM BigSheets</a> yet. It sounds something like <a href="http://www.dbms2.com/2010/04/16/introduction-to-datameer/" >Datameer</a>, but that could be way off the mark.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/05/23/various-quick-notes/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>ITA Software and Needlebase</title>
		<link>http://www.dbms2.com/2010/04/21/ita-software-needlebase-google/</link>
		<comments>http://www.dbms2.com/2010/04/21/ita-software-needlebase-google/#comments</comments>
		<pubDate>Wed, 21 Apr 2010 16:54:56 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data integration and middleware]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Oracle]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1949</guid>
		<description><![CDATA[Rumors are flying that Google may acquire ITA Software. I know nothing of their validity, but I have known about ITA Software for a while. Random notes include:

ITA Software builds huge OLTP systems that it runs itself on behalf of airlines.
Very, very unusually, ITA Software builds these huge OLTP systems in LISP.
ITA Software is an [...]]]></description>
			<content:encoded><![CDATA[<p>Rumors are flying that <a href="http://www.bloomberg.com/apps/news?pid=newsarchive&amp;sid=aJXdCOdgJmw4" onclick="javascript:pageTracker._trackPageview('/www.bloomberg.com');">Google may acquire ITA Software</a>. I know nothing of their validity, but I have known about ITA Software for a while. Random notes include:</p>
<ul>
<li>ITA Software builds huge OLTP systems that it runs itself on behalf of airlines.</li>
<li>Very, very unusually, ITA Software builds these <a href="http://www.networkworld.com/community/node/29552" onclick="javascript:pageTracker._trackPageview('/www.networkworld.com');">huge OLTP systems in LISP</a>.</li>
<li><a href="http://www.dbms2.com/2008/01/24/mysql-database/" >ITA Software is an Oracle shop</a> (see Dan Weinreb&#8217;s comment).</li>
<li><a href="http://www.dbms2.com/2008/01/31/ellen-rubin-is-leaving-netezza/" >ITA Software is run by a techie</a> (again, see Dan Weinreb&#8217;s comment).</li>
<li>ITA Software has an interesting screen-scraping/web ETL project called Needlebase.</li>
</ul>
<p>ITA&#8217;s software does both price/reservation lookup/checking and reservation-making. I&#8217;ve had trouble keeping it straight, but I think the lookup is ITA&#8217;s actual business, and the reservation-making is ITA&#8217;s Next Big Thing. This is one of the ultimate federated-transaction-processing applications, because it involves coordinating huge OLTP systems run, in some cases, by companies that are bitter competitors with each other. Network latencies have to allow for intercontinental travel of the data itself.</p>
<p><em>Indeed, airline reservation systems are pretty much the OLTP ultimate in themselves. As the story goes, transaction monitors were pretty much invented for airline reservation systems in the 1960s.</em></p>
<p>A really small project for ITA Software is Needlebase. I stopped by ITA to look at Needlebase in January, and it is a very smart and hence interesting screen-scraping system. The idea is that people publish database information to the web, and you may want to look at their web pages and recover the database records they are based on. Applications of this to the airline industry, which has 100s of 1000s of price changes per day &#8212; and I may be too low by one or two orders of magnitude when I say that &#8212; should be fairly obvious. ITA Software has aspirations of applying Needlebase to other sectors as well, or more precisely having users who do so. Last I looked, ITA hadn&#8217;t put significant resources behind stimulating Needlebase adoption &#8212; but Google might well change that.</p>
<p><em>Edit: I just re-found <a href="http://danweinreb.org/blog/the-failure-of-lisp-a-reply-to-brandon-werner" onclick="javascript:pageTracker._trackPageview('/danweinreb.org');">an old characterization of (some of) what ITA Software does</a> by &#8212; who else? &#8212; Dan Weinreb:</em></p>
<blockquote><p>I am working on our new product, an airline reservation system.  It’s an online transaction-processing system that must be up 99.99% of the time, maintaining maximum response time (e.g. on www.aircanada.com).  It’s a very, very complicated system.  The presentation layer is written in Java using conventional techniques.  The business rule layer is written in Common Lisp; about 500,000 lines of code (plus another 100,000 or so of open source libraries).  The database layer is Oracle RAC.  We operate our own data centers, some here in Massachusetts and a disaster-recovery site in Canada (separate power grid).</p></blockquote>
<p><em><strong>Related links</strong></em></p>
<ul>
<li><a href="http://www.itasoftware.com" onclick="javascript:pageTracker._trackPageview('/www.itasoftware.com');">ITA Software</a> and <a href="http://www.needlebase.com" onclick="javascript:pageTracker._trackPageview('/www.needlebase.com');">Needlebase</a> websites</li>
<li><a href="http://www.dbms2.com/2008/03/07/lisp-humor/" >More about LISP</a> <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/04/21/ita-software-needlebase-google/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Some NoSQL links</title>
		<link>http://www.dbms2.com/2010/03/12/some-nosql-links/</link>
		<comments>http://www.dbms2.com/2010/03/12/some-nosql-links/#comments</comments>
		<pubDate>Fri, 12 Mar 2010 23:51:42 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Amazon and its cloud]]></category>
		<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Continuent]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Tokutek]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1692</guid>
		<description><![CDATA[I plan to post a few things soon about MongoDB, Cassandra, and NoSQL in general. So I&#8217;m poking around a bit reading stuff on the subjects. Here are some links I found.

A little over a year ago, Julian Browne put up a great post on Eric Brewer&#8217;s CAP conjecture/theorem, which provides much of the impetus [...]]]></description>
			<content:encoded><![CDATA[<p>I plan to post a few things soon about MongoDB, Cassandra, and NoSQL in general. So I&#8217;m poking around a bit reading stuff on the subjects. Here are some links I found.<span id="more-1692"></span></p>
<ul>
<li>A little over a year ago, Julian Browne put up a great post on <a href="http://www.julianbrowne.com/article/viewer/brewers-cap-theorem" onclick="javascript:pageTracker._trackPageview('/www.julianbrowne.com');">Eric Brewer&#8217;s CAP conjecture/theorem</a>, which provides much of the impetus to relax the traditional requirement for atomicity/consistency.</li>
<li>Even more directly inspirational to NoSQL technology development were two seminal papers: Google&#8217;s on <a href="http://labs.google.com/papers/bigtable.html" onclick="javascript:pageTracker._trackPageview('/labs.google.com');">BigTable</a> and Amazon&#8217;s on <a href="http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf" onclick="javascript:pageTracker._trackPageview('/s3.amazonaws.com');">Dynamo</a>. (That said, I&#8217;m having trouble getting myself to actually read them from start to finish, especially since they&#8217;ve been superseded by subsequent technology development.)</li>
<li>10gen (the MongoDB guys) hosted a NoSQL conference yesterday. Much blogging has ensued. The best post I&#8217;ve seen so far was by <a href="http://blog.marcua.net/post/442594842/notes-from-nosql-live-boston-2010" onclick="javascript:pageTracker._trackPageview('/blog.marcua.net');">Adam Marcus</a>. I find the graph database notes near the bottom particularly interesting.</li>
<li>Mark Callaghan hit back against the <a href="http://mysqlha.blogspot.com/2010/03/plays-well-with-others.html" onclick="javascript:pageTracker._trackPageview('/mysqlha.blogspot.com');">NoSQL <span style="text-decoration: line-through;">movement</span> hype</a>, and in particular against the &#8220;<a href="http://www.dbms2.com/2010/03/02/cassandra-nosql-scalable-oltp/" >MySQL/memcached is passe</a>&#8221; meme. On the other hand, he also bemoaned many failings of MySQL. On the third hand, he praised or at least expressed hope for a variety of MySQL-related technologies, including <a href="http://www.dbms2.com/2009/04/16/introduction-to-tokutek/" >Tokutek&#8217;s TokuDB</a> and <a href="http://www.dbms2.com/2009/09/03/continuent-on-clustering/" >Continuent&#8217;s Tungsten</a>.</li>
<li>In connection with that debate, Mark Rendle offered a <a href="http://blog.markrendle.net/2010/03/do-you-need-relational-database.html" onclick="javascript:pageTracker._trackPageview('/blog.markrendle.net');">funny rant</a>, mainly pro-NoSQL, in the style of a Socratic dialogue.</li>
<li>John Quinn of Digg recently described <a href="http://www.stumbleupon.com/su/5099Ti/about.digg.com/node/564" onclick="javascript:pageTracker._trackPageview('/www.stumbleupon.com');">Digg&#8217;s move from MySQL to Cassandra</a>, and outlined a lot of features Digg was adding to Cassandra, all of which it is open-sourcing.</li>
<li>The NoSQL guys maintain their own long <a href="http://nosql-database.org/links.html" onclick="javascript:pageTracker._trackPageview('/nosql-database.org');">list of NoSQL-related links</a>.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/03/12/some-nosql-links/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>More patent nonsense &#8212; Google MapReduce</title>
		<link>http://www.dbms2.com/2010/02/11/google-mapreduce-patent/</link>
		<comments>http://www.dbms2.com/2010/02/11/google-mapreduce-patent/#comments</comments>
		<pubDate>Thu, 11 Feb 2010 19:29:57 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Google]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Parallelization]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1565</guid>
		<description><![CDATA[Google recently received a patent for MapReduce. The first and most general claim is (formatting and emphasis mine):
A system for large-scale processing of data, comprising:

a plurality of processes executing on a plurality of interconnected processors;
the plurality of processes including a master process, for coordinating a data processing job for processing a set of input data, [...]]]></description>
			<content:encoded><![CDATA[<p>Google recently received a <a href="http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&amp;Sect2=HITOFF&amp;d=PALL&amp;p=1&amp;u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&amp;r=1&amp;f=G&amp;l=50&amp;s1=7,650,331.PN.&amp;OS=PN/7,650,331&amp;RS=PN/7,650,331" onclick="javascript:pageTracker._trackPageview('/patft.uspto.gov');">patent</a> for MapReduce. The first and most general claim is (formatting and emphasis mine):<span id="more-1565"></span></p>
<blockquote><p>A system for large-scale processing of data, comprising:</p>
<ul>
<li>a plurality of processes executing on a plurality of interconnected processors;</li>
<li>the plurality of processes including a master process, for coordinating a data processing job for processing a set of input data, and worker processes;</li>
<li>the master process, in response to a request to perform the data processing job, assigning input data blocks of the set of input data to respective ones of the worker processes;</li>
<li>each of a first plurality of the worker processes <strong>including an application-independent map module</strong> for retrieving a respective input data block assigned to the worker process by the master process and <strong>applying an application-specific map operation</strong> to the respective input data block to produce intermediate data values, wherein at least a subset of the intermediate data values each comprises a <strong>key/value pair,</strong> and wherein at least two of the first plurality of the worker processes operate simultaneously so as to perform the application-specific map operation in <strong>parallel</strong> on distinct, respective input data blocks;</li>
<li>a partition operator for processing the produced intermediate data values to produce a plurality of intermediate data sets, wherein each respective intermediate data set includes <strong>all key/value pairs for a distinct set of respective keys,</strong> and wherein at least one of the respective intermediate data sets includes respective ones of the key/value pairs produced by a plurality of the first plurality of the worker processes;</li>
<li>and each of a second plurality of the worker processes including <strong>an application-independent reduce module for retrieving data,</strong> the retrieved data comprising at least a subset of the key/value pairs from a respective intermediate data set of the plurality of intermediate data sets and applying <strong>an application-specific reduce operation</strong> to the retrieved data to produce final output data corresponding to the distinct set of respective keys in the respective intermediate data set of the plurality of intermediate data sets, and wherein at least two of the second plurality of the worker processes operate simultaneously so as to perform the application-specific reduce operation in <strong>parallel</strong> on multiple respective subsets of the produced intermediate data values.</li>
</ul>
</blockquote>
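<p>For readers who think better in code than in claim language, here is a minimal single-process Python sketch of the three stages the claim describes: a map step over input blocks, a partition step that routes every pair for a given key to one intermediate set, and a reduce step per set. The function names and the word-count task are my own illustrative assumptions, not anything taken from the patent, and a real system would run the map and reduce steps in parallel across worker processes rather than in a loop.</p>
<pre>
```python
from collections import defaultdict


def map_block(block):
    # Application-specific map: emit one (word, 1) pair per word in the block.
    return [(word, 1) for word in block.split()]


def partition(pairs, num_sets):
    # Route every pair for a given key to the same intermediate set,
    # so each set holds all values for a distinct group of keys.
    sets = [defaultdict(list) for _ in range(num_sets)]
    for key, value in pairs:
        sets[hash(key) % num_sets][key].append(value)
    return sets


def reduce_set(intermediate):
    # Application-specific reduce: sum the values collected for each key.
    return {key: sum(values) for key, values in intermediate.items()}


def run_job(blocks, num_sets=2):
    # Sequential stand-in for the master handing blocks to map workers,
    # then handing intermediate sets to reduce workers.
    pairs = [pair for block in blocks for pair in map_block(block)]
    result = {}
    for intermediate in partition(pairs, num_sets):
        result.update(reduce_set(intermediate))
    return result


counts = run_job(["a b a", "b c"])
```
</pre>
<p>Only <code>map_block</code> and <code>reduce_set</code> are specific to the application; the partitioning and the job driver would look the same for any other key/value task.</p>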
<p><em>The way a patent works is that you make a big claim and, just in case it&#8217;s later invalidated, you also make more specialized sub-claims. What&#8217;s more, in a software patent, you claim everything twice, once as a &#8220;system&#8221; and once as a &#8220;method.&#8221;</em></p>
<p>When a patent takes that long to issue and has a core claim that wordy, one can assume there was much back and forth with the PTO (Patent and Trademark Office) to whittle it down to something they felt they could approve. At a guess, I&#8217;d conjecture that the supposedly unique parts of the claim are concentrated in the areas I bolded above, and that the PTO doesn&#8217;t think the claim would be patentable unless most or all of them were included.</p>
<p>So should the claim have been approved anyway? Let&#8217;s consider prior art. <a href="../../../../../2009/10/06/oracle-mapreduce/">Oracle has long been able to parallelize &#224; la MapReduce</a>. I don&#8217;t see anything in the claim that isn&#8217;t preceded by what Oracle did, except maybe the emphasis on key/value pairs. (And the same statement applies to the other 15 claims in the patent, at least on a quick skim.) I forget the details of SenSage&#8217;s quasi-MapReduce, which also preceded the Google patent filing, but I imagine something similar would be true about it.</p>
<p>There is no doubt that Google popularized the ideas of MapReduce &#8212; which turns out to have been a worthy public service. In one great example of that popularization, <a href="http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf" onclick="javascript:pageTracker._trackPageview('/www.cs.stanford.edu');">the seminal paper on parallel data mining</a> is almost laughable in how it <a href="../../../../../2009/10/15/mapreduce-webinar-slides/">deviates from MapReduce key/value pair formalism</a> &#8212; but it still seems to have been inspired by Google&#8217;s MapReduce. But that&#8217;s a different matter; popularization != invention, even though there&#8217;s a certain connection between the two in patent law. Actually, Google also often does get credit for having &#8220;invented&#8221; MapReduce, including, regrettably, in the marketing materials of clients whom I can&#8217;t talk out of saying so and who might now be staring down the barrel of the Google patent (hello Aster); but again, saying something doesn&#8217;t make it enforceable in court.</p>
<p>So what it all boils down to is:</p>
<p><strong>Should Google&#8217;s patent on the idea of parallelizing the handling of sets of application-visible key/value pairs be regarded as valid?</strong></p>
<p>The United States PTO, which is paid to think about these things, has evidently decided Yes. I disagree. In simplest terms, my reason is that key/value pairs have been around for decades, and so:</p>
<p><strong>Anything which was known or obvious without special reference to key/value pairs doesn&#8217;t suddenly become non-obvious when key/value pairs are mixed in.</strong></p>
<p>If Google ever tries to enforce its MapReduce patent, I&#8217;m available as an expert witness for the other side.</p>
<p><strong><em>Related links</em></strong></p>
<ul>
<li><a href="http://gigaom.com/2010/01/19/why-hadoop-users-shouldnt-fear-googles-new-mapreduce-patent/" onclick="javascript:pageTracker._trackPageview('/gigaom.com');">GigaOm</a> and <a href="http://arstechnica.com/open-source/news/2010/01/googles-mapreduce-patent-what-does-it-mean-for-hadoop.ars" onclick="javascript:pageTracker._trackPageview('/arstechnica.com');">Ars Technica</a> on the Google MapReduce patent</li>
<li>Another <a href="http://www.dbms2.com/2010/01/15/vertica-sybase-ipatent-litigation/" >silly software patent</a> issue</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/02/11/google-mapreduce-patent/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Clearing up MapReduce confusion, yet again</title>
		<link>http://www.dbms2.com/2009/12/30/clearing-up-mapreduce-confusion-yet-again/</link>
		<comments>http://www.dbms2.com/2009/12/30/clearing-up-mapreduce-confusion-yet-again/#comments</comments>
		<pubDate>Wed, 30 Dec 2009 10:50:53 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[SenSage]]></category>
		<category><![CDATA[Splunk]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1371</guid>
		<description><![CDATA[I&#8217;m frustrated by a constant need &#8212; or at least urge   &#8212; to correct myths and errors about MapReduce. Let&#8217;s try one more time:

MapReduce was named and popularized &#8212; but not invented &#8212; by Google.
&#8220;MapReduce&#8221; variously refers to:

A programming paradigm
Execution engines that implement the programming paradigm
Distributed file systems that work with the execution [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m frustrated by a constant need &#8212; or at least urge <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  &#8212; to correct <a href="http://www.dbms2.com/2009/10/18/three-big-myths-about-mapreduce/" >myths and errors about MapReduce</a>. Let&#8217;s try one more time:<span id="more-1371"></span></p>
<ul>
<li>MapReduce was named and popularized &#8212; but not invented &#8212; by Google.</li>
<li>&#8220;MapReduce&#8221; variously refers to:
<ul>
<li>A programming paradigm</li>
<li>Execution engines that implement the programming paradigm</li>
<li>Distributed file systems that work with the execution engines</li>
</ul>
</li>
<li>In particular, Hadoop is a MapReduce execution engine that includes or is closely associated with HDFS (Hadoop Distributed File System).</li>
<li>MapReduce and analytic DBMS can interact in a number of different ways, including:
<ul>
<li>Tight integration between a DBMS and exposed MapReduce functionality, e.g. <a href="http://www.dbms2.com/2009/10/15/mapreduce-webinar-slides/" >Aster Data&#8217;s SQL/MapReduce</a> or Greenplum.</li>
<li>Integrated MapReduce &#8220;under the covers&#8221;, e.g. SenSage or <a href="http://www.dbms2.com/2009/10/06/oracle-mapreduce/" >Oracle</a>. This may or may not follow all the rules Google laid out for MapReduce, but it&#8217;s at least similar in spirit.</li>
<li>Looser coupling between DBMS and a MapReduce system, e.g. <a href="http://www.dbms2.com/2009/08/04/verticas-version-of-mapreduce-integration/" >Vertica/Hadoop</a>, in which MapReduce may or may not run on a different cluster than the DBMS.</li>
<li>Not at all, except perhaps insofar as a quasi-DBMS such as <a href="http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/" >Hive</a> is implemented over a MapReduce system such as Hadoop/HDFS.</li>
</ul>
</li>
<li>As predicted by <a href="http://www.strategicmessaging.com/monashs-first-law-of-commercial-semantics-explained/2009/01/09/" onclick="javascript:pageTracker._trackPageview('/www.strategicmessaging.com');">Monash&#8217;s First Law of Commercial Semantics</a>, different vendors have individual variants on those themes. For example, as per <a href="http://www.splunk.com/product" onclick="javascript:pageTracker._trackPageview('/www.splunk.com');">a registration-required white paper</a>, Splunk is moving to publicly expose a not-quite-complete form of MapReduce.</li>
<li>MapReduce implementations such as Hadoop are sometimes regarded as part of the <a href="http://www.dbms2.com/2009/12/12/legit-nosql-key-value-store/" >NoSQL</a> &#8220;movement&#8221;. When they are, many generalities about NoSQL &#8212; such as that it doesn&#8217;t deal with analytics &#8212; are falsified.</li>
<li>So far as I can tell, mainstream enterprise (as opposed to web, scientific, investment, etc.) data mining folks may be looking at MapReduce for data mining, but they haven&#8217;t done much to adopt it yet. Probably that&#8217;s because the outfits who have the greatest need are the same ones that have the largest sunk investments in more traditional ways of doing data mining.</li>
<li>Cloudera != Hadoop. On the other hand, if you want to use Hadoop, it makes a lot of sense to do business with Cloudera.</li>
<li>Non-DBMS MapReduce != Hadoop. On the other hand, Hadoop is the default choice for non-DBMS MapReduce.</li>
<li>MapReduce != Hadoop, period. DBMS-based MapReduce is also a legitimate technical strategy.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/12/30/clearing-up-mapreduce-confusion-yet-again/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Three big myths about MapReduce</title>
		<link>http://www.dbms2.com/2009/10/18/three-big-myths-about-mapreduce/</link>
		<comments>http://www.dbms2.com/2009/10/18/three-big-myths-about-mapreduce/#comments</comments>
		<pubDate>Sun, 18 Oct 2009 16:14:37 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Michael Stonebraker]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1135</guid>
		<description><![CDATA[Once again, I find myself writing and talking a lot about MapReduce.  But I suspect that MapReduce-related conversations would go better if we overcame three fairly common MapReduce myths:

MapReduce is something very new
MapReduce involves strict adherence to the Map-Reduce programming paradigm
MapReduce is a single technology

So let&#8217;s give it a try.
When Dave DeWitt and Mike [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Once again, I find myself writing and talking a lot about MapReduce.  But I suspect that MapReduce-related conversations would go better if we overcame three fairly common MapReduce myths:</p>
<ul>
<li>MapReduce is something very new</li>
<li>MapReduce involves strict adherence to the Map-Reduce programming paradigm</li>
<li>MapReduce is a single technology</li>
</ul>
<p style="margin-bottom: 0in;"><span id="more-1135"></span>So let&#8217;s give it a try.</p>
<p style="margin-bottom: 0in;">When Dave DeWitt and Mike Stonebraker leveled <a href="../2008/01/18/the-great-mapreduce-debate/">their famous blast at MapReduce</a>, many people thought they overstated their case. But one part of their story – one that both Mike and Dave say was most central to their case – was never effectively refuted, namely the claim that these ideas aren&#8217;t particularly new. I haven&#8217;t actually read enough computer science literature to have an independent opinion on that issue. But I&#8217;ll say this – claims from companies such as <a href="../2009/10/18/introduction-to-sensage/">SenSage</a>, <a href="../2009/10/06/oracle-mapreduce/">Oracle</a>, or <a href="../2009/10/18/technical-introduction-to-splunk/">Splunk</a> that “We&#8217;ve been doing MapReduce all along” seem pretty credible to me.</p>
<p style="margin-bottom: 0in;">True, what those companies were doing may not have looked exactly like the instant-classic MapReduce programming paradigm. But the same is true of many things almost everybody would agree count as MapReduce.  In particular, it is often not the case that you alternate Map and Reduce steps, each of whose outputs is a set of simple &lt;Key, Value&gt; pairs, with data redistributed based on Key at every step.</p>
<p style="margin-bottom: 0in;">Here are some examples of what I mean, drawn from <a href="http://www.asterdata.com/blog/index.php/2009/10/15/mastering-mapreduce/" onclick="javascript:pageTracker._trackPageview('/www.asterdata.com');">my recent MapReduce webinar</a>.</p>
<ul>
<li>If you do text indexing in MapReduce, your goal is to wind up with a text index. So at some point you Reduce to a pair &lt;WordName, {all the (DocumentID, offset) pairs for the whole corpus, suitably ordered}&gt;.  That&#8217;s a heckuva compound “Value”.</li>
<li>The goal of data mining is usually to estimate a rather small number of parameters based on a large overall data set, often – depending on algorithm – in the form of a single vector. When you do that in MapReduce, you partition data among nodes and calculate something on each node that is structured more or less like your final vector. So when it comes time for the Reduce, you just ship all of your vectors – one per node – to a single Reduce node, and do the appropriate math. Redistribution based on Key would be quite pointless.</li>
<li>When you sessionize clickstream logs in MapReduce, you may have just as many output records as input records. However, they now are reformatted, and might have a SessionID appended. In those cases, Reduce isn&#8217;t doing much in the way of reduction.</li>
<li>And as happens in some <a href="../2009/08/04/verticas-version-of-mapreduce-integration/">Vertica-Hadoop</a> use cases around mortgage trading, sometimes MapReduce can even make data sets vastly larger.</li>
</ul>
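<p style="margin-bottom: 0in;">The data mining case above can be sketched in a few lines of Python. This is a toy under stated assumptions, not anybody&#8217;s production algorithm: the function names, the sample data, and the choice of a one-variable least-squares fit are all mine. Each &#8220;node&#8221; boils its partition down to one small vector of sums, and a single Reduce step adds the vectors and does the math, with no redistribution by Key at all.</p>
<pre>
```python
def node_summary(rows):
    # Map side: one small vector per node, shaped like the final inputs:
    # [n, sum of x, sum of y, sum of x*x, sum of x*y].
    n = len(rows)
    sx = sum(x for x, _ in rows)
    sy = sum(y for _, y in rows)
    sxx = sum(x * x for x, _ in rows)
    sxy = sum(x * y for x, y in rows)
    return [n, sx, sy, sxx, sxy]


def reduce_summaries(summaries):
    # Reduce side: add the per-node vectors element-wise, then solve
    # the least-squares line y = intercept + slope * x.
    n, sx, sy, sxx, sxy = [sum(column) for column in zip(*summaries)]
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Two hypothetical partitions of (x, y) points lying on y = 2x + 1.
partitions = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
slope, intercept = reduce_summaries([node_summary(p) for p in partitions])
```
</pre>
<p style="margin-bottom: 0in;">However many rows each node holds, the Reduce node receives only five numbers per node, which is why keyed redistribution buys nothing here.</p>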
<p style="margin-bottom: 0in; font-style: normal;">By no means do I think this is a weakness of the MapReduce programming paradigm. Rather, I think it&#8217;s a MapReduce strength. But it&#8217;s not quite the way MapReduce has been promoted and explained to the IT public.</p>
<p style="margin-bottom: 0in; font-style: normal;">Finally: MapReduce, as commonly conceived, spans two different – albeit closely related – technology domains:</p>
<ul>
<li>Parallel programming</li>
<li>Distributed data management</li>
</ul>
<p style="margin-bottom: 0in; font-style: normal;">For example, I imagine Greenplum&#8217;s and Vertica&#8217;s MapReduce/SQL combined syntaxes are very similar to each other. But Vertica&#8217;s data management implementation of MapReduce, which relies on Hadoop, is very different from Greenplum&#8217;s, which is tied into the Greenplum DBMS. Similarly, non-DBMS MapReduce implementations are commonly associated with distributed file systems – notably HDFS (Hadoop Distributed File System) or Google&#8217;s internal GFS (Google File System). In those systems, the parallel language execution part should be aware of how the distributed file management part works – but perhaps that awareness can be pretty lightweight.</p>
<p style="margin-bottom: 0in; font-style: normal;">Right now, this is a distinction pretty much without a difference. If you choose an implementation of MapReduce &#8212; like pure Hadoop (say in the Cloudera distribution) or Hadoop-Vertica or Aster Data&#8217;s SQL/MapReduce – you&#8217;re basically picking an entire technology stack. But those stacks are going to do a whole lot of changing and maturing in the near future – and as they do, it&#8217;s likely that projects will interact or even combine in all sorts of interesting ways.</p>
<p style="margin-bottom: 0in; font-style: normal;"><strong>Bottom line: There are a lot of different ways to exploit MapReduce-related technology.</strong></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/18/three-big-myths-about-mapreduce/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Google Fusion Tables</title>
		<link>http://www.dbms2.com/2009/06/15/google-fusion-tables/</link>
		<comments>http://www.dbms2.com/2009/06/15/google-fusion-tables/#comments</comments>
		<pubDate>Mon, 15 Jun 2009 11:10:50 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=813</guid>
		<description><![CDATA[Google has announced an experimental cloud-based data management system called Fusion Tables. A press article and Slashdot thread ensued, based on some bizarre-sounding analyst quotes that I will not attempt to parse.
What Fusion Tables really seems to be is a spreadsheet without the formulae. That is, it&#8217;s a place to dump data in a grid [...]]]></description>
			<content:encoded><![CDATA[<p>Google has announced an <a href="http://googleresearch.blogspot.com/2009/06/google-fusion-tables.html" onclick="javascript:pageTracker._trackPageview('/googleresearch.blogspot.com');">experimental cloud-based data management system called Fusion Tables</a>. A <a href="http://www.itworld.com/saas/69183/watch-out-oracle-google-tests-cloud-based-database" onclick="javascript:pageTracker._trackPageview('/www.itworld.com');">press article</a> and <a href="http://developers.slashdot.org/article.pl?sid=09/06/12/1658206" onclick="javascript:pageTracker._trackPageview('/developers.slashdot.org');">Slashdot thread</a> ensued, based on some bizarre-sounding analyst quotes that I will not attempt to parse.</p>
<p>What Fusion Tables really seems to be is a spreadsheet without the formulae. That is, it&#8217;s a place to dump data in a grid of cells, comment on it, version it, and do elementary data manipulation.  This could, I guess, be useful as an alternative to traditional RDBMS &#8212; assuming, of course, that you want to have a row-by-row debate about 100 megs of data.</p>
<p>Seriously, while Google Fusion Tables bears some vague resemblance to what I&#8217;m thinking about for the future of both <a href="http://www.dbms2.com/2009/05/30/reinventing-business-intelligence/" >business intelligence</a> and <a href="http://www.dbms2.com/2009/06/08/the-future-of-data-marts/" >data marts</a>, it sounds as if it has a long way to go before it&#8217;s something most enterprises should spend time looking at.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/06/15/google-fusion-tables/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Reinventing business intelligence</title>
		<link>http://www.dbms2.com/2009/05/30/reinventing-business-intelligence/</link>
		<comments>http://www.dbms2.com/2009/05/30/reinventing-business-intelligence/#comments</comments>
		<pubDate>Sat, 30 May 2009 12:38:25 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Memory-centric data management]]></category>
		<category><![CDATA[Microsoft and SQL*Server]]></category>
		<category><![CDATA[SAP AG]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=794</guid>
		<description><![CDATA[I&#8217;ve felt for quite a while that business intelligence tools are due for a revolution. But I&#8217;ve found the subject daunting to write about because &#8212; well, because it&#8217;s so multifaceted and big.  So to break that logjam, here are some thoughts on the reinvention of business intelligence technology, with no pretense of being [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">I&#8217;ve felt for quite a while that business intelligence tools are due for a revolution. But I&#8217;ve found the subject daunting to write about because &#8212; well, because it&#8217;s so multifaceted and big.  So to break that logjam, here are some thoughts on the reinvention of business intelligence technology, with no pretense of being in any way comprehensive.</p>
<p style="margin-bottom: 0in;"><strong>Natural language and classic science fiction</strong></p>
<p style="margin-bottom: 0in;">Actually, there&#8217;s a pretty well-known example of BI near-perfection &#8212; <strong>the </strong><em><strong>Star Trek</strong></em><strong> computers,</strong> usually voiced by the late Majel Barrett Roddenberry. They didn&#8217;t have a big role in the recent movie, which was so fast-paced nobody had time to analyze very much, but were a big part of the <em>Star Trek</em> universe overall. <em>Star Trek&#8217;s</em> computers integrated analytics, operations, and authentication, all with a great natural language/voice interface and visual displays. That example is at the heart of <a href="http://www.texttechnologies.com/2009/05/30/men-are-from-earth-computers-are-from-vulcan/" onclick="javascript:pageTracker._trackPageview('/www.texttechnologies.com');">a 1998 article on natural language recognition I just re-posted</a>.</p>
<p style="margin-bottom: 0in;">As for reality: For decades, dating back at least to Artificial Intelligence Corporation&#8217;s Intellect, there have been offerings that provided<strong> &#8220;natural language&#8221; command, control, and query</strong> against otherwise fairly ordinary analytic tools. Such efforts have generally fizzled, for reasons outlined at the link above. Wolfram Alpha is the latest try; fortunately for its prospects, natural language is really only a small part of the Wolfram Alpha story.</p>
<p style="margin-bottom: 0in;">A second theme has more recently emerged &#8212; <strong>using text indexing to get at data more flexibly than a relational schema would normally allow,</strong> either by searching on data values themselves (stressed by <em>Attivio</em>) or more by searching on the definitions of pre-built reports (the Google OneBox story). SAP&#8217;s Explorer is the latest such view, but I find <a href="http://www.intelligententerprise.com/blog/archives/2009/05/explorer_seems.html#comments" onclick="javascript:pageTracker._trackPageview('/www.intelligententerprise.com');">Doug Henschen&#8217;s skepticism about SAP Explorer</a> more persuasive than <a href="http://www.intelligententerprise.com/blog/archives/2009/05/explorer_splash.html#comments" onclick="javascript:pageTracker._trackPageview('/www.intelligententerprise.com');">Cindi Howson&#8217;s cautiously favorable view</a>.  Partly that&#8217;s because I know SAP (and Business Objects); partly it&#8217;s because of difficulties such as those I already noted.</p>
<p style="margin-bottom: 0in;"><strong>Flexibility and data exploration</strong></p>
<p style="margin-bottom: 0in;">It&#8217;s a truism that each generation of dashboard-like technology fails because it&#8217;s too inflexible. Users are shown the information that will provide them with the most insight.  They appreciate it at first. But eventually it&#8217;s old hat, and when they want to do something new, the baked-in data model doesn&#8217;t support it.</p>
<p style="margin-bottom: 0in;">The latest attempts to overcome this problem lie in two overlapping trends &#8212; <strong>cool data exploration/visualization tools</strong> and <strong>in-memory analytics.</strong> <span id="more-794"></span>Tableau and Spotfire are known more for the former; hot BI vendor <a href="../2008/08/04/qliktech-qlikview-update/">QlikTech</a> is known for both. And many vendors &#8212; established or otherwise &#8212; are going to <a href="../2009/04/22/clearing-some-of-my-buffer/">in-memory OLAP</a>.</p>
<p style="margin-bottom: 0in;"><strong><span style="font-style: normal;">Collaboration and communication</span></strong></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">The reason I&#8217;m finally buckling down and posting on this subject is the announcement of <a href="http://www.texttechnologies.com/2009/05/29/google-wave-finally-a-microsoft-killer/" onclick="javascript:pageTracker._trackPageview('/www.texttechnologies.com');">Google Wave, which I think foreshadows a revolution in communication and collaboration technology</a>. Google Wave augurs two primary advances. First, it shows how to make email, instant messaging, microblogging, and so on much more useful. Second, Google Wave could evolve in a way that &#8212; finally &#8212;  makes it truly practical for end-users to set up ad-hoc mini-portals that combine arbitrary URL-possessing resources, exposed to arbitrary workgroups of people.</span></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">If and when both of those promises are fulfilled, it will become vastly easier for people to reason together about analytic questions.  That may take a little while, as Google Wave obviously wasn&#8217;t designed with business intelligence in mind. But whether from Google or from a frightened Microsoft redoubling its SharePoint efforts, there&#8217;s hope that we&#8217;ll see a leap forward in general collaboration technology. And since BI vendors are doing a generally decent job of exposing queries, charts and so on as portlets, it seems likely that business intelligence will benefit from the collaboration arms race.</span></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">That&#8217;s important. The first time I heard that reporting was as important for communication as for analytics was from Pilot Software a quarter-century or so ago, and it&#8217;s just as true now as it was then.  In its first incarnations, collaborative BI probably will be a little too dumb for my tastes, focusing more on mindless reporting and same-old KPIs than on deeper analysis.  Still, it&#8217;s a move in a good direction.</span></p>
<p style="margin-bottom: 0in;"><strong><span style="font-style: normal;">Other directions</span></strong></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">As I said at the beginning, I find it too daunting to try to cover all facets of this subject in one post. So I&#8217;ll leave out, at a minimum:</span></p>
<ul>
<li><span style="font-style: normal;"><a href="http://www.dbms2.com/2009/02/25/even-more-final-version-of-my-tdwi-slide-deck/" >Data warehousing performance and TCO</a>, which I of course write about extensively</span></li>
<li><span style="font-style: normal;"><a href="http://www.dbms2.com/2009/05/21/notes-on-cep-performance/" >Complex event/stream processing</a>, which I&#8217;ve written quite a bit about too</span></li>
<li><span style="font-style: normal;">Data mining and predictive analytics</span></li>
<li><span style="font-style: normal;">Operational BI</span></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">plus some hobby horses you probably don&#8217;t want to hear about anyway until I work out a better way of articulating my opinions.</span></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">But by all means please comment on what I&#8217;ve left out just as vigorously as on what I&#8217;ve included.  This post is just the first of many to come.</span></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/05/30/reinventing-business-intelligence/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
	</channel>
</rss>
