<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; MapReduce</title>
	<atom:link href="http://www.dbms2.com/category/parallelization/mapreduce/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Thu, 02 Sep 2010 09:06:44 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>The substance of Pentaho&#8217;s Hadoop strategy</title>
		<link>http://www.dbms2.com/2010/08/21/the-substance-of-pentahos-hadoop-strategy/</link>
		<comments>http://www.dbms2.com/2010/08/21/the-substance-of-pentahos-hadoop-strategy/#comments</comments>
		<pubDate>Sat, 21 Aug 2010 06:40:29 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Pentaho]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2848</guid>
		<description><![CDATA[Pentaho has been talking about a Hadoop-related strategy. Unfortunately, in support of its Hadoop efforts, Pentaho has been &#8212; quite insistently &#8212; saying things that don&#8217;t make a lot of sense to people who know anything about Hadoop.
That said, I think I found four sensible points in Pentaho&#8217;s Hadoop strategy, namely:

If you use an ETL [...]]]></description>
			<content:encoded><![CDATA[<p>Pentaho has been talking about a Hadoop-related strategy. Unfortunately, in support of its Hadoop efforts, Pentaho has been &#8212; quite insistently &#8212; saying things that don&#8217;t make a lot of sense to people who know anything about Hadoop.</p>
<p>That said, I think I found four sensible points in Pentaho&#8217;s Hadoop strategy, namely:</p>
<ol>
<li>If you use an ETL tool like Pentaho&#8217;s to move things in and out of HDFS, you may be able to orchestrate two more steps in the ETL process than if you used Hadoop&#8217;s native orchestration tools.</li>
<li>A lot of what you want to do in MapReduce is things that can be graphically specified in an ETL tool like Pentaho&#8217;s. (That would include tokenization or regex.)</li>
<li>If you have some really lightweight BI requirements (ad hoc, reporting, or whatever) against HDFS data, you might be content to do it straight against HDFS, rather than moving the data into a real DBMS. If so, BI tools like Pentaho&#8217;s might be useful.</li>
<li>Somebody might want to use a screwy version of MapReduce, where by &#8220;screwy&#8221; I mean anything that isn&#8217;t <a href="http://www.dbms2.com/2010/06/30/cloudera-enterprise-hadoop-evolution/" >Cloudera Enterprise</a>, <a href="http://www.dbms2.com/2009/12/02/mapreduce-for-complex-analytics-webina/" >Aster Data SQL/MapReduce</a>, or some other implementation/distribution with a lot of supporting tools. In that case, they might need all the tools they can get.</li>
</ol>
<p>The first of those points is, in the grand scheme of things, pretty trivial.</p>
<p>The third one makes sense. While Hadoop&#8217;s Hive client means you could roll your own integration with your own favorite BI tool in any case, having somebody certify it for you could be nice. So if Pentaho ships something that works before other vendors do, good on them. (Target date seems to be October.)</p>
<p>The fourth one is kind of sad.</p>
<p>But if there&#8217;s any shovel-meet-pony aspect to all this (or indeed a reason for writing this blog post), it would be the second point. If one understands data management, but is in the &#8220;Oh no! Hadoop wants me to PROGRAM!&#8221; crowd, then being able to specify one&#8217;s MapReduce graphically might be a really nice alternative to having to actually code it.</p>
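<p>For a sense of what &#8220;having to actually code it&#8221; means, here is a minimal hand-written sketch of the tokenization-plus-regex kind of job mentioned in the second point. It is plain Python standing in for a Hadoop job, not Pentaho&#8217;s or Hadoop&#8217;s API; the sample lines and the function names are made up for illustration.</p>
<pre>
```python
import re
from collections import defaultdict

# Hypothetical sample input; in Hadoop these lines would come from files in HDFS.
LINES = [
    "Hadoop wants me to PROGRAM",
    "Pentaho wants to help; Hadoop is fine with that",
]

TOKEN = re.compile(r"[A-Za-z]+")  # regex-based tokenizer


def map_step(line):
    """Map: emit a (token, 1) pair for every word in one line."""
    return [(token.lower(), 1) for token in TOKEN.findall(line)]


def reduce_step(pairs):
    """Shuffle and reduce: group the pairs by key and sum the counts."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def word_count(lines):
    """Run the map step over every line, then reduce all emitted pairs."""
    pairs = [pair for line in lines for pair in map_step(line)]
    return reduce_step(pairs)


counts = word_count(LINES)
print(counts)
```
</pre>
<p>Even this toy version has a tokenizer, a map function, and a reduce function to get right, which is roughly the work a graphical ETL tool would let one specify instead of write.</p>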
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/08/21/the-substance-of-pentahos-hadoop-strategy/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Big Data is Watching You!</title>
		<link>http://www.dbms2.com/2010/08/11/big-data-is-watching-you/</link>
		<comments>http://www.dbms2.com/2010/08/11/big-data-is-watching-you/#comments</comments>
		<pubDate>Wed, 11 Aug 2010 05:30:22 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2760</guid>
		<description><![CDATA[There&#8217;s a boom in large-scale analytics. The subjects of this analysis may be categorized as:

People
Financial trades
Electronic networks
Everything else

The most varied, interesting, and valuable of those four categories is the first one.

That may change some day, with the growing importance of machine-generated data, and of big-data science in particular. But I think it&#8217;s a fair assessment [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">There&#8217;s a boom in large-scale analytics. The subjects of this analysis may be categorized as:</p>
<ul>
<li>People</li>
<li>Financial trades</li>
<li>Electronic networks</li>
<li>Everything else</li>
</ul>
<p style="margin-bottom: 0in;">The most varied, interesting, and valuable of those four categories is the first one.</p>
<p><span id="more-2760"></span></p>
<p style="margin-bottom: 0in;"><em>That may change some day, with the growing importance of <a href="http://www.dbms2.com/2010/04/08/machine-generated-data-example/" >machine-generated data</a>, and of <a href="http://www.dbms2.com/2009/10/03/issues-in-scientific-data-management/" >big-data science</a> in particular. But I think it&#8217;s a fair assessment at the present, and for at least the next few years.</em></p>
<p style="margin-bottom: 0in;">Some of the most interesting use cases are concentrated in the areas of identifying individuals, groups of people, or behaviors of (groups of) people. For example:</p>
<ul>
<li>comScore works hard to <strong>identify individual web surfers</strong> (i.e. to <strong>deanonymize</strong> them), even though they may have given incomplete or false personal information.</li>
<li>Other companies at least try to 	figure out <strong>which information in a user&#8217;s profile is unreliable,</strong> so as to classify them better. (Yes, there are 62-year-old 	video-game-obsessed Lady Gaga fans, but that&#8217;s generally not the way 	to bet.)</li>
<li>Multiple telecom vendors try to 	identify who their <strong>most influential customers</strong> are (to a first 	approximation, they&#8217;re the ones most often called by the most 	people, but it surely gets more sophisticated than that). This 	information is then used to reduce churn, either by working hard to 	retain those users, or – if they do churn – to move very fast to 	retain the business from their friends.</li>
<li>Other kinds of companies do 	similar kinds of analysis, to the extent that they have enough of a 	social graph to do so. (This application is a case where the term 	“<a href="http://www.dbms2.com/2010/06/08/profile-of-revealed-preferences/" >social graph</a>” is not a misnomer.)</li>
<li><strong>Turing detectives</strong> (I just 	coined that phrase) try to determine whether users are humans or 	bots.</li>
<li>Central to detecting <strong>insurance 	fraud</strong> is identifying suspiciously close connections between 	claimants, service providers, and so on.</li>
<li>Identifying groups of people is 	also important in flagging <strong>insider trading.</strong><span style="font-weight: normal;"> Even more important are other kinds of analysis, along the lines of 	“is this normal innocent trading behavior?” </span></li>
<li><span style="font-weight: normal;">Intelligence 	agencies try to detect networks of </span><strong>terrorists</strong><span style="font-weight: normal;"> and their sympathizers. They further try to identify unusual 	patterns of communication or meetings along those networks that 	might indicate terrorist acts are being planned. (Civilian law 	enforcement agencies can use similar techniques.)</span></li>
</ul>
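<p>The telecom example&#8217;s first approximation, i.e. ranking customers by how many different people call them, can be sketched as follows. This is a minimal illustration in plain Python, not any vendor&#8217;s method; the call records and the function name are hypothetical, and real implementations surely get more sophisticated.</p>
<pre>
```python
from collections import defaultdict

# Hypothetical call detail records as (caller, callee) pairs.
CALLS = [
    ("ann", "dave"), ("bob", "dave"), ("cat", "dave"),
    ("ann", "dave"),  # repeat calls from the same person count once
    ("dave", "ann"), ("bob", "ann"),
    ("cat", "bob"),
]


def rank_by_distinct_callers(calls):
    """Score each customer by the number of distinct people who call them."""
    callers = defaultdict(set)
    for caller, callee in calls:
        callers[callee].add(caller)
    scores = {callee: len(people) for callee, people in callers.items()}
    # Highest score first; ties are broken alphabetically for a stable result.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_by_distinct_callers(CALLS)
print(ranking)
```
</pre>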
<p style="margin-bottom: 0in; font-weight: normal;">In most cases, the analysis and/or run-time execution of the relevant models is done with the help of analytic DBMS. Other technologies that come into play include non-DBMS MapReduce (Hadoop), graph engines, and CEP (Complex Event Processing). The vendor most heavily represented on that list is probably Aster Data, because:</p>
<ul>
<li>Aster Data is 	focused on hard-core analytics.</li>
<li>I talk a lot 	with Aster Data, and in particular had a long, detailed use-cases 	discussion with them last week.</li>
<li><span style="font-weight: normal;">The 	comScore example happens to come from a speaker at </span><a href="http://www.dbms2.com/2010/05/07/implications-onew-analytic-technology/" ><span style="font-weight: normal;">an 	Aster event</span></a><span style="font-weight: normal;"> I also 	participated in.</span></li>
</ul>
<p style="margin-bottom: 0in;">And by the way, all this only scratches the surface of what will be possible down the road. It&#8217;s based mainly on where you live, what you purchase, how you behave on websites, and who you communicate with. <a href="../2010/07/04/fair-data-use/">Other kinds of data, which could be used to be yet more intrusive</a>, generally aren&#8217;t involved.</p>
<p style="margin-bottom: 0in;"><span style="font-weight: normal;">I actually have two points in drawing up this list. One is golly-gee-whiz about how a lot of analytically sophisticated applications are actually getting into production. The other is to highlight the privacy and liberty threats If This Goes On Unchecked (which is why I didn&#8217;t include some other less-people-focused examples). There&#8217;s also a related danger that, to the extent we don&#8217;t get some smart regulations to keep us safe(r), we&#8217;ll get a bunch of stupid regulations instead. </span></p>
<p style="margin-bottom: 0in;">The Analytic Era has only just begun.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/08/11/big-data-is-watching-you/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Some interesting links</title>
		<link>http://www.dbms2.com/2010/07/23/some-interesting-links/</link>
		<comments>http://www.dbms2.com/2010/07/23/some-interesting-links/#comments</comments>
		<pubDate>Fri, 23 Jul 2010 09:04:48 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[EnterpriseDB and Postgres Plus]]></category>
		<category><![CDATA[Fun stuff]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Humor]]></category>
		<category><![CDATA[In-memory DBMS]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Memory-centric data management]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[SAP AG]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2626</guid>
		<description><![CDATA[In no particular order:  

Neil Raden points out that business intelligence dashboards can be dangerously misleading. His reasoning (sound) is that whatever you measure is apt to be distorted by the fact people know they&#8217;re being measured. His solution (implied) is to hire a good-looking consultant like himself to do it right.
I&#8217;ve had my issues [...]]]></description>
			<content:encoded><![CDATA[<p>In no particular order:  <span id="more-2626"></span></p>
<ul>
<li>Neil Raden points out that <a href="http://www.b-eye-network.com/channels/5083/view/9618/" onclick="javascript:pageTracker._trackPageview('/www.b-eye-network.com');">business intelligence dashboards can be dangerously misleading</a>. His reasoning (sound) is that whatever you measure is apt to be distorted by the fact people know they&#8217;re being measured. His solution (implied) is to hire a <a href="http://twitter.com/NeilRaden/status/19110492482" onclick="javascript:pageTracker._trackPageview('/twitter.com');">good-looking</a> consultant like himself to do it right.</li>
<li>I&#8217;ve had my issues with Fred Holahan, who was VP of Marketing when I posted that <a href="http://www.dbms2.com/2009/04/20/first-thoughts-on-oracle-acquiring-sun/" >EnterpriseDB was not to be trusted</a>. (That said, Fred is long gone from EnterpriseDB and my opinion hasn&#8217;t changed.) But he&#8217;s put up a good series of posts on the basis of the open source &#8220;progressive engagement&#8221; marketing funnel, including this gem on <a href="http://opensourceadvisory.com/wordpress/?p=860" onclick="javascript:pageTracker._trackPageview('/opensourceadvisory.com');">why you shouldn&#8217;t count on monetizing your community/free users</a>.</li>
<li><a href="http://tech.fortune.cnn.com/2010/07/22/oracle-plans-to-double-acquisition-budget/" onclick="javascript:pageTracker._trackPageview('/tech.fortune.cnn.com');">Oracle plans to increase its acquisition budget</a>. The figure given is $70 billion over the next 5 years. <em>Edit: But see this funny <a href="http://www.theregister.co.uk/2010/07/23/oracle_acquisition_budget/" onclick="javascript:pageTracker._trackPageview('/www.theregister.co.uk');">Register</a> followup.</em></li>
<li>Clayton Christensen wrote a phenomenal article on <a href="http://hbr.org/2010/07/how-will-you-measure-your-life/ar/1" onclick="javascript:pageTracker._trackPageview('/hbr.org');">how to live a good life</a>, from a very business-y perspective. (Only in one anecdote was it too religiously-oriented for my tastes.) Takeaways include:
<ul>
<li>Your core goals probably revolve around something other than business success. (E.g., family.) Don&#8217;t lose sight of that.</li>
<li>To the extent you&#8217;re a manager or leader, you may have a huge impact on other people&#8217;s lives. Use that power in admirable ways.</li>
<li>Teach people how to fish for answers, rather than just giving them answers. They&#8217;ll probably come to better conclusions than you would have anyway. (This is a core principle in my own consulting.)</li>
<li>Take time to reflect. And by the way, the same techniques you use for strategic analysis in business can be applied to your life as well.</li>
</ul>
</li>
<li><a href="http://www.bothsidesofthetable.com/2010/07/19/life-is-10-how-you-make-it-and-90-how-you-take-it/" onclick="javascript:pageTracker._trackPageview('/www.bothsidesofthetable.com');">Mark Suster</a> has a pretty good post expanding on my first Christensen takeaway, highlighting a point too often missing from articles in that genre: It&#8217;s not just family; it&#8217;s also all the cool things around us.</li>
<li>I haven&#8217;t gone through the <a href="http://developer.yahoo.com/events/hadoopsummit2010/agenda.html" onclick="javascript:pageTracker._trackPageview('/developer.yahoo.com');">Hadoop Summit archives</a> yet, but it looks as if there&#8217;s a lot of insight there about current Hadoop application activity.</li>
<li>If you&#8217;re a cat lover and don&#8217;t hate simple/traditional music, check out <a href="http://www.marcgunn.com/poetry/labels/cat_songs.shtml" onclick="javascript:pageTracker._trackPageview('/www.marcgunn.com');">Marc Gunn&#8217;s cat filksongs</a>, especially the infectious &#8220;What Shall We Do With a Catnipped Kitty?&#8221; and &#8220;Lord of the Pounce&#8221;, both playable from the right sidebar of that page (#7 and #10 respectively). Gunn is also a chief perpetrator of the justly (in)famous <a href="http://www.thebards.net/" onclick="javascript:pageTracker._trackPageview('/www.thebards.net');">Do Virgins Taste Better?</a> cycle of filksongs.</li>
<li>Former SAP exec Dennis Moore offers a theory as to <a href="http://dbmoore.blogspot.com/2010/05/why-is-in-memory-database-important-to.html" onclick="javascript:pageTracker._trackPageview('/dbmoore.blogspot.com');">why SAP cares so much about in-memory DBMS</a>. It&#8217;s to integrate business processes, because SAP has no other software layer good at doing same. Interestingly, Dennis originated SAP&#8217;s previous attempt at meeting a similar need via its composite applications initiative. However, in Dennis&#8217; view this benefit would only be achieved by a major rewrite of SAP&#8217;s applications.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/07/23/some-interesting-links/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Cloudera Enterprise and Hadoop evolution</title>
		<link>http://www.dbms2.com/2010/06/30/cloudera-enterprise-hadoop-evolution/</link>
		<comments>http://www.dbms2.com/2010/06/30/cloudera-enterprise-hadoop-evolution/#comments</comments>
		<pubDate>Wed, 30 Jun 2010 17:22:27 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Data integration and middleware]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Market share]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Pricing]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2440</guid>
		<description><![CDATA[I talked with Cloudera a couple of weeks ago in connection with the impending release of Cloudera Enterprise. I&#8217;d say:  

If you are or want to be a serious 	MapReduce user – and you&#8217;re past the “play around over the 	weekend” stage &#8212; you probably should have either:

A serious non-DBMS MapReduce 	distribution.
MapReduce integrated into your [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">I talked with Cloudera a couple of weeks ago in connection with the impending release of Cloudera Enterprise. I&#8217;d say:  <span id="more-2440"></span></p>
<ul>
<li>If you are or want to be a serious 	MapReduce user – and you&#8217;re past the “play around over the 	weekend” stage &#8212; you probably should have either:
<ul>
<li>A serious non-DBMS MapReduce 	distribution.</li>
<li>MapReduce integrated into your 	analytic DBMS.</li>
<li>Both.</li>
</ul>
</li>
<li>The obvious choice for non-DBMS 	MapReduce is Hadoop.</li>
<li>The obvious choice for a Hadoop 	distribution is <strong>Cloudera Enterprise.</strong></li>
<li>Cloudera Enterprise has three main 	aspects, in an inseparable bundle:
<ul>
<li>Distributions for a double-digit 	number of open source projects. It&#8217;s nice having all that in one 	package – unless, of course, you like playing with Tinkertoys.</li>
<li>Proprietary Cloudera code.</li>
<li>Cloudera support.</li>
</ul>
</li>
<li>Cloudera says its proprietary code is, and is planned to remain, concentrated (at least in large part) on integrating open source technology with closed source products. This has the virtue of being targeted directly at that segment of the market which has proven it&#8217;s actually willing to pay money for software.</li>
<li>Cloudera Enterprise areas of 	focus, now and in the presumed future, include:
<ul>
<li><strong>Core Hadoop engine,</strong> which 	Cloudera says is quite predictably and appropriately evolving more 	slowly than the tools around it.</li>
<li><strong>Development, management and 	administrative tools,</strong> including:
<ul>
<li><strong>Pig</strong> and <strong>Hive</strong>. Cloudera says &gt;70% 	of Facebook Hadoop jobs are initiated through Hive, and the same is 	true of Yahoo and Pig.</li>
<li>Connectivity to commercial tools.</li>
<li>The product formerly known as 	“Cloudera Desktop.”</li>
</ul>
</li>
<li><strong>Workflow</strong>, which in this context 	refers to letting you create a Hadoop application as a sequence of 	small steps, rather than forcing you to kluge it into being one 	unwieldy thing. At the moment, this is much less widely adopted than 	Pig and Hive, but Cloudera has high hopes for it, because of its 	obvious benefits in modularity and manageability.</li>
<li><strong>Quasi-DBMS technology.</strong> Besides Hive and Pig, this includes <strong>HBase.</strong> Cloudera says there has 	been considerable demand for HBase, and it is pleased that project 	is now mature enough to ship. Cloudera stresses that it intends 	HBase not for OLTP, but as an adjunct to analytic processing. E.g., 	Cloudera suggests HBase would be a fine vehicle for replicating 	dimension tables across each node of a cluster.</li>
<li><strong>Data connectivity, </strong><span style="font-weight: normal;">e.g. 	to MySQL or to sensor log files.</span></li>
</ul>
</li>
<li>Cloudera Enterprise pricing is 	well below DBMS prices – not by a full order of magnitude, if I&#8217;m 	right about everybody&#8217;s quantity discount policies, but even so by a 	lot. Details are NDA.</li>
</ul>
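<p>The workflow idea in the list above, i.e. building an application as a sequence of small steps rather than one unwieldy thing, can be sketched in miniature. This is a toy in plain Python under my own assumptions, not Cloudera&#8217;s workflow tooling; each function stands in for what would be a separate Hadoop job, and the data and names are hypothetical.</p>
<pre>
```python
from functools import reduce


def parse(records):
    """Step 1: split raw 'user,action' lines into tuples."""
    return [tuple(r.split(",")) for r in records]


def keep_clicks(events):
    """Step 2: filter down to click events."""
    return [e for e in events if e[1] == "click"]


def count_by_user(events):
    """Step 3: aggregate clicks per user."""
    counts = {}
    for user, _ in events:
        counts[user] = counts.get(user, 0) + 1
    return counts


def run_workflow(steps, data):
    """Feed the output of each step into the next one."""
    return reduce(lambda acc, step: step(acc), steps, data)


RAW = ["u1,click", "u2,view", "u1,click", "u2,click"]  # hypothetical input
result = run_workflow([parse, keep_clicks, count_by_user], RAW)
print(result)
```
</pre>
<p>The modularity benefit is that any one step can be tested, replaced, or rerun on its own, which is much harder when everything is kluged into a single job.</p>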
<p style="margin-bottom: 0in;">Cloudera sometimes sends confusing signals about its beliefs and strategies. For example, one can get different stories depending on whether one talks to:</p>
<ul>
<li>Somebody at Cloudera who comes 	primarily from the user and open source communities.</li>
<li>Somebody at Cloudera who has 	actually worked at a software company before.</li>
</ul>
<p style="margin-bottom: 0in;">But I predict that Cloudera will now stick for a while with more or less the strategy outlined above.</p>
<p style="margin-bottom: 0in;">Naturally, we also talked about Hadoop adoption. Highlights of that part – no doubt somewhat biased towards Cloudera&#8217;s own customer base &#8212; included:</p>
<ul>
<li>Notwithstanding <a href="http://www.dbms2.com/2009/04/14/ebay-thinks-mpp-dbms-clobber-mapreduce/" >eBay&#8217;s prior 	skepticism about MapReduce</a>, it is quoted saying nice things in a Cloudera press release, 	and has apparently become quite a large Hadoop user, starting out 	with a search-quality use case.</li>
<li>Typical Hadoop deployment sizes 	are 10 nodes or so when experimenting, 80-500+ in production.</li>
<li>10 terabytes/node – I&#8217;m pretty 	sure Cloudera meant of user data &#8212; is not inconceivable, so a 	cost-conscious 500-node user could have 5 petabytes of data managed 	by Hadoop.</li>
<li>Cloudera has half a dozen 	customers at the 75+ node production level.</li>
<li>Web and financial services are the 	two vertical markets moving most aggressively into Hadoop 	production. The government is also in significant Hadoop production, 	but the details of that are classified.</li>
<li>Web uses for Hadoop include:
<ul>
<li>Clickstream – sessionization, 	etc. – that&#8217;s a super-mainstream use.</li>
<li>Search – analyzing search 	attempts in conjunction with structured data.</li>
<li>Machine learning (for ad serving, 	etc.).</li>
</ul>
</li>
<li>Financial services uses for Hadoop 	include:
<ul>
<li>Internal trading rule 	enforcement/fraud detection.</li>
<li>Complex ETL.</li>
<li>Portfolio risk assessment 	(typically overnight).</li>
</ul>
</li>
</ul>
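<p>The capacity arithmetic in the list above is simple enough to check directly. A minimal sketch, assuming decimal units (1 petabyte = 1,000 terabytes) and the per-node figure quoted; the function name is made up.</p>
<pre>
```python
TB_PER_PB = 1000  # assumption: decimal units


def cluster_capacity_pb(nodes, tb_per_node):
    """Total managed data in petabytes for a cluster of identical nodes."""
    return nodes * tb_per_node / TB_PER_PB


# 500 nodes at 10 TB/node, the cost-conscious case from the list.
capacity = cluster_capacity_pb(nodes=500, tb_per_node=10)
# 80 nodes, the low end of the quoted production range, at the same density.
low_end = cluster_capacity_pb(nodes=80, tb_per_node=10)
print(capacity, low_end)
```
</pre>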
<p style="margin-bottom: 0in;">None of this is inconsistent with previous surveys of <a href="http://www.dbms2.com/2009/10/10/enterprises-using-hadoo/" >Hadoop use cases</a>.</p>
<p style="margin-bottom: 0in; font-style: normal;">Various users talked at the Hadoop Summit this week. I wasn&#8217;t there, and won&#8217;t write about their stories for now. That said, <a href="http://www.slideshare.net/kevinweil/hadoop-at-twitter-hadoop-summit-2010" onclick="javascript:pageTracker._trackPageview('/www.slideshare.net');">Twitter&#8217;s slide deck</a> from same has some interesting stuff, including:</p>
<ul>
<li><span style="font-style: normal;">7 	TB/day ETLed from MySQL.</span></li>
<li><span style="font-style: normal;">Petabytes-being-stored 	accordingly coming soon.</span></li>
<li><span style="font-style: normal;">Open 	sourcing their ETL tool Crane.</span></li>
<li><span style="font-style: normal;">3-4X 	LZO compression at little CPU cost.</span></li>
<li><span style="font-style: normal;">HBase is more usable for them than HDFS, which isn&#8217;t mutable enough.</span></li>
<li><span style="font-style: normal;">Pig 	= 5% of code and coding effort vs. vanilla Hadoop at 30% or less 	performance hit.</span></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/06/30/cloudera-enterprise-hadoop-evolution/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Clarifying the state of MPP in-database SAS</title>
		<link>http://www.dbms2.com/2010/05/07/in-database-sas-teradata-netezza-aster/</link>
		<comments>http://www.dbms2.com/2010/05/07/in-database-sas-teradata-netezza-aster/#comments</comments>
		<pubDate>Fri, 07 May 2010 06:23:49 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[SAS Institute]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Teradata]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2061</guid>
		<description><![CDATA[I routinely am briefed way in advance of products&#8217; introductions. For that reason and others, it can be hard for me to keep straight what&#8217;s been officially announced, introduced for test, introduced for general availability, vaguely planned for the indefinite future, and so on. Perhaps nothing has confused me more in that regard than the [...]]]></description>
			<content:encoded><![CDATA[<p>I routinely am briefed way in advance of products&#8217; introductions. For that reason and others, it can be hard for me to keep straight what&#8217;s been officially announced, introduced for test, introduced for general availability, vaguely planned for the indefinite future, and so on. Perhaps nothing has confused me more in that regard than the SAS Institute&#8217;s multi-year effort to get SAS integrated into various MPP DBMS, specifically <a href="http://www.dbms2.com/2009/08/02/teradata-13-focuses-on-advanced-analytic-performance/" >Teradata</a>, <a href="http://www.dbms2.com/2010/02/22/netezza-twinfin/" >Netezza Twinfin(i)</a>, and <a href="http://www.dbms2.com/2010/02/22/aster-data-ncluster-4-5/" >Aster Data nCluster</a>.</p>
<p>However, I chatted briefly Thursday with Michelle Wilkie, who is the SAS product manager overseeing all this (and also some other stuff, like SAS running on grids without being integrated into a DBMS). As best I understood, the story is:<span id="more-2061"></span></p>
<ul>
<li>On <strong>Teradata,</strong> SAS is shipping in-database scoring today. SAS also is shipping a limited amount of in-database modeling on Teradata, the count recently having gone up from 4 &#8220;procs&#8221; to 10.</li>
<li>On <strong>Netezza Twinfin(i),</strong> SAS is shipping in-database scoring, and this was recently announced. I can&#8217;t actually find much evidence of this announcement by searching the Web or the SAS website, but Michelle was pretty clear on the point even so.  Further confusing matters, <a href="http://www.sas.com/technologies/analytics/datamining/scoring_acceleration/" onclick="javascript:pageTracker._trackPageview('/www.sas.com');">SAS&#8217; website</a> seems to say in-database scoring is supported on Netezza&#8217;s old generation of products but not its latest one, even though SAS CTO Keith Collins told me <a href="http://www.dbms2.com/2009/09/03/sas-on-netezza-and-other-netezza-extensibility/" >exactly the opposite</a> would be true.</li>
<li>On <strong>Aster Data nCluster,</strong> SAS will ship in-database scoring by the end of 2010. If I understood correctly, this will be for &#8220;limited&#8221; rather than &#8220;general&#8221; availability, but Michelle framed that as a distinction without a difference. I.e., if you want to buy in-database SAS scoring on Aster nCluster, you&#8217;ll be able to.</li>
<li>(More) in-database SAS modeling is expected on all of Teradata, Netezza Twinfin(i), and Aster Data nCluster in the vague future. (The concept of 2011/2012 came into play.)</li>
<li>SAS/Teradata integration, developed first, involved more hand-coding. SAS has subsequently developed some kind of a more general parallelism/in-database capability, akin to what it has in the DBMS-less grid, that either is or isn&#8217;t a good match for DBMS vendors&#8217; native way of supporting parallel processing. (Obviously, I&#8217;m still pretty unclear on this part.)</li>
<li>SAS technology is a good fit for Aster Data&#8217;s MapReduce-centric way of doing parallelism.</li>
</ul>
<p>I also took the opportunity to ask Michelle a question I&#8217;ve had a heck of a time getting answered: <strong>What&#8217;s the big deal about in-database data mining scoring anyway?</strong> After all, the most common form of in-database data mining scoring is just to take a weighted sum of specific fields in a row, where the weights are the regression coefficients. You can do that in generic SQL, with performance that superficially should be at least as good as that for any alternative strategy. Michelle&#8217;s answers seemed to be twofold:</p>
<ul>
<li><strong>There are other kinds of scoring too</strong> &#8212; neural networks, etc.</li>
<li><strong>Coding the scoring in SQL isn&#8217;t that easy. </strong>Michelle gave the example of a specific user (default Netezza reference account, with initials resembling mine) that spent 400 hours writing and testing something you now get for free with SAS/Netezza integration.</li>
</ul>
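<p>For concreteness, the weighted-sum scoring described above looks roughly like this in generic SQL. This is a minimal sketch that uses Python&#8217;s built-in sqlite3 module as a stand-in for an MPP DBMS; the table, the column names, and the coefficients are all made up for illustration, and it says nothing about how SAS generates its own scoring code.</p>
<pre>
```python
import sqlite3

# Hypothetical regression coefficients: intercept plus one weight per field.
INTERCEPT, W_AGE, W_SPEND = 0.5, 0.25, 2.0

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, age REAL, spend REAL)")
conn.executemany(
    "INSERT INTO customers (id, age, spend) VALUES (?, ?, ?)",
    [(1, 40.0, 1.5), (2, 20.0, 0.25)],
)

# The score is a weighted sum of fields in the row, computed inside the database.
rows = conn.execute(
    "SELECT id, ? + ? * age + ? * spend AS score FROM customers ORDER BY id",
    (INTERCEPT, W_AGE, W_SPEND),
).fetchall()
conn.close()

print(rows)
```
</pre>
<p>A linear model with two inputs is the easy case. Michelle&#8217;s second point is presumably about models with many more fields, transformations, and edge cases, where hand-written SQL gets long and error-prone.</p>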
<p><em>Edit: In response to this post, SAS wrote in with <a href="http://www.dbms2.com/2010/05/15/further-clarifying-in-database-mpp-sas/" >further clarification about in-database and/or MPP SAS</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/05/07/in-database-sas-teradata-netezza-aster/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Aster Data&#8217;s mapreduce.org site</title>
		<link>http://www.dbms2.com/2010/04/18/aster-mapreduce-or/</link>
		<comments>http://www.dbms2.com/2010/04/18/aster-mapreduce-or/#comments</comments>
		<pubDate>Sun, 18 Apr 2010 20:56:10 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[MapReduce]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1918</guid>
		<description><![CDATA[Aster Data has started a site mapreduce.org, which purports to compile &#8220;the best information about MapReduce.&#8221; At the moment, mapreduce.org highlights include:

A feed of MapReduce-related posts from several blogs, including this one.
A calendar of MapReduce-related events, not necessarily Aster-specific, integrated with a feed combining &#8230;

&#8230; Aster MapReduce-related press releases and also &#8230;
&#8230; not necessarily Aster-specific [...]]]></description>
			<content:encoded><![CDATA[<p>Aster Data has started a site <a href="http://www.mapreduce.org/" onclick="javascript:pageTracker._trackPageview('/www.mapreduce.org');">mapreduce.org</a>, which purports to compile &#8220;the best information about MapReduce.&#8221; At the moment, mapreduce.org highlights include:</p>
<ul>
<li>A feed of MapReduce-related posts from several blogs, including this one.</li>
<li>A calendar of MapReduce-related events, not necessarily Aster-specific, integrated with a feed combining &#8230;
<ul>
<li>&#8230; Aster MapReduce-related press releases and also &#8230;</li>
<li>&#8230; not necessarily Aster-specific MapReduce-related press articles.</li>
</ul>
</li>
<li>Links to a lot of Aster Data MapReduce-related collateral. Some of that stuff is quite good.*</li>
<li>A sycophantic introduction from Colin White praising the value of the mapreduce.org &#8220;independent forum.&#8221;</li>
</ul>
<p><em>*I did a couple of <a href="http://www.dbms2.com/2009/10/15/mapreduce-webinar-slides/" >MapReduce-related webinars</a> for Aster late last year. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  But seriously &#8212; Aster does a good job of writing clear and informative collateral.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/04/18/aster-mapreduce-or/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Introduction to Datameer</title>
		<link>http://www.dbms2.com/2010/04/16/introduction-to-datameer/</link>
		<comments>http://www.dbms2.com/2010/04/16/introduction-to-datameer/#comments</comments>
		<pubDate>Sat, 17 Apr 2010 03:50:43 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Datameer]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1909</guid>
		<description><![CDATA[Elder care issues have flared up with a vengeance, so I&#8217;m not going to be blogging much for a while, and surely not at any length. That said, my first post about Datameer was never going to be very long, so let&#8217;s get right to it:

Datameer offers a business intelligence and analytics stack that runs [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Elder care issues have flared up with a vengeance, so I&#8217;m not going to be blogging much for a while, and surely not at any length. That said, my first post about Datameer was never going to be very long, so let&#8217;s get right to it:</p>
<ul>
<li>Datameer offers a business intelligence and analytics stack that runs on any distribution of Hadoop.</li>
<li>Datameer is still building a lot of features that it talks about, for target release in (I think) the fall.</li>
<li>Datameer&#8217;s pride and joy is its user interface. Very laudably for a software start-up, Datameer claims to have spent considerable time with professional user interface designers.</li>
<li>Datameer&#8217;s core user interface metaphor is formula definition via a spreadsheet.</li>
<li>Datameer includes 124 functions one can use in these formulae, ranging from math stuff to text tokenization.</li>
<li>Datameer does some straight BI, with 4 kinds of “visualization” headed for 20 kinds later. But if you want to do hard-core BI, use Datameer to dump data into an RDBMS and then use the BI tool of your choice. (Datameer&#8217;s messaging does tend to obscure or even contradict that point.)</li>
<li>Rather, Datameer seems to be designed for <span style="font-style: normal;">the classic </span><a href="http://www.dbms2.com/2008/08/26/known-applications-of-mapreduce/" >MapReduce use cases</a> of ETL and heavy data crunching.</li>
<li>Datameer&#8217;s messaging includes a bit about “Datameer is real-time, even though <a href="http://www.dbms2.com/2009/10/10/enterprises-using-hadoo/" >Hadoop is generally thought of as batch</a>.” So far as I can tell, what that boils down to is …</li>
<li>… Datameer will let you examine sample and/or partial query results before a full Hadoop run is over. Apparently, there are three different ways Datameer lets you do this:
<ul>
<li>You can truly query against a sample of the data set.</li>
<li>You can query against intermediate results, when only some stages of the Hadoop process have already been run.</li>
<li>You can drill down into a “distributed index,” whatever the heck that means when Datameer says it.</li>
</ul>
</li>
<li>Datameer will let you import data from 15 or so different kinds of sources, SQL, NoSQL, and file system alike.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/04/16/introduction-to-datameer/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>The Naming of the Foo</title>
		<link>http://www.dbms2.com/2010/03/13/the-naming-of-the-foo/</link>
		<comments>http://www.dbms2.com/2010/03/13/the-naming-of-the-foo/#comments</comments>
		<pubDate>Sat, 13 Mar 2010 22:47:06 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1703</guid>
		<description><![CDATA[Let&#8217;s start from some reasonable premises.

No technology category name is ever perfect.
It&#8217;s particularly hard to describe NoSQL (Not Only SQL) accurately, given the basic confusion as to what NoSQL is all about.
That said, it seems pretty clear that NoSQL is about making big websites (and perhaps other cloud-like installations) run and scale.
Dwight Merriman (founder/CEO of [...]]]></description>
			<content:encoded><![CDATA[<p>Let&#8217;s start from some reasonable premises.<span id="more-1703"></span></p>
<ul>
<li><a href="http://www.strategicmessaging.com/monashs-first-law-of-commercial-semantics-explained/2009/01/09/" onclick="javascript:pageTracker._trackPageview('/www.strategicmessaging.com');">No technology category name is ever perfect</a>.</li>
<li>It&#8217;s particularly hard to describe NoSQL (Not Only SQL) accurately, given <a href="http://www.dbms2.com/2009/11/23/boston-big-data-summit-keynote-outline/" >the basic confusion as to what NoSQL is all about</a>.</li>
<li>That said, it seems pretty clear that NoSQL is about making big websites (and perhaps other cloud-like installations) run and scale.</li>
<li>Dwight Merriman (founder/CEO of MongoDB vendor 10gen) is heading in the right direction when he says that the unifying ideas of NoSQL are that you do away with transactions and joins. But if he&#8217;s ever said something like “NoSQL is Foo without joins and transactions,” I don&#8217;t know what Foo is.</li>
<li><span style="font-style: normal;">Actually, I do know what Foo is – Foo is what happens when lots of people want to get small amounts each of information in or out of a database at the same time. I just don&#8217;t know what Foo is called.</span></li>
<li>Obviously, Foo is a lot like OLTP (OnLine Transaction Processing). However, it would be pretty silly for Foo to actually be OLTP, given that one of the core points of NoSQL is that you don&#8217;t have transactions.</li>
<li>It&#8217;s not just the “T” part of OLTP that&#8217;s fried. Calling something “OnLine” only makes sense as long as offline is an option, and offline transaction processing has been obsolete for a very long time.*</li>
</ul>
<p style="margin-bottom: 0in;"><em>*Sure, if you strain you can talk yourself into exceptions. But the point stands.</em></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">So we need a name for Foo, where Foo is what happens when</span><span style="font-style: normal;"><strong> lots of people want to get small amounts each of information in or out of a database at the same time.</strong></span><span style="font-style: normal;"> Thus, three major subcategories of more-or-less disk-based Foo are:</span></p>
<ul>
<li><span style="font-style: normal;">No-compromises ACID-compliant relational OLTP</span></li>
<li><span style="font-style: normal;">Sharded MySQL</span></li>
<li>NoSQL</li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">There may be some more purely memory-centric versions too, but let&#8217;s put those aside for the moment. </span></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">Absent a better idea, I can squeeze Foo into yet another four-letter acronym:</span></p>
<p style="margin-bottom: 0in;"><strong><span style="font-style: normal;">HVSP (High-Volume Simple Processing)</span></strong></p>
<p style="margin-bottom: 0in; font-style: normal;">That&#8217;s as imperfect as any other category name, and an awkward mouthful to boot. So I&#8217;d love to hear a better one; if you have such, please share it! In the meantime, I think “HVSP” has merit because:</p>
<ul>
<li><span style="font-style: normal;">The “Processing” part should be noncontroversial.</span></li>
<li>“<span style="font-style: normal;">High-Volume” is inherent to the challenge. If RDBMS scale well enough for your use case, using something less powerful is probably silly.* Similarly, while Oracle shines at high-volume OLTP workloads, there are many cheaper DBMS that do a fine job of OLTP at lower volumes.</span></li>
<li>“<span style="font-style: normal;">Simple” is the core principle of NoSQL systems, which drop joins and transactions as being too much foofarah. That only makes sense at all under the assumption that you have bone-simple queries and updates, so that programming around the lack of joins and transactions isn&#8217;t all that much of a burden.</span></li>
<li><span style="font-style: normal;">Something similar is true of sharded MySQL.</span></li>
<li><span style="font-style: normal;">Less obviously, “simple” is a core principle of relational OLTP as well. The point of the relational model is to cap the complexity of data operations, or more precisely to hide that complexity from programmers.</span></li>
<li><span style="font-style: normal;">And overloading the word “simple” a bit, it&#8217;s fair to say that if you&#8217;re reading or writing one record at a time, you&#8217;re doing something relatively simple, at least as opposed to what you do in analytic processing. The OLTP vs. OLAP distinction is preserved in this name change.</span></li>
<li><span style="font-style: normal;">The whole thing matches my definition above, namely &#8220;what happens when lots of people want to get small amounts each of information in or out of a database at the same time.&#8221;</span></li>
</ul>
<p style="margin-bottom: 0in;"><em>*Assuming, of course, that rows-and-tables are a good metaphor for your data structure in the first place.</em></p>
<p style="margin-bottom: 0in; font-style: normal;">Systems I&#8217;m leaving out of the HVSP and hence also NoSQL categories include:</p>
<ul>
<li><span style="font-style: normal;"><strong>Hadoop and other batch-oriented MapReduce.</strong></span><span style="font-style: normal;"> Hadoop isn&#8217;t part of NoSQL. I&#8217;m pretty sure that </span><a href="http://twitter.com/mikeolson/status/10388695185" onclick="javascript:pageTracker._trackPageview('/twitter.com');">Cloudera CEO Mike Olson</a><span style="font-style: normal;"> agrees with me.</span></li>
<li><span style="font-style: normal;"><span style="font-weight: normal;">More generally, </span></span><span style="font-style: normal;"><strong>non-SQL data stores that don&#8217;t meet the HVSP criteria.</strong></span><span style="font-style: normal;"> Dave Kellogg stretches things when he claims that <a href="http://www.kellblog.com/2010/03/10/ieee-computer-society-article-on-nosql-an-executive-level-overview/" onclick="javascript:pageTracker._trackPageview('/www.kellblog.com');">MarkLogic is a NoSQL system</a>. (But then, that was in a post where he seemingly praised </span><a href="http://www.dbms2.com/2009/12/11/nosql-q-and-a/" >a train wreck of an article</a><span style="font-style: normal;">.)</span></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">But hey – what good is a categorization if it doesn&#8217;t leave some things out?</span></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/03/13/the-naming-of-the-foo/feed/</wfw:commentRss>
		<slash:comments>32</slash:comments>
		</item>
		<item>
		<title>TwinFin(i) – Netezza&#8217;s version of a parallel analytic platform</title>
		<link>http://www.dbms2.com/2010/02/22/netezza-twinfin/</link>
		<comments>http://www.dbms2.com/2010/02/22/netezza-twinfin/#comments</comments>
		<pubDate>Mon, 22 Feb 2010 08:21:13 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[SAS Institute]]></category>
		<category><![CDATA[Teradata]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1613</guid>
		<description><![CDATA[Much like Aster Data did in Aster 4.0 and now Aster 4.5, Netezza is announcing a general parallel big data analytic platform strategy. It is called Netezza TwinFin(i), it is a chargeable option for the Netezza TwinFin appliance, and many announced details are on the vague side, with Netezza promising more clarity at or before [...]]]></description>
			<content:encoded><![CDATA[<p>Much like Aster Data did in <a href="http://www.dbms2.com/2009/10/30/aster-data-application-server-ncluster/" >Aster 4.0</a> and now <a href="http://www.dbms2.com/2010/02/22/aster-data-ncluster-4-5/" >Aster 4.5</a>, Netezza is announcing a general parallel big data analytic platform strategy. It is called Netezza TwinFin(i), it is a chargeable option for the <a href="http://www.dbms2.com/2009/07/30/netezza-new-product-family/" >Netezza TwinFin</a> appliance, and many announced details are on the vague side, with Netezza promising more clarity at or before its Enzee Universe conference in June. At a high level, the Aster and Netezza approaches compare/contrast as follows:<span id="more-1613"></span></p>
<ul>
<li>Netezza&#8217;s software runs on well-designed proprietary hardware. Aster runs on hardware that&#8217;s more off-the-shelf.</li>
<li>Aster was first to ship, and will also be first to ship an IDE (Integrated Development Environment).</li>
<li>MapReduce is central to Aster&#8217;s approach. Netezza TwinFin(i) supports MapReduce too, specifically a Hadoop implementation, but I don&#8217;t get the sense that everything Netezza does is built on MapReduce underpinnings.</li>
<li>Both Aster and Netezza try to provide rich functionality for creating in-memory data structures parallel analytic programs can use. Both seem to let you escape from the pure relational-table paradigm more easily than, say, Teradata&#8217;s new persistent memory capabilities do.</li>
<li>Aster and Netezza have made different choices about what kinds of prebuilt analytic packages to offer. Netezza could actually leapfrog Aster in this regard, but let&#8217;s see where each vendor is by, say, mid-year. If you care about the details of built-in analytic functions, you really should consider executing non-disclosure agreements with both those companies.</li>
<li>Both Aster and Netezza stress that you can run analytic functions out-of-process, greatly reducing the chance that they crash the database. Netezza, and I&#8217;m pretty sure Aster as well, retain the option of running in-process, which provides maximum performance. (In Netezza&#8217;s case C++ is the only in-process language supported, and I think Aster has a similar limitation.)</li>
<li>Like Aster, Netezza is integrating SQL queries and other analytic processing under the same workload management rubric.</li>
<li>Much like Aster, Netezza is tap-dancing by implying much richer forthcoming SAS support than anything currently announced. (The crunch-per-paragraph ratio in either vendor&#8217;s SAS-related press releases to date is distressingly low.)</li>
</ul>
<p>More specifically, here are some highlights of what I know, am guessing, and/or am allowed to say about Netezza TwinFin(i) at this time.</p>
<ul>
<li>The foundation for the analytic add-ons in Netezza TwinFin(i) is some sort of low-level “analytic executables.” Not understanding exactly what these are is my biggest area of confusion in the whole TwinFin(i) stack. Are they all C++, with everything translated into same? Is there Java all the way down as an alternative? (E.g., Hadoop is written in Java.) Anyhow, whatever it is, it&#8217;s surely a big improvement on <a href="../../../../../2007/09/27/the-netezza-developer-network/">Netezza&#8217;s prior Verilog-based generation of analytic extensibility technology</a>.</li>
<li>The announced list of languages supported in Netezza TwinFin(i) is Java, Python, Fortran, R, and C/C++. More are coming.</li>
<li>Netezza has named a lot of analytic functions it is adding, and is hinting at more to come. It has named <a href="http://cran.r-project.org/" onclick="javascript:pageTracker._trackPageview('/cran.r-project.org');">CRAN/R</a> and GNU libraries, saying those have 1900 or more functions each. Netezza has also built its own linear algebra library for TwinFin(i), called nzMatrix. And as previously noted, TwinFin(i) also boasts a Hadoop implementation.</li>
<li>I haven&#8217;t heard about much in the way of TwinFin(i)-specific IDE support.</li>
<li>I don&#8217;t really have details as to what kinds of in-memory data structures Netezza TwinFin(i) does or doesn&#8217;t support.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/02/22/netezza-twinfin/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>More patent nonsense &#8212; Google MapReduce</title>
		<link>http://www.dbms2.com/2010/02/11/google-mapreduce-patent/</link>
		<comments>http://www.dbms2.com/2010/02/11/google-mapreduce-patent/#comments</comments>
		<pubDate>Thu, 11 Feb 2010 19:29:57 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Google]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Parallelization]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1565</guid>
		<description><![CDATA[Google recently received a patent for MapReduce. The first and most general claim is (formatting and emphasis mine):
A system for large-scale processing of data, comprising:

a plurality of processes executing on a plurality of interconnected processors;
the plurality of processes including a master process, for coordinating a data processing job for processing a set of input data, [...]]]></description>
			<content:encoded><![CDATA[<p>Google recently received a <a href="http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&amp;Sect2=HITOFF&amp;d=PALL&amp;p=1&amp;u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&amp;r=1&amp;f=G&amp;l=50&amp;s1=7,650,331.PN.&amp;OS=PN/7,650,331&amp;RS=PN/7,650,331" onclick="javascript:pageTracker._trackPageview('/patft.uspto.gov');">patent</a> for MapReduce. The first and most general claim is (formatting and emphasis mine):<span id="more-1565"></span></p>
<blockquote><p>A system for large-scale processing of data, comprising:</p>
<ul>
<li>a plurality of processes executing on a plurality of interconnected processors;</li>
<li>the plurality of processes including a master process, for coordinating a data processing job for processing a set of input data, and worker processes;</li>
<li>the master process, in response to a request to perform the data processing job, assigning input data blocks of the set of input data to respective ones of the worker processes;</li>
<li>each of a first plurality of the worker processes <strong>including an application-independent map module</strong> for retrieving a respective input data block assigned to the worker process by the master process and <strong>applying an application-specific map operation</strong> to the respective input data block to produce intermediate data values, wherein at least a subset of the intermediate data values each comprises a <strong>key/value pair,</strong> and wherein at least two of the first plurality of the worker processes operate simultaneously so as to perform the application-specific map operation in <strong>parallel</strong> on distinct, respective input data blocks;</li>
<li>a partition operator for processing the produced intermediate data values to produce a plurality of intermediate data sets, wherein each respective intermediate data set includes <strong>all key/value pairs for a distinct set of respective keys,</strong> and wherein at least one of the respective intermediate data sets includes respective ones of the key/value pairs produced by a plurality of the first plurality of the worker processes;</li>
<li>and each of a second plurality of the worker processes including <strong>an application-independent reduce module for retrieving data,</strong> the retrieved data comprising at least a subset of the key/value pairs from a respective intermediate data set of the plurality of intermediate data sets and applying <strong>an application-specific reduce operation</strong> to the retrieved data to produce final output data corresponding to the distinct set of respective keys in the respective intermediate data set of the plurality of intermediate data sets, and wherein at least two of the second plurality of the worker processes operate simultaneously so as to perform the application-specific reduce operation in <strong>parallel</strong> on multiple respective subsets of the produced intermediate data values.</li>
</ul>
</blockquote>
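<p>For readers who find the claim language hard to follow, the structure it describes can be sketched in a few lines of Python. This is only an illustration under loud assumptions: it runs sequentially in one process rather than across parallel workers, there is no master process, and the word-count map and reduce functions and all names are hypothetical. It is not Google&#8217;s implementation or Hadoop&#8217;s.</p>

```python
from collections import defaultdict


def map_word_count(block):
    """Application-specific map: emit a (word, 1) pair per word in a block."""
    return [(word, 1) for word in block.split()]


def reduce_word_count(key, values):
    """Application-specific reduce: sum the counts collected for one key."""
    return key, sum(values)


def partition(pairs, num_partitions):
    """Group pairs so that all pairs for a given key land in one partition."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        # Python's string hash varies between runs, but is stable within one.
        partitions[hash(key) % num_partitions][key].append(value)
    return partitions


def run_job(blocks, map_fn, reduce_fn, num_partitions=2):
    """Application-independent driver, run sequentially for clarity."""
    intermediate = [pair for block in blocks for pair in map_fn(block)]
    output = {}
    for part in partition(intermediate, num_partitions):
        for key, values in part.items():
            out_key, out_value = reduce_fn(key, values)
            output[out_key] = out_value
    return output


if __name__ == "__main__":
    print(run_job(["a b a", "b c"], map_word_count, reduce_word_count))
```

<p>The split between <code>run_job</code> and the two word-count functions mirrors the claim&#8217;s distinction between application-independent modules and application-specific operations.</p>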
<p><em>The way a patent works is that you make a big claim and, just in case it&#8217;s later invalidated, you also make more specialized sub-claims. What&#8217;s more, in a software patent, you claim everything twice, once as a &#8220;system&#8221; and once as a &#8220;method.&#8221;</em></p>
<p>When a patent takes that long to issue and has a core claim that wordy, one can assume there was much back and forth with the PTO (Patent and Trademark Office) to whittle it down to something they felt they could approve. At a guess, I&#8217;d conjecture that the supposedly unique parts of the claim are concentrated in the areas I bolded above, and that the PTO doesn&#8217;t think the claim would be patentable unless most or all of them were included.</p>
<p>So should the claim have been approved even so? Let&#8217;s consider prior art. <a href="../../../../../2009/10/06/oracle-mapreduce/">Oracle has long been able to parallelize à la MapReduce</a>. I don&#8217;t see anything in the claim that isn&#8217;t preceded by what Oracle did, except maybe the emphasis on key/value pairs. (And the same statement applies to the other 15 claims in the patent, at least on a quick skim.) I forget the details of SenSage&#8217;s quasi-MapReduce, which also preceded the Google patent filing, but I imagine something similar would be true about it.</p>
<p>There is no doubt that Google popularized the ideas of MapReduce &#8212; which turns out to have been a worthy public service. In one great example of that popularization, <a href="http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf" onclick="javascript:pageTracker._trackPageview('/www.cs.stanford.edu');">the seminal paper on parallel data mining</a> is almost laughable in how it <a href="../../../../../2009/10/15/mapreduce-webinar-slides/">deviates from MapReduce key/value pair formalism</a> &#8212; but it still seems to have been inspired by Google&#8217;s MapReduce. But that&#8217;s a different matter; popularization != invention, even though there&#8217;s a certain connection between the two in patent law. Actually, Google also often does get credit for having &#8220;invented&#8221; MapReduce, including regrettably in the marketing materials of clients I can&#8217;t talk out of saying that and which now might be looking into the barrel of the Google patent (hello Aster); but again, saying something doesn&#8217;t make it enforceable in court.</p>
<p>So what it all boils down to is:</p>
<p><strong>Should Google&#8217;s patent on the idea of parallelizing the handling of sets of application-visible key/value pairs be regarded as valid?</strong></p>
<p>The United States PTO, which is paid to think about these things, has evidently decided Yes. I disagree. In simplest terms, my reason is that key/value pairs have been around for decades, and so:</p>
<p><strong>Anything which was known or obvious without special reference to key/value pairs doesn&#8217;t suddenly become non-obvious when key/value pairs are mixed in.</strong></p>
<p>If Google ever tries to enforce its MapReduce patent, I&#8217;m available as an expert witness for the other side.</p>
<p><strong><em>Related links</em></strong></p>
<ul>
<li><a href="http://gigaom.com/2010/01/19/why-hadoop-users-shouldnt-fear-googles-new-mapreduce-patent/" onclick="javascript:pageTracker._trackPageview('/gigaom.com');">GigaOm</a> and <a href="http://arstechnica.com/open-source/news/2010/01/googles-mapreduce-patent-what-does-it-mean-for-hadoop.ars" onclick="javascript:pageTracker._trackPageview('/arstechnica.com');">Ars Technica</a> on the Google MapReduce patent</li>
<li>Another <a href="http://www.dbms2.com/2010/01/15/vertica-sybase-ipatent-litigation/" >silly software patent</a> issue</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/02/11/google-mapreduce-patent/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
	</channel>
</rss>
