<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS2 -- DataBase Management System Services &#187; MapReduce</title>
	<atom:link href="http://www.dbms2.com/category/parallelization/mapreduce/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Thu, 18 Mar 2010 05:19:19 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>The Naming of the Foo</title>
		<link>http://www.dbms2.com/2010/03/13/the-naming-of-the-foo/</link>
		<comments>http://www.dbms2.com/2010/03/13/the-naming-of-the-foo/#comments</comments>
		<pubDate>Sat, 13 Mar 2010 22:47:06 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Mark Logic]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1703</guid>
		<description><![CDATA[Let&#8217;s start from some reasonable premises.

No technology category name is 	ever perfect.
It&#8217;s particularly hard to describe 	NoSQL (Not Only SQL) accurately, given the basic confusion as to 	what NoSQL is all about.
That said, it 	seems pretty clear that NoSQL is about making big websites (and 	perhaps other cloud-like installations) run and scale.
Dwight Merriman (founder/CEO of [...]]]></description>
			<content:encoded><![CDATA[<p>Let&#8217;s start from some reasonable premises.<span id="more-1703"></span></p>
<ul>
<li><a href="http://www.strategicmessaging.com/monashs-first-law-of-commercial-semantics-explained/2009/01/09/" onclick="javascript:pageTracker._trackPageview('/www.strategicmessaging.com');">No technology category name is 	ever perfect</a>.</li>
<li>It&#8217;s particularly hard to describe 	NoSQL (Not Only SQL) accurately, given <a href="http://www.dbms2.com/2009/11/23/boston-big-data-summit-keynote-outline/" >the basic confusion as to 	what NoSQL is all about</a>.</li>
<li>That said, it 	seems pretty clear that NoSQL is about making big websites (and 	perhaps other cloud-like installations) run and scale.</li>
<li>Dwight Merriman (founder/CEO of 	MongoDB vendor 10gen) is heading in the right direction when he says 	that the unifying ideas of NoSQL are that you do away with 	transactions and joins. But if he&#8217;s ever said something like “NoSQL 	is Foo without joins and transactions,” I don&#8217;t know what Foo is.</li>
<li><span style="font-style: normal;">Actually, 	I do know what Foo is – Foo is what happens when lots of people 	want to get small amounts each of information in or out of a 	database at the same time. I just don&#8217;t know what Foo is called.</span></li>
<li>Obviously, Foo is a lot like OLTP 	(OnLine Transaction Processing). However, it would be pretty silly 	for Foo to actually be OLTP, given that one of the core points of 	NoSQL is that you don&#8217;t have transactions.</li>
<li>It not just the “T” part of 	OLTP that&#8217;s fried.  Calling something “OnLine” only makes sense 	as long as offline is an option, and offline transaction processing 	has been obsolete for a very long time.*</li>
</ul>
<p style="margin-bottom: 0in;"><em>*Sure, if you strain you can talk yourself into exceptions. But the point stands.</em></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">So we need a name for Foo, where Foo is what happens when</span><span style="font-style: normal;"><strong> lots of people want to get small amounts each of information in or out of a database at the same time.</strong></span><span style="font-style: normal;"> Thus, three major subcategories of more-or-less disk-based Foo are:</span></p>
<ul>
<li><span style="font-style: normal;">No-compromises 	ACID-compliant relational OLTP</span></li>
<li><span style="font-style: normal;">Sharded 	MySQL</span></li>
<li>NoSQL</li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">There may be some more purely memory-centric versions too, but let&#8217;s put those aside for the moment. </span></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">Absent a better idea, I can squeeze Foo into yet another four-letter acronym:</span></p>
<p style="margin-bottom: 0in;"><strong><span style="font-style: normal;">HVSP (High-Volume Simple Processing)</span></strong></p>
<p style="margin-bottom: 0in; font-style: normal;">That&#8217;s as imperfect as any other category name, and an awkward mouthful to boot. So I&#8217;d love to hear a better one; if you have such, please share it!  In the mean time, I think “HVSP” has merit because:</p>
<ul>
<li><span style="font-style: normal;">The 	“Processing” part should be noncontroversial.</span></li>
<li>“<span style="font-style: normal;">High-Volume” 	is inherent to the challenge. If RDBMS scale well enough for your 	use case, using something less powerful is probably silly.*  	Similarly, while Oracle shines at high-volume OLTP workloads, there 	are many cheaper DBMS that do a fine job of OLTP at lower volumes.</span></li>
<li>“<span style="font-style: normal;">Simple” 	is the core principle of NoSQL systems, which drop joins and 	transactions as being too much foofarah.  That only makes sense at 	all under the assumption that you have bone-simple queries and 	updates, so that programming around the lack of joins and 	transactions isn&#8217;t all that much of a burden.</span></li>
<li><span style="font-style: normal;">Something 	similar is true of sharded MySQL.</span></li>
<li><span style="font-style: normal;">Less 	obviously, “simple” is a core principle of relational OLTP as 	well. The point of the relational model is to cap the complexity of 	data operations, or more precisely to hide that complexity from 	programmers.</span></li>
<li><span style="font-style: normal;">And 	overloading the word “simple” a bit, it&#8217;s fair to say that if 	you&#8217;re reading or writing one record at a time, you&#8217;re doing 	something relatively simple, at least as opposed to what you do in 	analytic processing. The OLTP vs. OLAP distinction is preserved in 	this name change.</span></li>
<li><span style="font-style: normal;">The whole thing matches my definition above, namely &#8220;what happens when lots of people want to get small amounts each of information in or out of a database at the same time.&#8221;</span></li>
</ul>
<p style="margin-bottom: 0in;"><em>*Assuming, of course, that rows-and-tables are a good metaphor for your data structure in the first place.</em></p>
<p style="margin-bottom: 0in; font-style: normal;">Systems I&#8217;m leaving out of the HVSP and hence also NoSQL categories include:</p>
<ul>
<li><span style="font-style: normal;"><strong>Hadoop 	and other batch-oriented MapReduce.</strong></span><span style="font-style: normal;"> Hadoop isn&#8217;t part of NoSQL. I&#8217;m pretty sure that </span><a href="http://twitter.com/mikeolson/status/10388695185" onclick="javascript:pageTracker._trackPageview('/twitter.com');">Cloudera 	CEO Mike Olson</a><span style="font-style: normal;"> agrees with me.</span></li>
<li><span style="font-style: normal;"><span style="font-weight: normal;">More 	generally, </span></span><span style="font-style: normal;"><strong>non-SQL 	data stores that don&#8217;t meet the HVSP criteria.</strong></span><span style="font-style: normal;"> Dave Kellogg stretches things when he claims that <a href="http://www.kellblog.com/2010/03/10/ieee-computer-society-article-on-nosql-an-executive-level-overview/" onclick="javascript:pageTracker._trackPageview('/www.kellblog.com');">MarkLogic 	is a NoSQL system</a>. (But then, that was in a post where he 	seemingly praised </span><a href="http://www.dbms2.com/2009/12/11/nosql-q-and-a/" >a train wreck of an article</a><span style="font-style: normal;">.)</span></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">But hey – what good is a categorization if it doesn&#8217;t leave some things out?</span></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/03/13/the-naming-of-the-foo/feed/</wfw:commentRss>
		<slash:comments>25</slash:comments>
		</item>
		<item>
		<title>TwinFin(i) – Netezza&#8217;s version of a parallel analytic platform</title>
		<link>http://www.dbms2.com/2010/02/22/netezza-twinfin/</link>
		<comments>http://www.dbms2.com/2010/02/22/netezza-twinfin/#comments</comments>
		<pubDate>Mon, 22 Feb 2010 08:21:13 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[SAS Institute]]></category>
		<category><![CDATA[Teradata]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1613</guid>
		<description><![CDATA[Much like Aster Data did in Aster 4.0 and now Aster 4.5, Netezza is announcing a general parallel big data analytic platform strategy. It is called Netezza TwinFin(i), it is a chargeable option for the Netezza TwinFin appliance, and many announced details are on the vague side, with Netezza promising more clarity at or before [...]]]></description>
			<content:encoded><![CDATA[<p>Much like Aster Data did in <a href="http://www.dbms2.com/2009/10/30/aster-data-application-server-ncluster/" >Aster 4.0</a> and now <a href="http://www.dbms2.com/2010/02/22/aster-data-ncluster-4-5/" >Aster 4.5</a>, Netezza is announcing a general parallel big data analytic platform strategy. It is called Netezza TwinFin(i), it is a chargeable option for the <a href="http://www.dbms2.com/2009/07/30/netezza-new-product-family/" >Netezza TwinFin</a> appliance, and many announced details are on the vague side, with Netezza promising more clarity at or before its Enzee Universe conference in June. At a high level, the Aster and Netezza approaches compare/contrast as follows:<span id="more-1613"></span></p>
<ul>
<li>Netezza&#8217;s software runs on well-designed proprietary hardware. Aster runs on hardware that&#8217;s more off-the-shelf.</li>
<li>Aster was first to ship, and will also be first to ship an IDE (Integrated Development Environment).</li>
<li>MapReduce is central to Aster&#8217;s approach. Netezza TwinFin(i) supports MapReduce too, specifically a Hadoop implementation, but I don&#8217;t get the sense that everything Netezza does is built on MapReduce underpinnings.</li>
<li>Both Aster and Netezza try to provide rich functionality for creating in-memory data structures parallel analytic programs can use. Both seem to let you escape from the pure relational-table paradigm more easily than, say, Teradata&#8217;s new persistent memory capabilities do.</li>
<li>Aster and Netezza have made different choices about what kinds of prebuilt analytic packages to offer. Netezza could actually leapfrog Aster in this regard, but let&#8217;s see where each vendor is by, say, mid-year. If you care about the details of built-in analytic functions, you really should consider executing non-disclosure agreements with both those companies.</li>
<li>Both Aster and Netezza stress that you can run analytic functions out-of-process, greatly reducing the chance that they crash the database. Netezza and I&#8217;m pretty sure also Aster also retain the option of running in-process, which provides maximum performance. (In Netezza&#8217;s case C++ is the only in-process language supported, and I think Aster has a similar limitation.)</li>
<li>Like Aster, Netezza is integrating SQL queries and other analytic processing under the same workload management rubric.</li>
<li>Much like Aster, Netezza is tap-dancing by implying much richer forthcoming SAS support than anything currently announced. (The crunch-per-paragraph ratio in either vendor&#8217;s SAS-related press releases to date is distressingly low.)</li>
</ul>
<p>More specifically, here are some highlights of what I know, am guessing, and/or am allowed to say about Netezza TwinFin(i) at this time.</p>
<ul>
<li>The foundation for the analytic add-ons in Netezza TwinFin(i) is some sort of low-level “analytic executables.” Not understanding exactly what these are is my biggest area of confusion in the whole TwinFin(i) stack. Are they all C++, with everything translated into same? Is there Java all the way down as an alternative? (E.g., Hadoop is written in Java.) Anyhow, whatever it is, it&#8217;s surely a big improvement on <a href="../../../../../2007/09/27/the-netezza-developer-network/">Netezza&#8217;s prior Verilog-based generation of analytic extensibility technology</a>.</li>
<li>The announced list of languages supported in Netezza TwinFin(i) is Java, Python, Fortran, R, and C/C++. More are coming.</li>
<li>Netezza has named a lot of analytic functions it is adding, and hinting about more to come. It has named <a href="http://cran.r-project.org/" onclick="javascript:pageTracker._trackPageview('/cran.r-project.org');">CRAN/R</a> and GNU libraries, saying those have 1900 or more functions each. Netezza has also built its own linear algebra library for TwinFin(i), called nzMatrix. And as previously noted, TwinFin(i) also boasts a Hadoop implementation.</li>
<li>I haven&#8217;t heard about much in the way of TwinFin(i)-specific IDE support.</li>
<li>I don&#8217;t really have details as to what kinds of in-memory data structures Netezza TwinFin(i) does or doesn&#8217;t support.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/02/22/netezza-twinfin/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>More patent nonsense &#8212; Google MapReduce</title>
		<link>http://www.dbms2.com/2010/02/11/google-mapreduce-patent/</link>
		<comments>http://www.dbms2.com/2010/02/11/google-mapreduce-patent/#comments</comments>
		<pubDate>Thu, 11 Feb 2010 19:29:57 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Google]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Parallelization]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1565</guid>
		<description><![CDATA[Google recently received a patent for MapReduce. The first and most general claim is (formatting and emphasis mine):
A system for large-scale processing of data, comprising:

a plurality of processes executing on a plurality of interconnected processors;
the plurality of processes including a master process, for coordinating a data processing job for processing a set of input data, [...]]]></description>
			<content:encoded><![CDATA[<p>Google recently received a <a href="http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&amp;Sect2=HITOFF&amp;d=PALL&amp;p=1&amp;u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&amp;r=1&amp;f=G&amp;l=50&amp;s1=7,650,331.PN.&amp;OS=PN/7,650,331&amp;RS=PN/7,650,331" onclick="javascript:pageTracker._trackPageview('/patft.uspto.gov');">patent</a> for MapReduce. The first and most general claim is (formatting and emphasis mine):<span id="more-1565"></span></p>
<blockquote><p>A system for large-scale processing of data, comprising:</p>
<ul>
<li>a plurality of processes executing on a plurality of interconnected processors;</li>
<li>the plurality of processes including a master process, for coordinating a data processing job for processing a set of input data, and worker processes;</li>
<li>the master process, in response to a request to perform the data processing job, assigning input data blocks of the set of input data to respective ones of the worker processes;</li>
<li>each of a first plurality of the worker processes <strong>including an application-independent map module</strong> for retrieving a respective input data block assigned to the worker process by the master process and <strong>applying an application-specific map operation</strong> to the respective input data block to produce intermediate data values, wherein at least a subset of the intermediate data values each comprises a <strong>key/value pair,</strong> and wherein at least two of the first plurality of the worker processes operate simultaneously so as to perform the application-specific map operation in <strong>parallel</strong> on distinct, respective input data blocks;</li>
<li>a partition operator for processing the produced intermediate data values to produce a plurality of intermediate data sets, wherein each respective intermediate data set includes <strong>all key/value pairs for a distinct set of respective keys,</strong> and wherein at least one of the respective intermediate data sets includes respective ones of the key/value pairs produced by a plurality of the first plurality of the worker processes;</li>
<li>and each of a second plurality of the worker processes including <strong>an application-independent reduce module for retrieving data,</strong> the retrieved data comprising at least a subset of the key/value pairs from a respective intermediate data set of the plurality of intermediate data sets and applying <strong>an application-specific reduce operation</strong> to the retrieved data to produce final output data corresponding to the distinct set of respective keys in the respective intermediate data set of the plurality of intermediate data sets, and wherein at least two of the second plurality of the worker processes operate simultaneously so as to perform the application-specific reduce operation in <strong>parallel</strong> on multiple respective subsets of the produced intermediate data values.</li>
</ul>
</blockquote>
<p><em>The way a patent works is that you make a big claim and, just in case it&#8217;s later invalidated, you also make more specialized sub-claims. What&#8217;s more, in a software patent, you claim everything twice, once as a &#8220;system&#8221; and once as a &#8220;method.&#8221;</em></p>
<p>When a patent takes that long to issue and has a core claim that wordy, one can assume there was much back and forth with the PTO (Patent and Trademark Office) to whittle it down to something they felt they could approve. At a guess, I&#8217;d conjecture that the supposedly unique parts of the claim are concentrated in the areas I bolded above, and that the PTO doesn&#8217;t think the claim would be patentable unless most or all of them were included.</p>
<p>So should the claim have been approved even so? Let&#8217;s consider prior art. <a href="../../../../../2009/10/06/oracle-mapreduce/">Oracle has long been able to parallelize ala MapReduce</a>. I don&#8217;t see anything in the claim that isn&#8217;t preceded by what Oracle did, except maybe the emphasis on key/value pairs. (And the same statement applies to the other 15 claims in the patent, at least on a quick skim.) I forget the details of SenSage&#8217;s quasi-MapReduce, which also preceded the Google patent filing, but I imagine something similar would be true about it.</p>
<p>There is no doubt that Google popularized the ideas of MapReduce &#8212; which turns out to have been a worthy public service. In one great example of that popularization, <a href="http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf" onclick="javascript:pageTracker._trackPageview('/www.cs.stanford.edu');">the seminal paper on parallel data mining</a> is almost laughable in how it <a href="../../../../../2009/10/15/mapreduce-webinar-slides/">deviates from MapReduce key/value pair formalism</a> &#8212; but it still seems to have been inspired by Google&#8217;s MapReduce. But that&#8217;s a different matter; popularization != invention, even though there&#8217;s a certain connection between the two in patent law. Actually, Google also often does get credit for having &#8220;invented&#8221; MapReduce, including regrettably in the marketing materials of clients I can&#8217;t talk out of saying that and which now might be looking into the barrel of the Google patent (hello Aster); but again, saying something doesn&#8217;t make it enforceable in court.</p>
<p>So what it all boils down to is:</p>
<p><strong>Should Google&#8217;s patent on the idea of parallelizing the handling of sets of application-visible key/value pairs be regarded as valid?</strong></p>
<p>The United States PTO, which is paid to think about these things, has evidently decided Yes. I disagree. In simplest terms, my reason is that key/value pairs have been around for decades, and so:</p>
<p><strong>Anything which was known or obvious without special reference to key/value pairs doesn&#8217;t suddenly become non-obvious when key/value pairs are mixed in.</strong></p>
<p>If Google ever tries to enforce its MapReduce patent, I&#8217;m available as an expert witness for the other side.</p>
<p><strong><em>Related links</em></strong></p>
<ul>
<li><a href="http://gigaom.com/2010/01/19/why-hadoop-users-shouldnt-fear-googles-new-mapreduce-patent/" onclick="javascript:pageTracker._trackPageview('/gigaom.com');">GigaOm</a> and <a href="http://arstechnica.com/open-source/news/2010/01/googles-mapreduce-patent-what-does-it-mean-for-hadoop.ars" onclick="javascript:pageTracker._trackPageview('/arstechnica.com');">Ars Technica</a> on the Google MapReduce patent</li>
<li>Another <a href="http://www.dbms2.com/2010/01/15/vertica-sybase-ipatent-litigation/" >silly software patent</a> issue</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/02/11/google-mapreduce-patent/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Interesting trends in database and analytic technology</title>
		<link>http://www.dbms2.com/2010/01/31/trends-database-aanalytic-technology/</link>
		<comments>http://www.dbms2.com/2010/01/31/trends-database-aanalytic-technology/#comments</comments>
		<pubDate>Mon, 01 Feb 2010 02:11:17 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Memory-centric data management]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Presentations]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Storage]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1492</guid>
		<description><![CDATA[My project for the day is blogging based on my “Database and analytic technology: State of the union” talk of a few days ago. (I called it that because of when it was given, because it mixed prescriptive and descriptive elements, and because I wanted to call attention to the fact that I cover the [...]]]></description>
			<content:encoded><![CDATA[<p>My project for the day is blogging based on my “<a href="http://www.dbms2.com/2009/11/25/new-england-database-summit-january-28-2010/" >Database and analytic technology: </a><a href="http://www.dbms2.com/2009/11/25/new-england-database-summit-january-28-2010/" >State of the union</a>” talk of a few days ago. (I called it that because of when it was given, because it mixed prescriptive and descriptive elements, and because I wanted to call attention to the fact that I cover the <em>union</em> of database and analytic technologies – the <em>intersection</em> of those two sectors is an area of particular focus, but is far from the whole of my coverage.)</p>
<p>One section covered recent/ongoing/near-future trends that I thought were particularly interesting, including:<span id="more-1492"></span></p>
<p><strong>Simpler database technology,</strong> by which I mean DBMS that are:</p>
<ul>
<li>Easier 	to administer than market-leading systems &#8230;</li>
<li>… even if at the cost of being special-purpose</li>
<li>E.g.,
<ul>
<li>MySQL and older mid-tier RDBMS such as Progress</li>
<li>Many analytic DBMS and appliances, most notably Netezza&#8217;s</li>
</ul>
</li>
</ul>
<p>For general purpose or OLTP uses, I&#8217;m not a big fan of MySQL (not enough progress in making it industrial-strength), PostgreSQL (no good company behind it – I&#8217;m a non-fan of EnterpriseDB), or Ingres (open source or not, it&#8217;s an antiquated system that hasn&#8217;t been invested in as much as Oracle, DB2 or SQL Server).</p>
<p>But I get the impression there are a lot of contenders among small startups, featuring very new architectures for OLTP or general-purpose database management. VoltDB comes to mind. NimbusDB is finally within range of getting funded. Dan Weinreb told me Friday he knows of a bunch of others as well. And that&#8217;s all before we even get into the <a href="http://www.dbms2.com/2009/12/12/legit-nosql-key-value-store/" >NoSQL</a> kind of alternative.</p>
<p><strong>Flexible storage architectures.</strong> That&#8217;s starting out with an emphasis on hybrid columnar, as in the examples of <a href="http://www.dbms2.com/2009/08/04/pax-analytica-row-and-column-stores-begin-to-come-together/" >Vertica</a> and <a href="http://www.dbms2.com/2009/10/14/greenplum-hybrid-columnar/" >Greenplum</a>. Oracle (to whom I&#8217;m under no NDA obligation) and other vendors (to whom I am) are going that way as well.</p>
<p><strong>Multi-tier database architectures,</strong> by which I mean at least two things:</p>
<ul>
<li>The database tier/server tier split of Exadata</li>
<li>Hybrid RAM/disk architectures, examples of which include
<ul>
<li>Vertica&#8217;s RAM-based write-optimized store</li>
<li><a href="http://www.dbms2.com/2009/10/18/introduction-to-sensage/" >Sensage&#8217;s CEP-in-the-DBMS</a></li>
<li>This in-memory analytics stuff we keep hearing about from the BI vendors</li>
<li>Any true in-memory/disk hybrid, such as the regrettably sidelined <a href="http://www.dbms2.com/2007/12/21/ibm-acquires-soliddb/" >solidDB</a></li>
<li>Smart thinking by numerous DBMS vendors about optimizing the use of RAM and/or Level 2 cache</li>
</ul>
</li>
</ul>
<p>Netezza is particularly interesting to watch in this regard because it:</p>
<ul>
<li>Had a pretty strict storage/other processing split in prior product generations and &#8230;</li>
<li>… <a href="http://www.dbms2.com/2009/07/30/netezza-new-product-family/" >ditched that in its latest generation</a> …</li>
<li>… which however is focused on optimizing the use of RAM cache</li>
</ul>
<p>Also noteworthy is Petascan, the stealth-mode –and therefore harder to watch right now <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  – company I keep teasing about, which makes a strong case for carrying the database/storage tier split into the flash/solid-state memory technology generation. <a href="../2009/04/20/calpont-update-you-read-it-here-first/">Calpont</a> also has a server/storage tier split, but that&#8217;s of mainly theoretical interest unless and until Calpont actually ships an MPP version of <a href="../2009/11/07/calponts-infinidb/">InfiniDB</a>.</p>
<p><strong>Cheaper parts,</strong> which have of course been a huge trend for decades.<a href="../2010/01/31/flash-pcmsolid-state-memory-disk/"> Solid-state memory</a> will soon conquer the world. Meanwhile, cheaper sensors drive that <a href="../2010/01/17/three-broad-categories-of-data/">machine-generated data</a> I keep talking about.</p>
<p>An ever-better understanding of <strong>scale-out technology,</strong> in several respects, including:</p>
<ul>
<li>Query, notably data movement for MPP DBMS</li>
<li>Update, especially minimalistic DBMS approaches, be they sharded MySQL or more NoSQLish</li>
<li>Number-crunching, especially via MapReduce and/or parallel analytic libraries integrated into DBMS</li>
</ul>
<p>Cool trends I touched on more briefly include:</p>
<ul>
<li>More data being available for analysis. This was a core theme of my <a href="http://www.dbms2.com/2009/07/30/netezza-enzee-universe/" >Enzee Universe keynote speeches</a>; there are also some notes on it in my 	post based on my <a href="http://www.dbms2.com/2009/11/23/boston-big-data-summit-keynote-outline/" >Boston Big Data Summit</a> talk.</li>
<li>More users being served by analytics. Ditto.</li>
<li>Data exploration/visualization, ala QlikView, Spotfire, or Tableau, and also the faceted stuff.</li>
<li>The democratization of data mining. But I&#8217;m not as sure of that one as of the others&#8230;</li>
</ul>
<p>One area I flat-out forgot to mention is <a href="http://www.dbms2.com/2009/06/08/the-future-of-data-marts/" >easy data mart spin-out</a>.</p>
<p><em><strong>Other posts based on my January, 2010 New England Database Summit keynote address</strong></em></p>
<ul>
<li><a title="Data-based snooping — a huge threat to liberty that we’re all helping make worse" href="../2010/01/31/data-based-snooping-threat-libert/">Data-based snooping — a huge threat to liberty that we’re all helping make worse</a></li>
<li><a title="Flash, other solid-state memory, and disk" href="../2010/01/31/flash-pcmsolid-state-memory-disk/">Flash, other solid-state memory, and disk</a></li>
<li><a title="Open issues in database and analytic technology" href="../2010/02/01/open-issues-in-database-and-analytic-technology/">Open issues in database and analytic technology</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/01/31/trends-database-aanalytic-technology/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Clearing up MapReduce confusion, yet again</title>
		<link>http://www.dbms2.com/2009/12/30/clearing-up-mapreduce-confusion-yet-again/</link>
		<comments>http://www.dbms2.com/2009/12/30/clearing-up-mapreduce-confusion-yet-again/#comments</comments>
		<pubDate>Wed, 30 Dec 2009 10:50:53 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[SenSage]]></category>
		<category><![CDATA[Splunk]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1371</guid>
		<description><![CDATA[I&#8217;m frustrated by a constant need &#8212; or at least urge   &#8212; to correct myths and errors about MapReduce. Let&#8217;s try one more time:

MapReduce was named and popularized &#8212; but not invented &#8212; by Google.
&#8220;MapReduce&#8221; variously refers to:

A programming paradigm
Execution engines that implement the programming paradigm
Distributed file systems that work with the execution [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m frustrated by a constant need &#8212; or at least urge <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  &#8212; to correct <a href="http://www.dbms2.com/2009/10/18/three-big-myths-about-mapreduce/" >myths and errors about MapReduce</a>. Let&#8217;s try one more time:<span id="more-1371"></span></p>
<ul>
<li>MapReduce was named and popularized &#8212; but not invented &#8212; by Google.</li>
<li>&#8220;MapReduce&#8221; variously refers to:
<ul>
<li>A programming paradigm</li>
<li>Execution engines that implement the programming paradigm</li>
<li>Distributed file systems that work with the execution engines</li>
</ul>
</li>
<li>In particular, Hadoop is a MapReduce execution engine that includes or is closely associated with HDFS (Hadoop Distributed File System).</li>
<li>MapReduce and analytic DBMS can interact in a number of different ways, including:
<ul>
<li>Tight integration between a DBMS and exposed MapReduce functionality, e.g. <a href="http://www.dbms2.com/2009/10/15/mapreduce-webinar-slides/" >Aster Data&#8217;s SQL/MapReduce</a> or Greenplum.</li>
<li>Integrated MapReduce &#8220;under the covers&#8221;, e.g. SenSage or <a href="http://www.dbms2.com/2009/10/06/oracle-mapreduce/" >Oracle</a>. This may or may not follow all the rules Google laid out for MapReduce, but it&#8217;s at least similar in spirit.</li>
<li>Looser coupling between DBMS and a MapReduce system, e.g. <a href="http://www.dbms2.com/2009/08/04/verticas-version-of-mapreduce-integration/" >Vertica/Hadoop</a>, in which MapReduce may or may not run on a different cluster than the DBMS.</li>
<li>Not at all, except perhaps insofar as a quasi-DBMS such as <a href="http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/" >Hive</a> is implemented over a MapReduce system such as Hadoop/HDFS.</li>
</ul>
</li>
<li>As predicted by <a href="http://www.strategicmessaging.com/monashs-first-law-of-commercial-semantics-explained/2009/01/09/" onclick="javascript:pageTracker._trackPageview('/www.strategicmessaging.com');">Monash&#8217;s First Law of Commercial Semantics</a>, different vendors have individual variants on those themes. For example, as per <a href="http://www.splunk.com/product" onclick="javascript:pageTracker._trackPageview('/www.splunk.com');">a registration-required white paper</a>, Splunk is moving to publicly expose a not-quite-complete form of MapReduce.</li>
<li>MapReduce implementations such as Hadoop are sometimes regarded as part of the <a href="http://www.dbms2.com/2009/12/12/legit-nosql-key-value-store/" >NoSQL</a> &#8220;movement&#8221;. When they are, many generalities about NoSQL &#8212; such as that it doesn&#8217;t deal with analytics &#8212; are falsified.</li>
<li>So far as I can tell, mainstream enterprise (as opposed to web, scientific, investment, etc.) data mining folks may be looking at MapReduce for data mining, but they haven&#8217;t done much to adopt it yet. Probably that&#8217;s because the outfits who have the greatest need are the same ones that have the largest sunk investments in more traditional ways of doing data mining.</li>
<li>Cloudera != Hadoop. On the other hand, if you want to use Hadoop, it makes a lot of sense to do business with Cloudera.</li>
<li>Non-DBMS MapReduce != Hadoop. On the other hand, Hadoop is the default choice for non-DBMS MapReduce.</li>
<li>MapReduce != Hadoop, period. DBMS-based MapReduce is also a legitimate technical strategy.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/12/30/clearing-up-mapreduce-confusion-yet-again/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Webinar on MapReduce for complex analytics (Thursday, December 3, 10 am and 2 pm Eastern)</title>
		<link>http://www.dbms2.com/2009/12/02/mapreduce-for-complex-analytics-webina/</link>
		<comments>http://www.dbms2.com/2009/12/02/mapreduce-for-complex-analytics-webina/#comments</comments>
		<pubDate>Wed, 02 Dec 2009 20:57:50 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Data integration and middleware]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1267</guid>
		<description><![CDATA[The second in my two-webinar series for Aster Data will occur tomorrow, twice (both live), at 10 am and 2 pm Eastern time. The other presenters will be Jonathan Goldman, who was Principal Scientist at LinkedIn but now has joined Aster himself, and Steve Wooledge of Aster (playing host). Key links are:

Registration for tomorrow&#8217;s webinars
Replay [...]]]></description>
			<content:encoded><![CDATA[<p>The second in my two-webinar series for Aster Data will occur tomorrow, twice (both live), at 10 am and 2 pm Eastern time. The other presenters will be Jonathan Goldman, who was Principal Scientist at LinkedIn but now has joined Aster himself, and Steve Wooledge of Aster (playing host). Key links are:</p>
<ul>
<li>Registration for <a href="http://www.asterdata.com/wc_091203_masteringmapreduce/" onclick="javascript:pageTracker._trackPageview('/www.asterdata.com');">tomorrow&#8217;s webinars</a></li>
<li>Replay of the <a href="http://www.asterdata.com/masteringmapreduce2/" onclick="javascript:pageTracker._trackPageview('/www.asterdata.com');"> first webinar</a></li>
<li>My slides from the <a href="http://www.dbms2.com/2009/10/15/mapreduce-webinar-slides/" >first webinar</a></li>
</ul>
<p>The main subjects of the webinar will be:</p>
<ul>
<li>Some review of material from the first webinar (all three presenters)</li>
<li>Discussion of how MapReduce can help with three kinds of analytics:
<ul>
<li>Pattern matching (Jonathan will give detail)</li>
<li>Number-crunching (I&#8217;ll cover that, and it will be short)</li>
<li>Graph analytics (I haven&#8217;t written the slides yet, but my starting point will be some of the <a href="http://www.dbms2.com/2009/08/21/social-network-analysis-aka-relationship-analytics/" >relationship analytics</a> ideas we discussed in August)</li>
</ul>
</li>
</ul>
<p>Arguably, aspects of data transformation fit into each of those three categories, which may help explain why data transformation has been so prominent among the early applications of MapReduce.</p>
<p>As you can see from Aster&#8217;s title for the webinar (which they picked while I was on vacation), at least their portion will be focused on customer analytics, e.g. web analytics.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/12/02/mapreduce-for-complex-analytics-webina/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Boston Big Data Summit keynote outline</title>
		<link>http://www.dbms2.com/2009/11/23/boston-big-data-summit-keynote-outline/</link>
		<comments>http://www.dbms2.com/2009/11/23/boston-big-data-summit-keynote-outline/#comments</comments>
		<pubDate>Mon, 23 Nov 2009 06:25:50 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Archiving and information preservation]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[DBMS product categories]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Humor]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Market share]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Presentations]]></category>
		<category><![CDATA[Pricing]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Storage]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Theory and architecture]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1227</guid>
		<description><![CDATA[Last month, Bob Zurek asked me to give a talk on “Big Data”, where “big” is anything from a few terabytes on up, then moderate a panel on cloud computing. We agreed that I could talk just from notes, without slides. So, since I have them typed up, I&#8217;m posting them below.

The top two points [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Last month, Bob Zurek asked me to give a talk on <a href="http://www.dbms2.com/2009/10/09/presentations-upcoming/" >“Big Data”, where “big” is anything from a few terabytes on up</a>, then moderate a panel on cloud computing. We agreed that I could talk just from notes, without slides. So, since I have them typed up, I&#8217;m posting them below.</p>
<p><span id="more-1227"></span></p>
<p style="margin-bottom: 0in;">The top two points from Q&amp;A probably were:</p>
<ul>
<li><strong>Big Data and the cloud actually 	have relatively little to do with each other,</strong> <a href="http://www.dbms2.com/2009/10/30/aster-data-application-server-ncluster/" >a few exceptions</a> notwithstanding, especially if the data is in a shared-nothing DBMS 	(as opposed to, say, a MapReduce-oriented file cluster). Two 	principal reasons are:
<ul>
<li>Redistributing data from node to 	node is a little slow, undermining some of the elasticity benefits 	of the cloud.</li>
<li><a href="http://www.dbms2.com/2009/05/29/sneakernet-to-the-cloud/" >Getting data into the cloud in the 	first place is a lot slow</a>.</li>
</ul>
</li>
<li><strong>The NoSQL movement is a lot like 	the Ron Paul campaign</strong> &#8212; it consists of people who are dissatisfied 	with the status quo, whose dissatisfaction has a lot to do with 	insufficient liberty and/or excessive expenditure, and who otherwise 	don&#8217;t have a whole lot in common with each other.</li>
</ul>
<p style="margin-bottom: 0in;">Anyhow, here are my notes for the talk, edited in just a couple of places for readability or linkage.</p>
<p style="margin-bottom: 0in;">
<p style="margin-bottom: 0in;">
<p style="margin-bottom: 0in;"><strong>Quick introduction</strong></p>
<ul>
<li>Big Data vs. cloud</li>
<li>How big is Big Data?</li>
<li>At the low end of that range, 	there&#8217;s little you can&#8217;t do with conventional technology if you 	have:
<ul>
<li>An unlimited budget for hardware</li>
<li>An unlimited budget for software</li>
<li>An unlimited budget for people, 	especially Oracle DBAs</li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;"><strong>Big Data in OLTP</strong></p>
<ul>
<li>Hard-core OLTP
<ul>
<li>Focus of DBMS technology for a 	long-time</li>
<li>Big budgets because each 	transaction has significant value</li>
<li>Tough to get users to change 	technologies</li>
</ul>
</li>
<li>Lighter-weight OLTP
<ul>
<li>Classic example = web companies
<ul>
<li>Big ones &#8212;  retail-oriented ones 	(eBay, Amazon) partially excepted &#8212; <a href="http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/" >rolled their own technology 	stacks</a></li>
<li>Reluctant to give money to anybody
<ul>
<li>Open source, etc.</li>
</ul>
</li>
</ul>
</li>
<li>Difficulty finding market
<ul>
<li>Product vs. feature
<ul>
<li>Clustering/HA/DR/whatever</li>
<li>Ditto cloud enablement</li>
</ul>
</li>
<li>True products haven&#8217;t found much 	traction yet</li>
</ul>
</li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;"><strong>Analytic Big Data use cases</strong></p>
<ul>
<li>Kinds of data for analytics
<ul>
<li>More of same != big</li>
<li>More detail and/or new kinds
<ul>
<li>Complete data sets</li>
<li>Transactions</li>
<li>Call details</li>
<li>Tick/trade history</li>
<li>Web clickstreams</li>
<li>Network event logs</li>
<li>Other machine-generated data</li>
<li>CAM bottom line
<ul>
<li>Anything human-generated should 	and will be retained in its entirety</li>
<li>Quantities of machine-generated 	data retained should and will grow roughly in line w/ computing cost 	reductions (Moore&#8217;s Law, etc.)</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
<li>Analytic uses of Big Data
<ul>
<li>Analytics is mainly about three 	things
<ul>
<li>Problem detection</li>
<li>Customer relationship improvement
<ul>
<li>(Those overlap when the customer 	relationship is bad)</li>
</ul>
</li>
<li>Financial statements on steroids</li>
</ul>
</li>
</ul>
<ul>
<li>Main kinds of analytics
<ul>
<li>What BI vendors traditionally sell
<ul>
<li>General reporting and dashboards</li>
<li>Ad-hoc query (now driven from 	those reports and dashboards)</li>
<li>Planning (allegedly integrated 	with BI)</li>
</ul>
</li>
<li>Research
<ul>
<li>Ad hoc relational query (worth 	mentioning twice because it drives so much of the market)</li>
<li>Data mining</li>
<li>Most web search and web mining</li>
</ul>
</li>
<li>Operational/near-real-time</li>
<li>Archiving/compliance</li>
</ul>
</li>
<li>What gets Big?
<ul>
<li>Mainly research and archiving</li>
<li>But when reporting or operational 	get Big, you have really interesting computing problems</li>
</ul>
</li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;"><strong>Technology issues and trends</strong></p>
<ul>
<li>Moore&#8217;s Law
<ul>
<li>CPUs &#8212; All about cores, hence 	parallelism is key</li>
<li>RAM</li>
<li>SSDs – hence replace disks</li>
<li>Sensors – hence generate lots 	more data</li>
</ul>
</li>
<li>Kryder&#8217;s Law
<ul>
<li>But <a href="http://www.dbms2.com/2005/11/13/breaking-the-disk-speed-barrier/" >rotational speeds up only 	12.5X since Eisenhower Administration</a></li>
<li>Hence solid-state memory (or RAM) 	will soon take over</li>
</ul>
</li>
<li>In the mean time, I/O bottlenecks 	have had to be beaten
<ul>
<li>Hence sequential scans</li>
<li>Hence <a href="http://www.dbms2.com/2007/03/26/index-light-mpp-data-warehouse-appliances/" >index-light</a> architectures</li>
<li>Hence columnar</li>
</ul>
</li>
<li>DBMS “overhead”
<ul>
<li>Raw license and maintenance fees – 	software increasing fraction of total</li>
<li>OLTP vestiges – locking and all 	that</li>
<li>DBAs
<ul>
<li>People costs = huge fraction of 	total</li>
<li>Index-lightness addresses</li>
<li>So does appliance</li>
</ul>
</li>
<li>Many people don&#8217;t really know how to 	write SQL</li>
</ul>
</li>
<li>Configuration
<ul>
<li>Appliance/tightly-balanced
<ul>
<li>Netezza</li>
<li>Teradata earlier</li>
<li>Greenplum/Sun</li>
<li>Oracle</li>
<li>IBM</li>
<li>Microsoft/Madison</li>
</ul>
</li>
<li>Commodity/do what you want
<ul>
<li>Vertica</li>
<li>Greenplum now</li>
<li>Infobright, Aster and others</li>
<li>MapReduce-oriented file systems</li>
</ul>
</li>
<li><a href="http://www.dbms2.com/2009/10/25/data-warehouse-balanced-hardware-configuration/" >Extreme rigidity is silly</a>
<ul>
<li><a href="http://www.dbms2.com/2009/10/25/teradata-hardware-strategy-and-tactics/" >Teradata, Oracle have both 	signaled moving to more modularity</a></li>
<li>Big driver of that = heterogeneous 	storage
<ul>
<li>Cheap disk</li>
<li>Expensive disk</li>
<li>Solid-state</li>
<li>RAM</li>
</ul>
</li>
</ul>
<ul>
<li>CPU/storage ratio is even more of a 	driver</li>
</ul>
</li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;"><strong>Theoretically defensible ways to segment the market</strong></p>
<ul>
<li><a href="http://www.dbms2.com/2009/09/10/analytic-speed-latency/" >Latency requirements</a>
<ul>
<li>High availability and low latency 	go together</li>
</ul>
</li>
<li>Query types
<ul>
<li>Simultaneous users for same</li>
</ul>
</li>
<li>Database size</li>
<li>Budget</li>
</ul>
<p style="margin-bottom: 0in;"><strong>Actual segments right now</strong></p>
<ul>
<li><a href="http://www.dbms2.com/2009/08/24/teradatas-active-enterprise-data-warehouse-story/" >Utter ADW/EDW</a></li>
<li>Data mart
<ul>
<li>Size</li>
<li>Naturally columnar vs. naturally 	row-based</li>
</ul>
</li>
<li>Operational/frontline</li>
<li>Less dramatic/smaller EDW</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/11/23/boston-big-data-summit-keynote-outline/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Aster Data 4.0 and the evolution of &#8220;advanced analytic(s) servers&#8221;</title>
		<link>http://www.dbms2.com/2009/10/30/aster-data-application-server-ncluster/</link>
		<comments>http://www.dbms2.com/2009/10/30/aster-data-application-server-ncluster/#comments</comments>
		<pubDate>Sat, 31 Oct 2009 01:56:55 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Market share]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1198</guid>
		<description><![CDATA[Since Linda and I are leaving on vacation in a few hours, Aster Data graciously gave me permission to morph its “12:01 am Monday, November 2” embargo into “late Friday night.”
Aster Data is officially announcing the 4.0 release of nCluster. There are two big pieces to this announcement:

Aster is 	offering a slick vision for integrating [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;"><em>Since Linda and I are leaving on vacation in a few hours, Aster Data graciously gave me permission to morph its “12:01 am Monday, November 2” embargo into “late Friday night.”</em></p>
<p style="margin-bottom: 0in; font-style: normal;">Aster Data is officially announcing the 4.0 release of nCluster. There are two big pieces to this announcement:</p>
<ul>
<li>Aster is 	offering a slick vision for integrating big-database management and 	general analytic processing on the same MPP cluster, under the 	not-so-slick name “Data-Application Server.”</li>
<li>Aster is also 	offering a sophisticated vision for workload management.</li>
</ul>
<p style="margin-bottom: 0in; font-style: normal;">In addition, Aster has matured nCluster in various ways, for example cleaning up a performance problem with single-row updates.</p>
<p style="margin-bottom: 0in; font-style: normal;">Highlights of the Aster “Data-Application Server” story include:<span id="more-1198"></span></p>
<ul>
<li>At its core, 	the Aster “Data-Application Server” is the Aster nCluster MPP 	analytic DBMS, enhanced with basic application server functionality 	(I didn&#8217;t ask for details of that part), running on the same 	nCluster worker nodes that answer SQL queries.</li>
<li>Thus, Aster is 	eliminating a lot of the data movement that plagues three-tier 	architectures and other less-integrated approaches.</li>
<li>The Aster 	“Data-Application Server” further offers integrated workload 	management for applications and queries; more on that below.</li>
<li>The Aster 	“Data-Application Server” requires applications to be 	parallelized and invoked via Aster&#8217;s <a href="../2009/10/15/mapreduce-webinar-slides/">SQL/MapReduce.</a></li>
<li>As befits a 	MapReduce-based system, the Aster “Data-Application Server” lets 	you write your applications in lots of different languages (the 	usual suspects, and it also does .NET).</li>
<li>The Aster 	“Data-Application Server” runs applications in their own process 	spaces, protecting the DBMS server from crashes and other damaging 	behavior.</li>
<li>The Aster 	“Data-Application Server” allows applications to manage memory 	themselves, persistently, and not just via relational constructs. 	Thus, if you want your application to maintain a graph, mini rules 	engine, and/or finite state machine, you can, without doing SQL 	contortions.</li>
</ul>
<p style="margin-bottom: 0in; font-style: normal;">In a compelling proof point for the Aster Data-Application Server&#8217;s slickness, Aster has leapfrogged Teradata and Netezza in the extent to which SAS functionality is integrated into Aster&#8217;s DBMS. (Aster and SAS both say that you can do full SAS modeling in parallel on Aster, but even so I wouldn&#8217;t be surprised to discover there were some parts of SAS&#8217; system that turned out to be exceptions.) Of course, Aster is hardly the only analytic DBMS vendor to have the idea of explicitly enhancing general analytic processing; that&#8217;s why we see lots of MapReduce announcements, and it&#8217;s also why Teradata enhanced its UDFs (User-Defined Functions) to have some kind of persistent memory.* But I don&#8217;t know of anybody else whose approach is quite so elegant and general at this time.</p>
<p style="margin-bottom: 0in;"><em>*Unfortunately, I don&#8217;t yet know much about Teradata&#8217;s UDF enhancements. I neglected to drill down on Global Persistent Memory when it was mentioned a couple of times at Teradata Partners last week, and Teradata was unable to accommodate my request this week for a rapid follow-up briefing on the subject.</em></p>
<p style="margin-bottom: 0in; font-style: normal;">Aster&#8217;s approach to workload management is similarly stylish. The idea is:</p>
<ul>
<li>Lots of 	variables are available to be taken into account (e.g., user role, 	expected query duration, actual duration of a running query, etc.)</li>
<li>SQL statements 	can be written against any of these variables.</li>
<li>The SQL 	statements serve as rules to set query/task priorities.</li>
<li>There seem to 	be a few different ways to measure priority, including explicit 	allocation of CPU or I/O resources, as well as more conventional 	“This group of queries gets higher priority than that one” 	kinds of metrics.</li>
<li>The whole 	thing provides integrated workload management for queries, 	applications, load jobs, data redistribution, and so on.</li>
</ul>
<p style="margin-bottom: 0in; font-style: normal;">Right now the interface is – well, you&#8217;re manipulating a SQL table. A more conventional workload management GUI is slated for the second quarter of 2010.</p>
<p style="margin-bottom: 0in; font-style: normal;">Discussing subjects such as mirroring and ILM (Information Lifecycle Management) with Aster can be tricky, as Aster uses the word “partition” in confusing ways. Anyhow, Aster has a few different levels of compression, and the ability to apply different levels of compression to different partitions, to change compression levels via ALTER TABLE, and to alter (presumably increase) compression on the fly when doing online backup. Aster is also part of a growing trend to eschew RAID, instead doing mirroring in its own software.  (Other examples of this strategy would be <span><a href="http://www.dbms2.com/2009/10/06/oracle-and-vertica-on-compression-and-other-physical-data-layout-features/" >Vertica</a>, <a href="http://www.dbms2.com/2008/09/28/oracle-database-machine-performance-and-compression/" >Oracle Exadata/ASM</a>, and <a href="http://www.dbms2.com/2009/10/25/teradata-hardware-strategy-and-tactics/" >Teradata Fallback</a>.) </span><span>Prior to nCluster 4.0, this caused a problem, in that the block sizes for mirroring were so large as to create a lag in transactional updating. But Aster says this problem is now solved, and indeed claims that nCluster 4.0 is superior to most rivals in transactional efficiency.</span></p>
<p style="margin-bottom: 0in;">And finally, while I was talking w/ Aster Data anyway, I checked up on cloud and MapReduce customer penetration. The answers were:</p>
<ul>
<li>Aster has two serious production 	cloud users, both of which have been disclosed for a while, namely:
<ul>
<li>ShareThis, which runs Aster 		nCluster on Amazon EC2</li>
<li>Didit, which runs Aster nCluster 		on AppNexus</li>
</ul>
</li>
<li>Outside of those two, Aster sees 	some cloud use for test, development, prototyping, etc.</li>
<li>Every single Aster customer uses 	<a href="../2009/10/15/mapreduce-webinar-slides/">SQL/MapReduce</a> &#8212; i.e., they invoke MapReduce via Aster nCluster SQL queries.</li>
<li>Some of those customers use MapReduce for ETL, some use it 	for actual analytics.</li>
</ul>
<p style="margin-bottom: 0in; font-style: normal;">
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/30/aster-data-application-server-ncluster/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Three big myths about MapReduce</title>
		<link>http://www.dbms2.com/2009/10/18/three-big-myths-about-mapreduce/</link>
		<comments>http://www.dbms2.com/2009/10/18/three-big-myths-about-mapreduce/#comments</comments>
		<pubDate>Sun, 18 Oct 2009 16:14:37 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Michael Stonebraker]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1135</guid>
		<description><![CDATA[Once again, I find myself writing and talking a lot about MapReduce.  But I suspect that MapReduce-related conversations would go better if we overcame three fairly common MapReduce myths:

MapReduce is something very new
MapReduce involves strict 	adherence to the Map-Reduce programming paradigm
MapReduce is a single technology

So let&#8217;s give it a try.
When Dave DeWitt and Mike [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Once again, I find myself writing and talking a lot about MapReduce.  But I suspect that MapReduce-related conversations would go better if we overcame three fairly common MapReduce myths:</p>
<ul>
<li>MapReduce is something very new</li>
<li>MapReduce involves strict 	adherence to the Map-Reduce programming paradigm</li>
<li>MapReduce is a single technology</li>
</ul>
<p style="margin-bottom: 0in;"><span id="more-1135"></span>So let&#8217;s give it a try.</p>
<p style="margin-bottom: 0in;">When Dave DeWitt and Mike Stone<span style="font-style: normal;">braker leveled <a href="../2008/01/18/the-great-mapreduce-debate/">their famous blast at MapReduce</a>, many people thought they overstated their case. But one part of their story – one that both Mike and Dave say was most central to their case – was never effectively refuted, n</span>amely the claim that these ideas aren&#8217;t particularly new. I haven&#8217;t actually read enough computer science literature to have an independent opi<span style="font-style: normal;">nion on that issue. But I&#8217;ll say this – claims from companies such as <a href="../2009/10/18/introduction-to-sensage/">SenSage</a>, <a href="../2009/10/06/oracle-mapreduce/">Oracle</a>, or <a href="../2009/10/18/technical-introduction-to-splunk/">Splunk</a> that “We&#8217;ve be</span>en doing MapReduce all along” seem pretty credible to me.</p>
<p style="margin-bottom: 0in;">True, what those companies were doing things may not have looked exactly like the instant-classic MapReduce programming paradigm. But the same is true of many things almost everybody would agree count as MapReduce.  In particular, it is often not the case that you alternate Map and Reduce steps, each of whose outputs is a set of simple &lt;Key, Value&gt; pairs, with data redistributed based on Key at every step.</p>
<p style="margin-bottom: 0in;">Here are some examples of what I mean, drawn from <a href="http://www.asterdata.com/blog/index.php/2009/10/15/mastering-mapreduce/" onclick="javascript:pageTracker._trackPageview('/www.asterdata.com');">my recent MapReduce webinar</a>.</p>
<ul>
<li>If you do text indexing in 	MapReduce, your goal is to wind up with a text index. So at some 	point you Reduce to a pair &lt;WordName, {all the (DocumentID, 	offset) pairs for the whole corpus, suitably ordered}&gt;.  That&#8217;s a 	heckuva compound “Value”.</li>
<li>The goal of data mining is usually 	to estimate a rather small number of parameters based on a large 	overall data set, often – depending on algorithm – in the form 	of a single vector. When you do that in MapReduce. you partition 	data among nodes, calculate something on each node that is 	structured more or less like your final vector. So when it comes 	time for the reduce, you just ship all of your vectors – one per 	node – to a single Reduce node, and do the appropriate math. 	Redistribution based on Key would be quite pointless.</li>
<li>When you sessionize clickstream 	logs in MapReduce, you may have just as many output records as input 	records. However, they now are reformatted, and might have a 	SessionID appended. In those cases, Reduce isn&#8217;t doing much by the 	way of reduction.</li>
<li>And as I happens in some 	<a href="../2009/08/04/verticas-version-of-mapreduce-integration/">Vertica-Hadoop</a> use cases around mortgage trading, sometimes MapReduce can even make 	data s<span style="font-style: normal;">ets vastly larger.</span></li>
</ul>
<p style="margin-bottom: 0in; font-style: normal;">By no means do I think this is a weakness of the MapReduce programming paradigm. Rather, I think it&#8217;s a MapReduce strength. But it&#8217;s not quite the way MapReduce has been promoted and explained to the IT public.</p>
<p style="margin-bottom: 0in; font-style: normal;">Finally: MapReduce, as commonly conceived, spans two different – albeit closely related – technology domains:</p>
<ul>
<li>Parallel 	programming</li>
<li>Distributed 	data management</li>
</ul>
<p style="margin-bottom: 0in; font-style: normal;">For example, I imagine Greenplum&#8217;s and Vertica&#8217;s MapReduce/SQL combined syntaxes are very similar to each others. But Vertica&#8217;s data management implementation of MapReduce, which relies on Hadoop, is very different from Greenplum&#8217;s, which is tied into the Greenplum DBMS. Similary, non-DBMS MapReduce implementations are commonly associated with distributed file systems – notably HDFS (Hadoop Distributed File Systems) or Google&#8217;s internal GFS (Google File System). In those systems, the parallel language execution part should be aware of how the distributed file management part works – but perhaps that awareness can be pretty lightweight.</p>
<p style="margin-bottom: 0in; font-style: normal;">Right now, this is a distinction pretty much without a difference. If you choose an implementation of MapReduce &#8212; like pure Hadoop (say in the Cloudera distribution) or Hadoop-Vertica or Aster Data&#8217;s SQL/MapReduce – you&#8217;re basically picking an entire technology stack. But those stacks are going to do a whole lot of changing and maturing in the near future – and as they do, it&#8217;s likely that projects will interact or even combine in all sorts of interesting ways.</p>
<p style="margin-bottom: 0in; font-style: normal;"><strong>Bottom line: There are a lot of different ways to exploit MapReduce-related technology.</strong></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/18/three-big-myths-about-mapreduce/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Introduction to SenSage</title>
		<link>http://www.dbms2.com/2009/10/18/introduction-to-sensage/</link>
		<comments>http://www.dbms2.com/2009/10/18/introduction-to-sensage/#comments</comments>
		<pubDate>Sun, 18 Oct 2009 16:02:42 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Complex event processing (CEP)]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[SenSage]]></category>
		<category><![CDATA[Telecommunications]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1115</guid>
		<description><![CDATA[I visited with SenSage on my two most recent trips to San Francisco. Both visits were, through no fault of SenSage&#8217;s, hasty.  Still, I think I have enough of a handle on SenSage basics to be worth writing up.
General SenSage highlights include:


SenSage used to be known as 	Addamark.
SenSage used to characterize 	itself as being [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">I visited with SenSage on my two most recent trips to San Francisco. Both visits were, through no fault of SenSage&#8217;s, hasty.  Still, I think I have enough of a handle on SenSage basics to be worth writing up.</p>
<p style="margin-bottom: 0in;">General SenSage highlights include:</p>
<p><span id="more-1115"></span></p>
<ul>
<li>SenSage used to be known as 	Addamark.</li>
<li>SenSage used to characterize 	itself as being in the Security Information Management (SIM) market.</li>
<li>Now SenSage characterizes itself 	(approximately) as selling technology built around a columnar DBMS 	that happens to be pretty good at log analysis, compliance, and/or 	archiving.</li>
<li>More concisely, SenSage says it is 	in the <a href="http://sensage.com/company/index.php" onclick="javascript:pageTracker._trackPageview('/sensage.com');">event data 	warehouse</a> category.  (The same could arguably be said of 	<a href="http://www.dbms2.com/?p=1119" >Splunk</a>.)</li>
<li>SenSage says it has &gt;400 paying 	customers, of which ~200 are direct.</li>
<li>SenSage has &gt;120 employees and, 	like Splunk, is profitable.</li>
<li>SenSage has enjoyed &gt;50% annual 	revenue growth the past four years.</li>
<li>Some SenSage deals are in the 	multiple-million dollar range.</li>
<li>A major SenSage channel partner – 	dozens of installations &#8212; is SAP, which resells SenSage software on 	HP hardware is a “Compliance Log Warehouse.”</li>
<li>A hot market for SenSage is CDRs 	(Call Detail Records).</li>
<li>SenSage says that, among analytic 	DBMS vendors, it competes with Oracle, IBM, Teradata, Netezza and, 	to some extent, Vertica and Greenplum.</li>
</ul>
<p>Technical SenSage highlights include:</p>
<ul>
<li>SenSage&#8217;s core technology is an 	append-only columnar DBMS, with no master node.</li>
<li>SenSage&#8217;s DBMS uses no indexes and 	requires “no” database administration.</li>
<li>SenSage&#8217;s database is 	range-partitioned, with the range-partition key always being time.</li>
<li>SenSage has something it calls SQO 	(Sparse Query Optimization), which sounds a lot like Netezza zone 	maps. SQO never yields a false negative on whether data is in a 	block, never yields a false positive on equality predicates, and 	only rarely yields a false positive on range predicates.</li>
<li>SenSage&#8217;s database uses large 	block sizes – typically 250,000 records/block, at 200-250 bytes 	per record.  (That&#8217;s in the range of 64 megabytes/block.)</li>
<li>SenSage says its software can load 	10-50,000 records/second/node. If I&#8217;m doing the arithmetic 	correctly, that&#8217;s roughly 7-40 gigabytes/node/hour.</li>
<li>SenSage collects log data into its 	event data warehouse in what it characterizes as an agentless 	manner. Even so, it seems that for a majority of kinds of data 	sources one does have to write custom agents. The two other ways to 	get data into SenSage – and presumably most of the data volume 	comes through these – are:
<ul>
<li>File transfer in the usual way</li>
<li>syslog</li>
</ul>
</li>
<li>SenSage says its software can read 	100s of data sources, and that this is a huge competitive advantage. 	I&#8217;m not totally sure how that jibes with the prior point.</li>
<li>SenSage says it gets 5X 	compression on CDR data, 10-20X on other kinds of logs. That&#8217;s not 	too far off from <a href="../2008/09/24/vertica-finally-spells-out-its-compression-claims/">Vertica&#8217;s 	compression figures</a>.</li>
<li>SenSage says that it has 	datatype-aware compression as well as more standard stuff, with 	VARCHAR compressing particularly well.</li>
<li>In particular, SenSage uses both 	dictionary/token and delta compression.</li>
<li>SenSage&#8217;s software is pretty 	agnostic with respect to storage kind – DAS (Direct Attached 	Storage), SAN (Storage-Area Network), or content-addressable. In 	particular, there&#8217;s only about a 4% performance hit for using 	content-addressable storage.</li>
<li>When using WORM (Write Once Read 	Many) storage like EMC&#8217;s Centera, SenSage leaves record locator 	information behind on ordinary storage and otherwise queries the 	WORM storage just like it queries anything else.</li>
<li>SenSage says it has been using 	MapReduce since “Day 1”.</li>
<li>Probably not coincidentally, you 	can use Perl and other aggregates in SenSage SQL statements.</li>
<li>Perhaps also not coincidentally, 	SenSage says it has a number of advanced built-in analytic 	functions, including some focused on sessionization.</li>
</ul>
<p style="margin-bottom: 0in;">In addition to all that, SenSage offers a built-in event processing engine, consisting of:</p>
<ul>
<li>A finite-state machine correlation 	engine.</li>
<li>A proprietary event processing 	language.</li>
<li>A GUI to “abstract” (i.e., 	generate?) the event processing language.</li>
</ul>
<p style="margin-bottom: 0in;">The SenSage event processing engine is used to generate alerts. Data that comes into SenSage actually is passed to two places at once, namely to both the event processing engine and the database itself.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/18/introduction-to-sensage/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>
