<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS2 -- DataBase Management System Services &#187; Application areas</title>
	<atom:link href="http://www.dbms2.com/category/applications/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Thu, 18 Mar 2010 05:19:19 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Aster Data nCluster 4.5</title>
		<link>http://www.dbms2.com/2010/02/22/aster-data-ncluster-4-5/</link>
		<comments>http://www.dbms2.com/2010/02/22/aster-data-ncluster-4-5/#comments</comments>
		<pubDate>Mon, 22 Feb 2010 08:20:13 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[SAS Institute]]></category>
		<category><![CDATA[Teradata]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1617</guid>
		<description><![CDATA[Like Vertica, Netezza, and Teradata, Aster is using this week to pre-announce a forthcoming product release, Aster Data nCluster 4.5. Aster is really hanging its identity on “Big Data Analytics” or some variant of that concept, and so the two major named parts of Aster nCluster 4.5 are:

Aster Data Analytic Foundation, a set of analytic [...]]]></description>
			<content:encoded><![CDATA[<p>Like <a href="http://www.dbms2.com/2010/02/22/vertica-4/" >Vertica</a>, <a href="http://www.dbms2.com/2010/02/22/netezza-twinfin/" >Netezza</a>, and Teradata, Aster is using this week to pre-announce a forthcoming product release, Aster Data nCluster 4.5. Aster is really hanging its identity on “Big Data Analytics” or some variant of that concept, and so the two major named parts of Aster nCluster 4.5 are:</p>
<ul>
<li><strong>Aster Data Analytic Foundation,</strong> a set of analytic packages prebuilt in <a href="../2009/06/09/aster-data-nclustersql-mapreduce/">Aster&#8217;s SQL-MapReduce</a></li>
<li><strong>Aster Data Developer Express,</strong> an Eclipse-based IDE (Integrated Development Environment) for developing and testing applications built on Aster nCluster, Aster SQL-MapReduce, and Aster Data Analytic Foundation</li>
</ul>
<p>And in other Aster news:</p>
<ul>
<li>Along with the development GUI in Aster nCluster 4.5, there is also a new administrative GUI.</li>
<li>Aster has certified that nCluster works with Fusion I/O boards, because at least one retail industry prospect cares. However, that in no way means that arm&#8217;s-length Fusion I/O certification is Aster&#8217;s ultimate <a href="../2010/01/31/flash-pcmsolid-state-memory-disk/">solid-state memory</a> strategy.</li>
<li>I had the wrong impression about how far Aster/SAS integration has gotten. So far, it&#8217;s just at the connector level.</li>
</ul>
<p>Aster Data Developer Express evidently does some cool stuff, like providing some sort of parallelism testing right on your desktop. It also generates lots of stub code, saving humans from the tedium of doing that. Useful, obviously.</p>
<p>But mainly, I want to write about the analytic packages.<span id="more-1617"></span> I&#8217;m not convinced that they&#8217;re a big deal in themselves yet, or that a whole lot of person-months have gone into their combined development. Still, I think they provide a great indication of one direction in which analytic functionality is going. And by the way, Aster promises to release a lot more of that kind of thing over the next 12 months.</p>
<p>Aster&#8217;s flagship analytic package is <a href="../2009/02/10/aster-data-npath/">nPath</a>, which is like a <strong>regular expression matcher,</strong> but <strong>for (time) series of data</strong> rather than for character strings. The main use for nPath is in pulling specific kinds of event sequences out of web or network event logs. However, one could imagine uses in other sectors that focus on temporal or sequential data (e.g., trading, intelligence, other sensor analysis), should existing SQL- and/or CEP-based technologies not prove sufficiently flexible. Aster 4.5 adds some new aggregation capabilities around nPath.</p>
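<p>To make the &#8220;regular expressions over event sequences&#8221; idea concrete, here is a minimal Python sketch. It is not nPath syntax and not how Aster implements it; the event names, the single-character symbol mapping, and the <code>users_matching</code> helper are all assumptions made up for illustration.</p>

```python
import re
from itertools import groupby
from operator import itemgetter

# Hypothetical clickstream rows: (user_id, timestamp, event)
EVENTS = [
    ("u1", 1, "home"), ("u1", 2, "search"), ("u1", 3, "search"), ("u1", 4, "checkout"),
    ("u2", 1, "home"), ("u2", 2, "checkout"),
    ("u3", 1, "search"), ("u3", 2, "home"),
]

# Assumption of this sketch: each event type maps to one symbol.
SYMBOLS = {"home": "H", "search": "S", "checkout": "C"}


def users_matching(events, pattern):
    """Return users whose time-ordered event sequence contains the pattern."""
    regex = re.compile(pattern)
    matched = []
    # Partition by user and order by time, then match each sequence.
    ordered = sorted(events, key=itemgetter(0, 1))
    for user, rows in groupby(ordered, key=itemgetter(0)):
        sequence = "".join(SYMBOLS[event] for _, _, event in rows)
        if regex.search(sequence):
            matched.append(user)
    return matched


# One or more searches followed directly by a checkout.
print(users_matching(EVENTS, "S+C"))
```

<p>The partition-then-order step is what turns rows into a sequence; the pattern itself is then an ordinary regular expression over symbols.</p>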
<p>Other not-wholly-new packages in the Aster Data Analytic Foundation announcement are for <strong>sessionization</strong> (of clickstream data and the like) and <strong>tokenization </strong>(of text/character string data). While sessionization can be done in SQL, Aster thinks its MapReduce-based version is faster, since it doesn&#8217;t require self-joins. Makes sense. Aster&#8217;s tokenization sounds lame, however – text analytics in MapReduce tends to reinvent simplistic wheels for no clear reason, and Aster doesn&#8217;t seem to be an exception. (Aster would argue, however, that anything it does in SQL-MapReduce is more flexible than pure SQL or pure MapReduce alternatives.)</p>
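<p>As a rough illustration of why sessionization needs no self-join when you can write procedural code, here is a minimal single-pass Python sketch. The 30-minute timeout, the row layout, and the <code>sessionize</code> name are my assumptions, not anything from Aster&#8217;s package.</p>

```python
from collections import defaultdict


def sessionize(clicks, timeout=1800):
    """Assign a session number to each (user_id, timestamp) click.

    A new session starts when the gap since the same user's previous
    click exceeds `timeout` seconds. One sorted pass, no self-join.
    """
    last_seen = {}
    session_no = defaultdict(int)
    result = []
    for user, ts in sorted(clicks):
        previous = last_seen.get(user)
        if previous is None or ts - previous > timeout:
            session_no[user] += 1
        last_seen[user] = ts
        result.append((user, ts, session_no[user]))
    return result


clicks = [("a", 0), ("a", 600), ("a", 4000), ("b", 100), ("b", 200)]
for row in sessionize(clicks):
    print(row)
```

<p>Each click is compared only with the previous click from the same user, which is the comparison a SQL self-join would otherwise have to set up.</p>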
<p>Another example of better-living-without-self-joins is Aster&#8217;s new <strong>market basket</strong> package. This lets you look at a set of point-of-sale data, pick a small integer N, and pull out all the sets of N things that were bought by the same person at the same time. I haven&#8217;t probed the claim in detail, but Aster implies there&#8217;s less combinatorial explosion in its approach than there is in the self-join alternative.</p>
<p><em>Note: Gartner highlighted self joins as a performance challenge in its recent </em><a href="../2010/02/10/gartner-magic-quadrant-data-warehouse-2009-2010/">Data Warehouse Magic Quadrant</a><em>.</em></p>
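<p>The market basket computation described above can be sketched in a few lines of Python. This is only an illustration of the general technique (group line items by basket, then enumerate N-item subsets within each basket); the <code>basket_itemsets</code> function and the sample rows are hypothetical, and I don&#8217;t know how Aster&#8217;s version works internally.</p>

```python
from collections import Counter
from itertools import combinations


def basket_itemsets(line_items, n):
    """Count how often each set of n distinct items shares a basket.

    line_items: iterable of (basket_id, item) rows, e.g. point-of-sale lines.
    """
    baskets = {}
    for basket_id, item in line_items:
        baskets.setdefault(basket_id, set()).add(item)
    counts = Counter()
    for items in baskets.values():
        # Enumerate n-item subsets within one basket only.
        for combo in combinations(sorted(items), n):
            counts[combo] += 1
    return counts


rows = [
    (1, "bread"), (1, "milk"), (1, "eggs"),
    (2, "bread"), (2, "milk"),
    (3, "milk"), (3, "eggs"),
]
print(basket_itemsets(rows, 2).most_common())
```

<p>Because subsets are generated per basket, the work grows with basket size rather than with the N-way join of the whole line-item table against itself.</p>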
<p>Aster is also releasing a few <strong>statistical and general analytic functions</strong> &#8212; specifically (and I quote a slide):</p>
<ul>
<li>exponential moving average</li>
<li>weighted moving average</li>
<li>simple moving average</li>
<li>volume-weighted average price</li>
<li>correlation</li>
<li>linear regression</li>
<li>logistic regression</li>
<li>approximate_percentile</li>
<li>approximate_count_distinct</li>
</ul>
<p>The point of the last two items on the list is that if you set a non-zero tolerance for error, you can count things or order them into bins very efficiently – especially in terms of RAM &#8212; while being guaranteed not to exceed your error tolerance.</p>
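<p>For readers who don&#8217;t live in financial services, the first four functions on the list are simple to state. The Python sketch below shows textbook definitions only; the linear weighting scheme, the choice to seed the exponential average with the first value, and all function names are my assumptions rather than Aster&#8217;s definitions.</p>

```python
def simple_moving_average(values, window):
    """Mean of each trailing `window` values."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]


def weighted_moving_average(values, window):
    """Trailing mean with linear weights 1..window (newest weighted most)."""
    weights = range(1, window + 1)
    total = sum(weights)
    return [sum(w * v for w, v in zip(weights, values[i - window + 1:i + 1])) / total
            for i in range(window - 1, len(values))]


def exponential_moving_average(values, alpha):
    """EMA seeded with the first value; alpha is the smoothing factor."""
    ema = [values[0]]
    for v in values[1:]:
        ema.append(alpha * v + (1 - alpha) * ema[-1])
    return ema


def vwap(trades):
    """Volume-weighted average price over (price, volume) pairs."""
    volume = sum(vol for _, vol in trades)
    return sum(price * vol for price, vol in trades) / volume


prices = [1, 2, 3, 4, 5]
print(simple_moving_average(prices, 3))
print(exponential_moving_average(prices, 0.5))
print(vwap([(10, 100), (20, 300)]))
```

<p>All four are order-dependent aggregates over a series, which is why they sit naturally next to the sequence-oriented packages described earlier.</p>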
<p><em>Note: One obvious inference from this list &#8212; which Aster gladly confirms &#8212; is that Aster has high hopes of selling to the financial services industry. </em></p>
<p>Finally, Aster is releasing its first pure <strong>graph-analytic</strong> function, for finding the shortest path between a given pair of nodes.</p>
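<p>For reference, the simplest form of the shortest-path problem (unweighted edges, one source, one target) can be solved with breadth-first search. The Python sketch below is a generic textbook version, assuming an adjacency-list dictionary; I have no details on which algorithm Aster&#8217;s function uses or whether it supports edge weights.</p>

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search on an unweighted graph given as an adjacency dict.

    Returns the list of nodes on a shortest path, or None if unreachable.
    """
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk parent links back to the start, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
print(shortest_path(graph, "A", "E"))
```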
<p>While I had the Aster folks on the phone anyway, I also took the opportunity to ask about the Aster nCluster 4.0 capability to create fairly persistent non-relational in-memory data structures. Specifically, I asked whether different users could access the same in-memory structure, and was told that this is a little klugey but not too horrendous. That suggests Aster&#8217;s capability may be a strict superset of UDF-based (User-Defined Function) approaches to meeting the same need, at least from a functionality standpoint. However, ease of creating those in-memory structures may still be better in the more SQL/UDF-centric approach favored by Teradata.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/02/22/aster-data-ncluster-4-5/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Quick thoughts on the StreamBase Component Exchange</title>
		<link>http://www.dbms2.com/2010/02/16/quick-thoughts-on-the-streambase-component-exchange/</link>
		<comments>http://www.dbms2.com/2010/02/16/quick-thoughts-on-the-streambase-component-exchange/#comments</comments>
		<pubDate>Tue, 16 Feb 2010 15:04:22 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Complex event processing (CEP)]]></category>
		<category><![CDATA[Games and virtual worlds]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[StreamBase]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1591</guid>
		<description><![CDATA[Streambase is announcing something called the StreamBase Component Exchange, for developers to exchange components to be used with the StreamBase engine, presumably on an open source basis. I simultaneously think:

This is a good idea, and many software vendors should do it if they aren&#8217;t already.
It&#8217;s no big deal.

For reasons why, let me quote an email [...]]]></description>
			<content:encoded><![CDATA[<p>Streambase is announcing something called the <a href="http://streambase.com/b6409b0d-7d1f-4cf8-99b9-98b2b1858628/press-release-detail.htm" onclick="javascript:pageTracker._trackPageview('/streambase.com');">StreamBase Component Exchange</a>, for developers to exchange components to be used with the StreamBase engine, presumably on an open source basis. I simultaneously think:</p>
<ul>
<li>This is a good idea, and many software vendors should do it if they aren&#8217;t already.</li>
<li>It&#8217;s no big deal.</li>
</ul>
<p>For reasons why, let me quote an email I just sent to an inquiring reporter:</p>
<ul>
<li>StreamBase sells mainly to the financial services and intelligence community markets. Neither group will share much in the way of core algorithms.</li>
<li>But both groups are <a href="http://www.dbms2.com/2009/01/27/introduction-to-pentaho/" >pretty interested in open source software</a> even so. (I think for both the price and customizability benefits.)</li>
<li>Open source software commonly gets community contributions for connectors, adapters, and (national) language translations.</li>
<li>But useful contributions in other areas are much rarer.</li>
<li>Linden Lab is one of StreamBase&#8217;s <a href="http://www.dbms2.com/2009/03/09/independent-cep-vendors-continue-to-flounder/" >few significant customers outside its two core markets</a>.</li>
<li>All of the above are consistent with the press release (which quotes only one StreamBase customer &#8212; guess who?).</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/02/16/quick-thoughts-on-the-streambase-component-exchange/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>The Sybase Aleri RAP</title>
		<link>http://www.dbms2.com/2010/02/05/sybase-aleri-rap/</link>
		<comments>http://www.dbms2.com/2010/02/05/sybase-aleri-rap/#comments</comments>
		<pubDate>Sat, 06 Feb 2010 00:05:11 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Aleri and Coral8]]></category>
		<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Complex event processing (CEP)]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[In-memory DBMS]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Market share]]></category>
		<category><![CDATA[Memory-centric data management]]></category>
		<category><![CDATA[Sybase]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1545</guid>
		<description><![CDATA[Well, I got a quick Sybase/Aleri briefing, along with multiple apologies for not being prebriefed. (Main excuse: News was getting out, which accelerated the announcement.) Nothing badly contradicted my prior post on the Sybase/Aleri deal.
To understand Sybase&#8217;s plans for Aleri and CEP, it helps to understand Sybase&#8217;s current CEP-oriented offering, Sybase RAP. So far as [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Well, I got a quick Sybase/Aleri briefing, along with multiple apologies for not being prebriefed. <em>(Main excuse: News was getting out, which accelerated the announcement.)</em> Nothing badly contradicted my prior post on <a href="http://www.dbms2.com/2010/02/04/sybase-aleri-acquisitio/" >the Sybase/Aleri deal</a>.</p>
<p style="margin-bottom: 0in;">To understand Sybase&#8217;s plans for Aleri and CEP, it helps to understand Sybase&#8217;s current CEP-oriented offering, <strong>Sybase RAP.</strong> So far as I can tell, Sybase RAP has to date only been sold in the form of <strong>Sybase RAP: The Trading Edition.</strong> In that guise, Sybase RAP has been sold to &gt;40 outfits since its May 2008 launch, mainly big names in the investment banking and stock exchange sectors. If I understood correctly, the next target market for Sybase RAP is telcos, for real-time network tuning and management.</p>
<p style="margin-bottom: 0in;">In addition to any domain-specific applications, Sybase RAP has three layers:</p>
<ul>
<li><strong>CEP (Complex Event Processing).</strong> Sybase RAP CEP is based on a version of the Coral8 engine that Sybase licensed and has subsequently been developing.</li>
<li><strong>In-memory DBMS.</strong> Sybase&#8217;s IMDB is part of (but I guess separable from) Sybase&#8217;s OLTP DBMS Adaptive Server Enterprise (ASE, aka Sybase Classic), and has the same API.</li>
<li><strong>Sybase IQ.</strong> Actually, Sybase used the phrase “based on Sybase IQ,” but I&#8217;m guessing it&#8217;s just Sybase IQ.</li>
</ul>
<p style="margin-bottom: 0in;"><span id="more-1545"></span>In theory, there could be a DBMS other than Sybase IQ, such as Sybase ASE or even Oracle, because Sybase IMDB can talk to a variety of DBMS. I didn&#8217;t get the impression, however, that in practice there were any Sybase RAP installations whose persistent DBMS was anything other than Sybase IQ.</p>
<p style="margin-bottom: 0in;">Aleri had long had something called Project Ohio, to merge Coral8 with Aleri Classic. Now that Sybase&#8217;s own CEP engineering team is being added to the mix, schedules are being reconsidered and haven&#8217;t been disclosed yet. <em>(If one woman can produce one baby in nine months, how long does it take nine women to produce a baby?) </em>Apparently Sybase has a dozen programmers in the CEP area, plus ~20 more on Sybase RAP, not counting QA, documentation, etc.; that represents a significant bump to the overall Aleri development team.</p>
<p style="margin-bottom: 0in;">Sybase doesn&#8217;t seem to have decided what to do yet with the various <a href="../2008/10/20/coral8-proposes-cep-as-a-bi-data-platform/">business intelligence</a>/real-time OLAP engine products and technologies it is inheriting from Aleri.</p>
<p style="margin-bottom: 0in;">And finally, some metrics:</p>
<ul>
<li>The Sybase/Aleri guys estimate that 1/3 of Aleri&#8217;s customers and even less of its revenue came from outside the financial services sector. They did say the non-financial-services business was “starting to pick up,” but not very convincingly.</li>
<li>Sybase IQ is now up to &gt;1800 customers, with &gt;200 new ones in 2009.</li>
<li>Sybase IQ indeed has users taking in market feeds up to 3 terabytes a day, so it probably matches Vertica in having at least several-hundred-terabyte databases in the financial sector.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/02/05/sybase-aleri-rap/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Quick thoughts on Sybase/Aleri</title>
		<link>http://www.dbms2.com/2010/02/04/sybase-aleri-acquisitio/</link>
		<comments>http://www.dbms2.com/2010/02/04/sybase-aleri-acquisitio/#comments</comments>
		<pubDate>Thu, 04 Feb 2010 16:15:19 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Aleri and Coral8]]></category>
		<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Complex event processing (CEP)]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Sybase]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1530</guid>
		<description><![CDATA[Sybase announced an asset purchase that amounts to a takeover of CEP (Complex Event Processing) vendor Aleri. Perhaps not coincidentally, Sybase already had technology under the hood from Aleri predecessor/acquiree Coral8, for financial services uses (notwithstanding that, of Aleri Classic and Coral8, Aleri Classic was the more focused on financial services). Quick [...]]]></description>
			<content:encoded><![CDATA[<p>Sybase announced an asset purchase that amounts to a takeover of CEP (Complex Event Processing) vendor <a href="http://www.dbms2.com/2009/03/25/aleri-update/" >Aleri</a>. Perhaps not coincidentally, <a href="http://magmasystems.blogspot.com/2009/03/sybase-and-coral8.html" onclick="javascript:pageTracker._trackPageview('/magmasystems.blogspot.com');">Sybase already had technology under the hood from Aleri predecessor/acquiree Coral8</a>, for financial services uses (notwithstanding that, of Aleri Classic and Coral8, Aleri Classic was the more focused on financial services). Quick reactions include:</p>
<ul>
<li>The folks at Sybase still haven&#8217;t figured out when to prebrief me. <em>(Edit: I&#8217;ve been <a href="http://www.dbms2.com/2010/02/05/sybase-aleri-rap/" >briefed</a> subsequently.)</em></li>
<li>Sybase/Aleri is a potentially powerful combination, if they can effectively address the point I just made about <a href="http://www.dbms2.com/2010/02/01/open-issues-in-database-and-analytic-technology/" >integrating disparate latencies</a>. That said, I&#8217;m not expecting a lot, because <a href="http://www.dbms2.com/2009/03/09/independent-cep-vendors-continue-to-flounder/" >the CEP industry always disappoints me</a>.</li>
<li><a href="http://www.dbms2.com/2009/05/13/microsoft-announced-cep-this-week-too/" >Microsoft</a>, <a href="http://www.dbms2.com/2009/05/13/ibm-system-s-infosphere-streams-processing/" >IBM</a>, and (somewhat less clearly) <a href="http://www.dbms2.com/2008/01/16/oracle-bea/" >Oracle</a> are all trying to do CEP inhouse. Sybase is making a good choice in having serious CEP inhouse itself.</li>
<li>Surely the main focus and financial justification for the Sybase/Aleri acquisition is the financial services market.</li>
<li>Specifically, I expect the focus of technical integration between Aleri and Sybase&#8217;s DBMS products to start with Sybase IQ.</li>
<li>Coral8 had <a href="http://www.dbms2.com/2008/10/20/coral8-proposes-cep-as-a-bi-data-platform/" >some interesting ideas about how to integrate CEP with OLTP/operational BI</a>, but I&#8217;m not aware that they got much traction.</li>
<li>I bet there are use cases where Sybase tries and fails to sell <span style="text-decoration: line-through;">Adaptive Server</span> SQL Anywhere for which CEP would be a better technical fit, but I don&#8217;t immediately see much practical business significance to that observation.</li>
<li>While this deal could easily strengthen the Vertica/StreamBase partnership, I don&#8217;t see any reason why it would lead those two companies to actually merge.</li>
</ul>
<p><em><strong>Related link</strong></em></p>
<ul>
<li><a href="http://www.dbms2.com/2009/09/10/analytic-speed-latency/" >Thinking about analytic latency</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/02/04/sybase-aleri-acquisitio/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Three broad categories of data</title>
		<link>http://www.dbms2.com/2010/01/17/three-broad-categories-of-data/</link>
		<comments>http://www.dbms2.com/2010/01/17/three-broad-categories-of-data/#comments</comments>
		<pubDate>Sun, 17 Jan 2010 15:31:24 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1421</guid>
		<description><![CDATA[People often try to draw a distinction between:

Traditional data of the sort 	that&#8217;s stored in relational databases, aka “structured.”
Everything else, aka 	“unstructured” or “semi-structured” or “complex.”

There are plenty of problems with these formulations, not the least of which is that the supposedly “unstructured” data is the kind that actually tends to have interesting internal structures. [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">People often try to draw a distinction between:</p>
<ul>
<li>Traditional data of the sort 	that&#8217;s stored in relational databases, aka “structured.”</li>
<li>Everything else, aka 	“unstructured” or “semi-structured” or “complex.”</li>
</ul>
<p style="margin-bottom: 0in;">There are plenty of problems with these formulations, not the least of which is that the supposedly “unstructured” data is the kind that actually tends to have interesting internal structures. But of the many reasons why these distinctions don&#8217;t tend to work very well, I think the most important one is that:</p>
<p><strong>Databases shouldn&#8217;t be divided into just two categories. </strong><span style="font-weight: normal;"> Even as a rough-cut approximation, </span><strong>they should be divided into three,</strong><span style="font-weight: normal;"> namely:</span></p>
<ul>
<li><strong>Human/Tabular</strong> data &#8212; i.e., human-generated data that fits well into relational tables or arrays</li>
<li><strong>Human/Nontabular</strong> data &#8212; i.e., all other data generated by humans</li>
<li><strong>Machine-Generated</strong> data</li>
</ul>
<p style="margin-bottom: 0in;">Even that trichotomy is grossly oversimplified, for reasons such as:</p>
<ul>
<li>These categories overlap.</li>
<li>There are kinds of data that get 	into fuzzy border zones.</li>
<li>Not all data in each category has 	all the same properties.</li>
</ul>
<p style="margin-bottom: 0in;">But at least as a starting point, I think this basic categorization has some value.<span id="more-1421"></span></p>
<p style="margin-bottom: 0in;">By <strong>human-generated data that fits well into relational tables or arrays,</strong> what I really mean is: <strong>the input from most conventional kinds of transactions</strong> – purchase/sale, inventory/manufacturing, employment status change, etc. This is the core data managed by OLTP relational DBMS everywhere. It is also the core data in analytic relational or MOLAP databases. The vast majority of what we think or know about “database management” applies primarily to data of this kind, in large part because of two fundamental properties of this information:</p>
<ul>
<li>It is meaningful to contemplate 	this data as being 100% accurate and complete (even if that goal is 	difficult to achieve in the real world).</li>
<li>This data is precise – i.e., one 	can check predicates against it and (give or take regrettable data 	imperfections) get inarguable yes/no answers.</li>
</ul>
<p style="margin-bottom: 0in;">For most enterprises, this is the most important data they have. It was created as a result of expensive business activities. It deals directly with money, employees, physical goods, and the rest of the things that make an enterprise go. It can be fruitfully analyzed in ever more ways, which is why it should never be thrown out or even entirely relegated to tape, now that data warehouse software, hardware, and storage have become so cheap. (“Disk is the new tape.”) And because of the importance of both preserving and accessing it, it should often be stored in multiple copies – OLTP, data warehouse, data mart, in-memory analytics, near-line quasi-archive, MOLAP cubes (if you must) and so on, plus of course replicas for high throughput and availability.</p>
<p style="margin-bottom: 0in;">But <strong>humans generate many other kinds of data as well,</strong> especially in a form directly suitable for <strong>communication</strong> – text (in many formats), documents (text or otherwise), pictures, videos, etc. <a href="../2005/12/09/relational-dbms-versus-text-data/">Traditional relational databases are a poor home for this kind of data</a> because:</p>
<ul>
<li>This data often deals with 	opinions or aesthetic judgments – there is little concept of 	perfect accuracy.</li>
<li>Similarly, there is little concept 	of perfect completeness.</li>
<li>There&#8217;s also little concept of perfectly, unarguably accurate query results – different people will have different opinions as to what constitutes good results for a search.</li>
<li>Queries don&#8217;t lend themselves to 	binary answers; rather, documents can have differing degrees of 	relevancy.</li>
</ul>
<p style="margin-bottom: 0in;">Systems for managing this kind of data are much less advanced than relational database managers. Nobody knows how to get all the information out of a text document, or query all of it if they could, and the story is even worse for non-text examples. The systems that give the best query results aren&#8217;t necessarily the same ones that have the best database administration features. Basically, this area is still a mess, and it&#8217;s a mess that consumes a huge fraction of all the data storage products sold today.</p>
<p style="margin-bottom: 0in;">But give or take questions of storage efficiency and deduplication, if humans created that kind of data, they put a lot of effort into it, so it&#8217;s worth keeping. Besides, compliance regulations commonly mandate that we do so – except, perhaps, when they mandate that we throw it away.</p>
<p style="margin-bottom: 0in;"><strong>Machine-generated data</strong> is a whole other can of worms. Paradigmatic examples of what I mean by “machine-generated data” include:</p>
<ul>
<li>Computer, network, and other 	equipment logs</li>
<li>Satellite and similar telemetry 	(whether for espionage or science)</li>
<li>Location data such as RFID chip 	readings, GPS system output, etc.</li>
<li>Temperature and other 	environmental sensor readings</li>
<li>Sensor readings from factories, 	pipelines, etc.</li>
<li>Output from many kinds of medical devices, in hospitals and (increasingly) homes alike</li>
</ul>
<p style="margin-bottom: 0in;">Unlike human-generated data, whose growth is constrained by macro factors such as population and total level of economic activity, <strong>machine-generated data will continue to grow as fast as Moore&#8217;s Law lets it. </strong><span style="font-weight: normal;">That fact has two profound consequences:</span></p>
<ul>
<li><strong>It is unrealistic to hope ever 	to keep most or all machine-generated data,</strong><span style="font-weight: normal;"> whereas I think that&#8217;s exactly what should and will happen with human-generated data</span></li>
<li><span style="font-weight: normal;">Before 	long, </span><strong>most data (by volume) will be machine-generated</strong></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-weight: normal;">And so it is not really an exaggeration to say that <strong>machine-generated data is the future of data management.</strong></span></p>
<p style="margin-bottom: 0in;"><span style="font-weight: normal;">I&#8217;d like to close this long post by immediately pointing out some of the flaws in this simple trichotomy. One obvious gray area lies in<strong> hybrid human/machine-generated data,</strong> three big examples of which are:</span></p>
<ul>
<li><span style="font-weight: normal;">Web 	clickstreams</span></li>
<li><span style="font-weight: normal;">Call 	detail records (CDR)</span></li>
<li><span style="font-weight: normal;">Stock 	trades</span></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-weight: normal;">In all three cases, we are quickly getting to the point where this data is preserved in its entirety (even if the network event data associated with the web logs is reduced before storage). And in each case it fits pretty well into RDBMS, although Hadoop has a role to play as well. So pretending it&#8217;s purely human-generated probably isn&#8217;t all that misleading.</span></p>
<p style="margin-bottom: 0in;"><span style="font-weight: normal;">Another gray area lies in text that gets linguistically processed – i.e. via <a href="http://www.texttechnologies.com/2007/12/23/text-mining-myths-realities/" onclick="javascript:pageTracker._trackPageview('/www.texttechnologies.com');">text-mining</a> tools – with the output placed into a relational database. I don&#8217;t immediately see a workaround for that flaw in my labeling scheme.  So let&#8217;s just say no taxonomy is perfect.*</span></p>
<p style="margin-bottom: 0in;"><em><span style="font-weight: normal;">*Come to think of it, that&#8217;s one of the problems holding back text-mining technology. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </span></em></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;"><span style="font-weight: normal;">And of course some of the <a href="../2009/12/12/legit-nosql-key-value-store/">NoSQL</a> folks would note that I was oversimplifying when I tied my first category specifically to relational DBMS. So would the folks at <a href="../2010/01/15/intersystems-cache-highlights/">Intersystems</a>.</span></span></p>
<p style="margin-bottom: 0in; font-style: normal; font-weight: normal;">But the biggest oversimplification stems from this:</p>
<p style="margin-bottom: 0in;"><span style="font-weight: normal;">As Mike Stonebraker* and I argued a couple of years ago, I really <a href="../2008/04/10/my-own-data-management-software-taxonomy/">think that database management technologies should be divided into 10+ categories.</a> </span></p>
<p style="margin-bottom: 0in;"><em><span style="font-weight: normal;">*Note: The links to Stonebraker&#8217;s own posts will be broken until Vertica&#8217;s webmaster gets his/her act together. But you can find them under other URLs via web search.</span></em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/01/17/three-broad-categories-of-data/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>There sure seem to be a lot of inaccuracies on ParAccel&#8217;s website</title>
		<link>http://www.dbms2.com/2010/01/15/there-sure-seem-to-be-a-lot-of-inaccuracies-on-paraccels-website/</link>
		<comments>http://www.dbms2.com/2010/01/15/there-sure-seem-to-be-a-lot-of-inaccuracies-on-paraccels-website/#comments</comments>
		<pubDate>Fri, 15 Jan 2010 04:47:00 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Market share]]></category>
		<category><![CDATA[ParAccel]]></category>
		<category><![CDATA[Telecommunications]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1393</guid>
		<description><![CDATA[In what is actually an interesting post on database compression, ParAccel CTO Barry Zane threw in
Anyone who has met with us knows ParAccel shies away from hype.
But like many things ParAccel says, that is not true.
The latest whoppers came in the form of several customers ParAccel listed on its website who hadn&#8217;t actually bought ParAccel&#8217;s [...]]]></description>
			<content:encoded><![CDATA[<p>In what is actually an <a href="http://paraccel.com/data_warehouse_blog/?p=192" onclick="javascript:pageTracker._trackPageview('/paraccel.com');">interesting post on database compression</a>, ParAccel CTO Barry Zane threw in</p>
<blockquote><p>Anyone who has met with us knows ParAccel shies away from hype.</p></blockquote>
<p>But like many things ParAccel says, that is not true.</p>
<p>The latest whoppers came in the form of several customers ParAccel listed on its website who hadn&#8217;t actually bought ParAccel&#8217;s DBMS, nor even decided to do so. It is fairly common to claim a customer win, then retract the claim due to lack of permission to disclose. But that&#8217;s not what happened in these cases. Based on emails helpfully shared by a ParAccel competitor competing in some of those accounts, it seems clear that <strong>ParAccel actually posted fabricated claims of customer wins.</strong> <span id="more-1393"></span></p>
<p>Another thing that was both technically and substantively false was ParAccel&#8217;s claim to be <a href="http://www.dbms2.com/2009/09/30/facts-and-rumors/" >CERTIFIED price-performance leader</a>. Obviously, this was meant to give the impression that ParAccel had been &#8220;certified&#8221; as the leader in price/performance, when the closest thing to that that was remotely true was that ParAccel had a leading position in the category of &#8220;price/performance measurements that happen to have a certification process.&#8221; At least, that was true for a short time; then ParAccel&#8217;s certification was found to have been erroneous, and got revoked, which did not however inspire ParAccel to immediately take the claim off the front page of its website.</p>
<p>ParAccel&#8217;s website also reflects a lot of praise from flagship customer LatiNode. What it perhaps understandably neglects to mention is that LatiNode is in a <a href="http://www.pepperlaw.com/publications_update.aspx?ArticleKey=1651" onclick="javascript:pageTracker._trackPageview('/www.pepperlaw.com');">dormant state</a>, placed there by acquirer Elandia due to LatiNode&#8217;s criminally corrupt customer acquisition practices.</p>
<p>I also don&#8217;t believe ParAccel&#8217;s endlessly-repeated claim that it has never lost a benchmark on performance. However, I must in fairness note that while I&#8217;ve been given names of customers who are supposed counterexamples to this claim by somebody I trust, I&#8217;ve never been able to actually verify those supposed ParAccel losses.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/01/15/there-sure-seem-to-be-a-lot-of-inaccuracies-on-paraccels-website/feed/</wfw:commentRss>
		<slash:comments>22</slash:comments>
		</item>
		<item>
		<title>Notes on RainStor, the company formerly known as Clearpace</title>
		<link>http://www.dbms2.com/2009/12/11/rainstor-clearpace/</link>
		<comments>http://www.dbms2.com/2009/12/11/rainstor-clearpace/#comments</comments>
		<pubDate>Sat, 12 Dec 2009 00:15:02 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Archiving and information preservation]]></category>
		<category><![CDATA[Clearpace]]></category>
		<category><![CDATA[Market share]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[SenSage]]></category>
		<category><![CDATA[Telecommunications]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1295</guid>
		<description><![CDATA[Information preservation* DBMS vendor Clearpace officially changed its name to RainStor this week. RainStor is also relocating its CEO John Bantleman and more generally its headquarters to San Francisco. This all led to a visit with John and his colleague Ramon Chen, highlights of which included:

RainStor expects to finish the 	year with &#62; 50 users [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;"><a href="http://www.dbms2.com/2008/12/16/database-archiving-and-information-preservation/" >I</a><a href="http://www.dbms2.com/2008/12/16/database-archiving-and-information-preservation/" >nformation preservation</a>* DBMS vendor Clearpace officially changed its name to RainStor this week. RainStor is also relocating its CEO John Bantleman and more generally its headquarters to San Francisco. This all led to a visit with John and his colleague Ramon Chen, highlights of which included:<span id="more-1295"></span><!--more--></p>
<ul>
<li>RainStor expects to finish the 	year with &gt; 50 users (overwhelmingly via partners)</li>
<li>A big market for RainStor (at 	least in terms of signed partnerships and large deal activity) is 	retention of telecom records, for compliance purposes, typically for 	a 1-3 year period. This includes:
<ul>
<li>CDRs (Call Detail Records)</li>
<li>Mobile phone records including 	CDRs and missed calls</li>
<li>SMS (Short Message Service), 	including the complete text of same</li>
</ul>
</li>
<li>RainStor thinks a number of larger 	telcos have the need to store a billion records per day each. (I&#8217;m 	not sure how many subscribers such a telco would have to have).</li>
<li>John further thinks that, for the 	same query performance, RainStor can handle such a database on 4 	blades. More precisely, he says that&#8217;s what happened at a test 	conducted by a major technology firm. In the same test case, SenSage 	required 40 blades, and Oracle required 80 or more cores on a pair 	of big SMP machines.  John further says that the Oracle solution 	required a new table and new tablespace every day, while RainStor&#8217;s 	took 3 days for initial installation and required no DBA afterwards. 	However, I&#8217;m in no position to verify this report independently.</li>
<li>In a different kind of proof 	point, so extreme it gives even the RainStor folks pause, a user has 	retired 300 different applications and put their databases onto a 	single 2-core box. (Presumably, this is via RainStor&#8217;s OEM 	relationship with Informatica.)</li>
<li>Coming Very Soon are some services tying RainStor&#8217;s DBMS to obvious-suspect SaaS offerings. The core positioning is “SaaS data escrow”, i.e., RainStor will help you ensure that, in a worst-case scenario, there&#8217;s a nice safe copy of your data you can get at. RainStor also encourages you to do basic reporting and BI against the RainStor copy of the data, if you choose.</li>
<li>The idea I&#8217;ve been pushing lately 	of taking a heterogeneous replication offering like Continuent&#8217;s and 	having it feed an archiving store like RainStor&#8217;s has hit a rather 	basic snag. RainStor doesn&#8217;t actually consume change data capture 	kinds of information directly, at least as of yet, because of 	difficulties fitting such a stream into its 	guaranteed-data-immutability model.</li>
</ul>
<p><em>*I coined that category description for John in the tea room of the Park Lane Hotel. He&#8217;s subsequently embraced it enthusiastically, and I kind of like it myself. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </em></p>
<p style="margin-bottom: 0in;">
<p style="margin-bottom: 0in;"><em><strong>Related links</strong></em></p>
<ul>
<li>
<p style="margin-bottom: 0in;">RainStor&#8217;s approach to 	compression, as described by <a href="http://www.dbms2.com/2009/05/14/the-secret-sauce-to-clearpaces-compression/" >me</a> and by <a href="http://www.rainstor.com/news-blog/blog/rainstors-secret-sauce-data-and-pattern-deduplication" onclick="javascript:pageTracker._trackPageview('/www.rainstor.com');">RainStor itself</a></p>
</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/12/11/rainstor-clearpace/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A framework for thinking about data warehouse growth</title>
		<link>http://www.dbms2.com/2009/12/07/data-warehouse-volume-growth/</link>
		<comments>http://www.dbms2.com/2009/12/07/data-warehouse-volume-growth/#comments</comments>
		<pubDate>Mon, 07 Dec 2009 13:50:47 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Application areas]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Storage]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Text]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1278</guid>
		<description><![CDATA[There are only three ways that the amount of data stored in data warehouses can grow:

The same kinds of data are 	stored as before, with more being added over time.
The same kinds of data are stored 	as before, but in more detail.
New kinds of data are 	stored.

The first of those three ways doesn&#8217;t lead to [...]]]></description>
			<content:encoded><![CDATA[<p>There are only three ways that the amount of data stored in data warehouses can grow:</p>
<ul>
<li><strong>The same kinds of data are 	stored as before, </strong>with more being added over time.</li>
<li>The same kinds of data are stored 	as before, but in <strong>more detail.</strong></li>
<li><strong>New kinds</strong> of data are 	stored.</li>
</ul>
<p style="margin-bottom: 0in;"><span id="more-1278"></span>The first of those three ways doesn&#8217;t lead to dramatic growth. If a data warehouse goes up from 5 years of data to 6, then its overall size will grow a little over 20%. (How little depends on what the underlying business growth is – i.e., on how many more business events you have next year than you had 3 years ago.) That&#8217;s almost certainly going to be well handled by whatever technology manages your data warehouse today, given that:</p>
<ul>
<li>Chips are still subject to 	something resembling Moore&#8217;s Law.</li>
<li>Disk capacity is still subject to 	Kryder&#8217;s Law, which is like Moore&#8217;s Law but with yet faster growth 	rates.</li>
<li>DBMS software gets more performant 	over time.</li>
</ul>
<p style="margin-bottom: 0in;">So <strong>the cost of managing your same-as-before data will go down every year,</strong> even as the volume of that data grows.</p>
<p style="margin-bottom: 0in;">True, <a href="../2005/11/13/breaking-the-disk-speed-barrier/">disk rotation speeds have only increased 12.5 times since the Eisenhower Administration</a>. But <a href="../2009/10/25/teradata-hardware-strategy-and-tactics/">solid-state drives (SSDs) are getting practical for data warehousing</a> fast, so even that bottleneck eventually will get swept away. And since what we&#8217;re discussing is, basically, the first and hence presumably highest-value data to be warehoused, it&#8217;s apt to wind up on SSDs before some other kinds of data warrant that treatment.  So it&#8217;s the two other factors that drive the greatest data warehouse growth.</p>
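<p style="margin-bottom: 0in;">For the first of those three ways, the 5-to-6-year arithmetic can be sketched as follows. This is only an illustration under a simplifying assumption that is mine, not something measured: each year&#8217;s data volume is a constant multiple of the previous year&#8217;s. The function name and the sample growth rates are hypothetical.</p>

```python
def growth_from_extra_year(years_retained: int, annual_growth: float) -> float:
    """Fractional size increase from adding one more year of history.

    Assumption (illustrative only): each year's data volume is
    (1 + annual_growth) times the previous year's.
    """
    # Volume of each year already retained, oldest first,
    # expressed relative to the oldest year's volume.
    existing = [(1 + annual_growth) ** y for y in range(years_retained)]
    # Volume of the one additional year about to be added.
    new_year = (1 + annual_growth) ** years_retained
    return new_year / sum(existing)


if __name__ == "__main__":
    # Hypothetical business growth rates, chosen only for illustration.
    for rate in (0.0, 0.05, 0.10):
        pct = growth_from_extra_year(5, rate)
        print(f"{rate:.0%} business growth gives {pct:.1%} warehouse growth")
```

<p style="margin-bottom: 0in;">With flat business volume the result is exactly 20%; any positive business growth pushes it above 20%, because the newest year is bigger than the average of the years already stored.</p>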
<p style="margin-bottom: 0in;">As costs go down, the wisdom of keeping <strong>detailed data</strong> goes up. I&#8217;d go so far as to say that <strong>every piece of data generated by a human being should be preserved and kept online,</strong> legal and privacy considerations permitting.* Most forms of capital-, labor-, and/or location-based competitive advantage are being commoditized and/or globalized away. But information remains a unique corporate asset.  Don&#8217;t discard it lightly.</p>
<p style="margin-bottom: 0in;"><em>*Unless there&#8217;s an explicit law mandating data destruction, legal considerations </em>should <em>permit. The idea “Let&#8217;s destroy something of irreplaceable value today, against the possibility we might be brought to judgment tomorrow” is both morally and pragmatically weird. Privacy, however, may be a different matter.</em></p>
<p style="margin-bottom: 0in; font-style: normal;">What that means in practice is that “disk is the new tape.” No-apologies performance can be had on data warehouse systems for <a href="http://www.dbms2.com/2009/07/30/the-netezza-price-point/" >$20,000/terabyte</a> or less – perhaps even <a href="http://www.dbms2.com/2009/10/19/greenplum-free-single-node-edition/" >a lot less</a>. Tolerable performance may cost 3-4X less than that. I think a lot of the growth in data warehouse volumes is of exactly this kind.</p>
<p style="margin-bottom: 0in; font-style: normal;">Ultimately, however, the greatest growth in data warehouse volumes will come from <strong>new kinds of data,</strong> especially data that is partly or wholly <strong>machine-generated.</strong><span> Moore&#8217;s Law applied to sensor chips tells us that data creation will grow just as fast as the data storage capacity. And thus </span><strong>we will be throwing away most machine-generated data forever.</strong><span> But what we keep will grow – well, it probably will grow at Moore&#8217;s/Kryder&#8217;s Law speeds.</span></p>
<p style="margin-bottom: 0in; font-style: normal;"><span>That&#8217;s not to say new kinds of data are all high-volume/machine-generated. Back in 2005, I wrote </span><span><a href="http://www.computerworld.com/s/article/103054/More_Data_Makes_Your_Business_Grow?taxonomyId=9&amp;pageNumber=2" onclick="javascript:pageTracker._trackPageview('/www.computerworld.com');">two</a> <a href="http://blogs.computerworld.com/node/512" onclick="javascript:pageTracker._trackPageview('/blogs.computerworld.com');">pieces</a></span><span> for </span><em><span>Computerworld</span></em><span> advocating aggressive pursuit of new data sources, and the examples I mentioned were:</span></p>
<ul>
<li><span>Loyalty cards, especially 	in gaming</span></li>
<li>Location-based analytics</li>
<li>Extra customer feedback (e.g., 	opinion surveys)</li>
<li>Price/offer testing</li>
<li>Text mining 	in general</li>
<li>Medical 	records</li>
</ul>
<p style="margin-bottom: 0in;">Today I&#8217;d add (among others):</p>
<ul>
<li>RFID</li>
<li>The raw 	output from medical test devices</li>
<li>Sensors up and down the energy supply chain</li>
</ul>
<p style="margin-bottom: 0in; font-style: normal;">But some of those older, low-data-volume ideas still head my list of low-hanging analytic fruit.</p>
<p style="margin-bottom: 0in; font-style: normal;"><span>One more complication – these buckets I&#8217;m outlining are less than precise. For example:</span></p>
<ul>
<li><span>Telecom 	CDRs (Call Detail Records) are machine-generated from a seed of 	human activity. They have long been stored, but now are being kept 	in much more detail. This is why telecommunications is one of the 	top markets for data warehouse technology.</span></li>
<li><span>Stock 	trade data used to be based on human decisions. Now most of it is 	just machines buying and selling from each other. Either way, 	increasingly many investment institutions want to keep 	100-terabyte-scale databases of complete historical trade detail. 	And that is why financial services is another huge market for data 	warehouse technology.</span></li>
<li><span>Not long ago, web and network event logs didn&#8217;t even exist, or were tiny where they did. Now they fill the largest known commercial databases, at firms such as </span><span><a href="http://www.dbms2.com/2009/10/01/yahoos-decapetabyte-data-warehousinghadoop/" >Yahoo</a>, <a href="http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/" >eBay</a>, and <a href="http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/" >Facebook</a>.</span><span> Even so, more is thrown away than kept, especially on the network event side, which is a multiple of the size of the pure clickstream data.</span></li>
<li><span>We 	don&#8217;t know exactly what all data intelligence agencies collect from 	telemetry, from monitoring commercial telecommunication traffic, and 	so on. But they&#8217;re surely throwing the vast majority away, even as 	the small part they keep is </span><span><a href="http://www.dbms2.com/2009/09/30/facts-and-rumors/" >petabyte-scale</a>.</span></li>
</ul>
<p style="margin-bottom: 0in; font-style: normal;">But none of that interferes with my main points, which are:</p>
<ul>
<li><strong>Databases 	will continue to grow very quickly.</strong></li>
<li>One big driver 	is <strong>the increasing detail in which data is kept online.</strong></li>
<li>An even bigger 	driver will be <strong>the unending ability of machines to generate ever 	greater streams of at-least-somewhat interesting data.</strong></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/12/07/data-warehouse-volume-growth/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Webinar on MapReduce for complex analytics (Thursday, December 3, 10 am and 2 pm Eastern)</title>
		<link>http://www.dbms2.com/2009/12/02/mapreduce-for-complex-analytics-webina/</link>
		<comments>http://www.dbms2.com/2009/12/02/mapreduce-for-complex-analytics-webina/#comments</comments>
		<pubDate>Wed, 02 Dec 2009 20:57:50 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Data integration and middleware]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1267</guid>
		<description><![CDATA[The second in my two-webinar series for Aster Data will occur tomorrow, twice (both live), at 10 am and 2 pm Eastern time. The other presenters will be Jonathan Goldman, who was Principal Scientist at LinkedIn but now has joined Aster himself, and Steve Wooledge of Aster (playing host). Key links are:

Registration for tomorrow&#8217;s webinars
Replay [...]]]></description>
			<content:encoded><![CDATA[<p>The second in my two-webinar series for Aster Data will occur tomorrow, twice (both live), at 10 am and 2 pm Eastern time. The other presenters will be Jonathan Goldman, who was Principal Scientist at LinkedIn but now has joined Aster himself, and Steve Wooledge of Aster (playing host). Key links are:</p>
<ul>
<li>Registration for <a href="http://www.asterdata.com/wc_091203_masteringmapreduce/" onclick="javascript:pageTracker._trackPageview('/www.asterdata.com');">tomorrow&#8217;s webinars</a></li>
<li>Replay of the <a href="http://www.asterdata.com/masteringmapreduce2/" onclick="javascript:pageTracker._trackPageview('/www.asterdata.com');"> first webinar</a></li>
<li>My slides from the <a href="http://www.dbms2.com/2009/10/15/mapreduce-webinar-slides/" >first webinar</a></li>
</ul>
<p>The main subjects of the webinar will be:</p>
<ul>
<li>Some review of material from the first webinar (all three presenters)</li>
<li>Discussion of how MapReduce can help with three kinds of analytics:
<ul>
<li>Pattern matching (Jonathan will give detail)</li>
<li>Number-crunching (I&#8217;ll cover that, and it will be short)</li>
<li>Graph analytics (I haven&#8217;t written the slides yet, but my starting point will be some of the <a href="http://www.dbms2.com/2009/08/21/social-network-analysis-aka-relationship-analytics/" >relationship analytics</a> ideas we discussed in August)</li>
</ul>
</li>
</ul>
<p>Arguably, aspects of data transformation fit into each of those three categories, which may help explain why data transformation has been so prominent among the early applications of MapReduce.</p>
<p>As you can see from Aster&#8217;s title for the webinar (which they picked while I was on vacation), at least their portion will be focused on customer analytics, e.g. web analytics.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/12/02/mapreduce-for-complex-analytics-webina/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Boston Big Data Summit keynote outline</title>
		<link>http://www.dbms2.com/2009/11/23/boston-big-data-summit-keynote-outline/</link>
		<comments>http://www.dbms2.com/2009/11/23/boston-big-data-summit-keynote-outline/#comments</comments>
		<pubDate>Mon, 23 Nov 2009 06:25:50 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Archiving and information preservation]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[DBMS product categories]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Humor]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Market share]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Presentations]]></category>
		<category><![CDATA[Pricing]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Storage]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Theory and architecture]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1227</guid>
		<description><![CDATA[Last month, Bob Zurek asked me to give a talk on “Big Data”, where “big” is anything from a few terabytes on up, then moderate a panel on cloud computing. We agreed that I could talk just from notes, without slides. So, since I have them typed up, I&#8217;m posting them below.

The top two points [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Last month, Bob Zurek asked me to give a talk on <a href="http://www.dbms2.com/2009/10/09/presentations-upcoming/" >“Big Data”, where “big” is anything from a few terabytes on up</a>, then moderate a panel on cloud computing. We agreed that I could talk just from notes, without slides. So, since I have them typed up, I&#8217;m posting them below.</p>
<p><span id="more-1227"></span></p>
<p style="margin-bottom: 0in;">The top two points from Q&amp;A probably were:</p>
<ul>
<li><strong>Big Data and the cloud actually 	have relatively little to do with each other,</strong> <a href="http://www.dbms2.com/2009/10/30/aster-data-application-server-ncluster/" >a few exceptions</a> notwithstanding, especially if the data is in a shared-nothing DBMS 	(as opposed to, say, a MapReduce-oriented file cluster). Two 	principal reasons are:
<ul>
<li>Redistributing data from node to 	node is a little slow, undermining some of the elasticity benefits 	of the cloud.</li>
<li><a href="http://www.dbms2.com/2009/05/29/sneakernet-to-the-cloud/" >Getting data into the cloud in the 	first place is a lot slow</a>.</li>
</ul>
</li>
<li><strong>The NoSQL movement is a lot like 	the Ron Paul campaign</strong> &#8212; it consists of people who are dissatisfied 	with the status quo, whose dissatisfaction has a lot to do with 	insufficient liberty and/or excessive expenditure, and who otherwise 	don&#8217;t have a whole lot in common with each other.</li>
</ul>
<p style="margin-bottom: 0in;">Anyhow, here are my notes for the talk, edited in just a couple of places for readability or linkage.</p>
<p style="margin-bottom: 0in;">
<p style="margin-bottom: 0in;">
<p style="margin-bottom: 0in;"><strong>Quick introduction</strong></p>
<ul>
<li>Big Data vs. cloud</li>
<li>How big is Big Data?</li>
<li>At the low end of that range, 	there&#8217;s little you can&#8217;t do with conventional technology if you 	have:
<ul>
<li>An unlimited budget for hardware</li>
<li>An unlimited budget for software</li>
<li>An unlimited budget for people, 	especially Oracle DBAs</li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;"><strong>Big Data in OLTP</strong></p>
<ul>
<li>Hard-core OLTP
<ul>
<li>Focus of DBMS technology for a long time</li>
<li>Big budgets because each 	transaction has significant value</li>
<li>Tough to get users to change 	technologies</li>
</ul>
</li>
<li>Lighter-weight OLTP
<ul>
<li>Classic example = web companies
<ul>
<li>Big ones &#8212;  retail-oriented ones 	(eBay, Amazon) partially excepted &#8212; <a href="http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/" >rolled their own technology 	stacks</a></li>
<li>Reluctant to give money to anybody
<ul>
<li>Open source, etc.</li>
</ul>
</li>
</ul>
</li>
<li>Difficulty finding market
<ul>
<li>Product vs. feature
<ul>
<li>Clustering/HA/DR/whatever</li>
<li>Ditto cloud enablement</li>
</ul>
</li>
<li>True products haven&#8217;t found much 	traction yet</li>
</ul>
</li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;"><strong>Analytic Big Data use cases</strong></p>
<ul>
<li>Kinds of data for analytics
<ul>
<li>More of same != big</li>
<li>More detail and/or new kinds
<ul>
<li>Complete data sets</li>
<li>Transactions</li>
<li>Call details</li>
<li>Tick/trade history</li>
<li>Web clickstreams</li>
<li>Network event logs</li>
<li>Other machine-generated data</li>
<li>CAM bottom line
<ul>
<li>Anything human-generated should 	and will be retained in its entirety</li>
<li>Quantities of machine-generated 	data retained should and will grow roughly in line w/ computing cost 	reductions (Moore&#8217;s Law, etc.)</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
<li>Analytic uses of Big Data
<ul>
<li>Analytics is mainly about three 	things
<ul>
<li>Problem detection</li>
<li>Customer relationship improvement
<ul>
<li>(Those overlap when the customer 	relationship is bad)</li>
</ul>
</li>
<li>Financial statements on steroids</li>
</ul>
</li>
</ul>
<ul>
<li>Main kinds of analytics
<ul>
<li>What BI vendors traditionally sell
<ul>
<li>General reporting and dashboards</li>
<li>Ad-hoc query (now driven from 	those reports and dashboards)</li>
<li>Planning (allegedly integrated 	with BI)</li>
</ul>
</li>
<li>Research
<ul>
<li>Ad hoc relational query (worth 	mentioning twice because it drives so much of the market)</li>
<li>Data mining</li>
<li>Most web search and web mining</li>
</ul>
</li>
<li>Operational/near-real-time</li>
<li>Archiving/compliance</li>
</ul>
</li>
<li>What gets Big?
<ul>
<li>Mainly research and archiving</li>
<li>But when reporting or operational 	get Big, you have really interesting computing problems</li>
</ul>
</li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;"><strong>Technology issues and trends</strong></p>
<ul>
<li>Moore&#8217;s Law
<ul>
<li>CPUs &#8212; All about cores, hence 	parallelism is key</li>
<li>RAM</li>
<li>SSDs – hence replace disks</li>
<li>Sensors – hence generate lots 	more data</li>
</ul>
</li>
<li>Kryder&#8217;s Law
<ul>
<li>But <a href="http://www.dbms2.com/2005/11/13/breaking-the-disk-speed-barrier/" >rotational speeds up only 	12.5X since Eisenhower Administration</a></li>
<li>Hence solid-state memory (or RAM) 	will soon take over</li>
</ul>
</li>
<li>In the meantime, I/O bottlenecks have had to be beaten
<ul>
<li>Hence sequential scans</li>
<li>Hence <a href="http://www.dbms2.com/2007/03/26/index-light-mpp-data-warehouse-appliances/" >index-light</a> architectures</li>
<li>Hence columnar</li>
</ul>
</li>
<li>DBMS “overhead”
<ul>
<li>Raw license and maintenance fees – 	software increasing fraction of total</li>
<li>OLTP vestiges – locking and all 	that</li>
<li>DBAs
<ul>
<li>People costs = huge fraction of 	total</li>
<li>Index-lightness addresses</li>
<li>So does appliance</li>
</ul>
</li>
<li>Many people don&#8217;t really know how to 	write SQL</li>
</ul>
</li>
<li>Configuration
<ul>
<li>Appliance/tightly-balanced
<ul>
<li>Netezza</li>
<li>Teradata earlier</li>
<li>Greenplum/Sun</li>
<li>Oracle</li>
<li>IBM</li>
<li>Microsoft/Madison</li>
</ul>
</li>
<li>Commodity/do what you want
<ul>
<li>Vertica</li>
<li>Greenplum now</li>
<li>Infobright, Aster and others</li>
<li>MapReduce-oriented file systems</li>
</ul>
</li>
<li><a href="http://www.dbms2.com/2009/10/25/data-warehouse-balanced-hardware-configuration/" >Extreme rigidity is silly</a>
<ul>
<li><a href="http://www.dbms2.com/2009/10/25/teradata-hardware-strategy-and-tactics/" >Teradata, Oracle have both 	signaled moving to more modularity</a></li>
<li>Big driver of that = heterogeneous 	storage
<ul>
<li>Cheap disk</li>
<li>Expensive disk</li>
<li>Solid-state</li>
<li>RAM</li>
</ul>
</li>
</ul>
<ul>
<li>CPU/storage ratio is even more of a 	driver</li>
</ul>
</li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;"><strong>Theoretically defensible ways to segment the market</strong></p>
<ul>
<li><a href="http://www.dbms2.com/2009/09/10/analytic-speed-latency/" >Latency requirements</a>
<ul>
<li>High availability and low latency 	go together</li>
</ul>
</li>
<li>Query types
<ul>
<li>Simultaneous users for same</li>
</ul>
</li>
<li>Database size</li>
<li>Budget</li>
</ul>
<p style="margin-bottom: 0in;"><strong>Actual segments right now</strong></p>
<ul>
<li><a href="http://www.dbms2.com/2009/08/24/teradatas-active-enterprise-data-warehouse-story/" >Utter ADW/EDW</a></li>
<li>Data mart
<ul>
<li>Size</li>
<li>Naturally columnar vs. naturally 	row-based</li>
</ul>
</li>
<li>Operational/frontline</li>
<li>Less dramatic/smaller EDW</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/11/23/boston-big-data-summit-keynote-outline/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>
