<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; Web analytics</title>
	<atom:link href="http://www.dbms2.com/category/applications/web-analytics/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Thu, 02 Sep 2010 09:06:44 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Big Data is Watching You!</title>
		<link>http://www.dbms2.com/2010/08/11/big-data-is-watching-you/</link>
		<comments>http://www.dbms2.com/2010/08/11/big-data-is-watching-you/#comments</comments>
		<pubDate>Wed, 11 Aug 2010 05:30:22 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2760</guid>
		<description><![CDATA[There&#8217;s a boom in large-scale analytics. The subjects of this analysis may be categorized as:

People
Financial trades
Electronic networks
Everything else

The most varied, interesting, and valuable of those four categories is the first one.

That may change some day, with the growing importance of machine-generated data, and of big-data science in particular. But I think it&#8217;s a fair assessment [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">There&#8217;s a boom in large-scale analytics. The subjects of this analysis may be categorized as:</p>
<ul>
<li>People</li>
<li>Financial trades</li>
<li>Electronic networks</li>
<li>Everything else</li>
</ul>
<p style="margin-bottom: 0in;">The most varied, interesting, and valuable of those four categories is the first one.</p>
<p><span id="more-2760"></span></p>
<p style="margin-bottom: 0in;"><em>That may change some day, with the growing importance of <a href="http://www.dbms2.com/2010/04/08/machine-generated-data-example/" >machine-generated data</a>, and of <a href="http://www.dbms2.com/2009/10/03/issues-in-scientific-data-management/" >big-data science</a> in particular. But I think it&#8217;s a fair assessment at present, and for at least the next few years.</em></p>
<p style="margin-bottom: 0in;">Some of the most interesting use cases are concentrated in the areas of identifying individuals, groups of people, or behaviors of (groups of) people. For example:</p>
<ul>
<li>comScore works hard to <strong>identify individual web surfers</strong> (i.e., to <strong>deanonymize</strong> them) even though they may have given incomplete or false personal information.</li>
<li>Other companies at least try to 	figure out <strong>which information in a user&#8217;s profile is unreliable,</strong> so as to classify them better. (Yes, there are 62-year-old 	video-game-obsessed Lady Gaga fans, but that&#8217;s generally not the way 	to bet.)</li>
<li>Multiple telecom vendors try to 	identify who their <strong>most influential customers</strong> are (to a first 	approximation, they&#8217;re the ones most often called by the most 	people, but it surely gets more sophisticated than that). This 	information is then used to reduce churn, either by working hard to 	retain those users, or – if they do churn – to move very fast to 	retain the business from their friends.</li>
<li>Other kinds of companies do 	similar kinds of analysis, to the extent that they have enough of a 	social graph to do so. (This application is a case where the term 	“<a href="http://www.dbms2.com/2010/06/08/profile-of-revealed-preferences/" >social graph</a>” is not a misnomer.)</li>
<li><strong>Turing detectives</strong> (I just 	coined that phrase) try to determine whether users are humans or 	bots.</li>
<li>Central to detecting <strong>insurance 	fraud</strong> is identifying suspiciously close connections between 	claimants, service providers, and so on.</li>
<li>Identifying groups of people is 	also important in flagging <strong>insider trading.</strong><span style="font-weight: normal;"> Even more important are other kinds of analysis, along the lines of 	“is this normal innocent trading behavior?” </span></li>
<li><span style="font-weight: normal;">Intelligence 	agencies try to detect networks of </span><strong>terrorists</strong><span style="font-weight: normal;"> and their sympathizers. They further try to identify unusual 	patterns of communication or meetings along those networks that 	might indicate terrorist acts are being planned. (Civilian law 	enforcement agencies can use similar techniques.)</span></li>
</ul>
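<p style="margin-bottom: 0in;">The telecom &#8220;first approximation&#8221; above (the most influential customers are the ones called by the most people) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the function name <code>rank_by_distinct_callers</code>, and the sample names are all hypothetical, and real influence scoring surely gets more sophisticated than counting distinct callers.</p>

```python
from collections import defaultdict


def rank_by_distinct_callers(call_records):
    """Rank customers by how many distinct people call them.

    call_records is an iterable of (caller, callee) pairs. This layout is
    an assumption made for illustration, not any vendor's actual schema.
    """
    callers_of = defaultdict(set)
    for caller, callee in call_records:
        if caller != callee:  # ignore self-calls
            callers_of[callee].add(caller)  # repeat calls count once
    # Sort by distinct-caller count (descending), then by name for stable ties.
    return sorted(
        ((callee, len(callers)) for callee, callers in callers_of.items()),
        key=lambda pair: (-pair[1], pair[0]),
    )


# Hypothetical call log: (caller, callee)
calls = [
    ("ann", "bob"),
    ("cat", "bob"),
    ("dan", "bob"),
    ("ann", "cat"),
    ("bob", "cat"),
    ("ann", "bob"),  # repeat call, not counted twice
]
ranking = rank_by_distinct_callers(calls)
print(ranking)  # [('bob', 3), ('cat', 2)]
```

<p style="margin-bottom: 0in;">Counting distinct callers rather than raw call volume keeps one chatty friend from making somebody look influential.</p>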
<p style="margin-bottom: 0in; font-weight: normal;">In most cases, the analysis and/or run-time execution of the relevant models is done with the help of analytic DBMS. Other technologies that come into play include non-DBMS MapReduce (Hadoop), graph engines, and CEP (Complex Event Processing). The vendor most heavily represented on that list is probably Aster Data, because:</p>
<ul>
<li>Aster Data is 	focused on hard-core analytics.</li>
<li>I talk a lot 	with Aster Data, and in particular had a long, detailed use-cases 	discussion with them last week.</li>
<li><span style="font-weight: normal;">The 	comScore example happens to come from a speaker at </span><a href="http://www.dbms2.com/2010/05/07/implications-onew-analytic-technology/" ><span style="font-weight: normal;">an 	Aster event</span></a><span style="font-weight: normal;"> I also 	participated in.</span></li>
</ul>
<p style="margin-bottom: 0in;">And by the way, all this only scratches the surface of what will be possible down the road. It&#8217;s based mainly on where you live, what you purchase, how you behave on websites, and who you communicate with. <a href="../2010/07/04/fair-data-use/">Other kinds of data, which could be used to be yet more intrusive</a>, generally aren&#8217;t involved.</p>
<p style="margin-bottom: 0in;"><span style="font-weight: normal;">I actually have two points in drawing up this list. One is golly-gee-whiz about how a lot of analytically sophisticated applications are actually getting into production. The other is to highlight the privacy and liberty threats If This Goes On Unchecked (which is why I didn&#8217;t include some other less-people-focused examples). There&#8217;s also a related danger that, to the extent we don&#8217;t get some smart regulations to keep us safe(r), we&#8217;ll get a bunch of stupid regulations instead. </span></p>
<p style="margin-bottom: 0in;">The Analytic Era has only just begun.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/08/11/big-data-is-watching-you/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Why you should go to XLDB4</title>
		<link>http://www.dbms2.com/2010/07/01/why-you-should-go-to-xldb4/</link>
		<comments>http://www.dbms2.com/2010/07/01/why-you-should-go-to-xldb4/#comments</comments>
		<pubDate>Thu, 01 Jul 2010 04:23:24 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2456</guid>
		<description><![CDATA[Scientific data commonly:

Comes in large volumes
Is machine-generated
Is augmented by synthetic and/or 	derived data
Has a spatial and/or temporal 	structure

In those respects, it is akin to some of the hottest areas for big data analytics, including:

Investment trade data – big, 	partly machine generated, augmented (often), temporal
Web/network log data – big, 	machine-generated, post-processed into derived form, temporal
Marketing analytic [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Scientific data commonly:</p>
<ul>
<li>Comes in large volumes</li>
<li>Is machine-generated</li>
<li>Is augmented by synthetic and/or 	derived data</li>
<li>Has a spatial and/or temporal 	structure</li>
</ul>
<p style="margin-bottom: 0in;">In those respects, it is akin to some of the hottest areas for big data analytics, including:</p>
<ul>
<li>Investment trade data – big, 	partly machine generated, augmented (often), temporal</li>
<li>Web/network log data – big, 	machine-generated, post-processed into derived form, temporal</li>
<li>Marketing analytic data – big, 	post-processed into derived form</li>
<li>Genomic data</li>
</ul>
<p style="margin-bottom: 0in;">So when Jacek Becla started the <a href="http://www.dbms2.com/2009/09/12/xldb-scid/" >XLDB</a> conferences on the premise that <strong>scientific and big data analytic challenges have a lot in common,</strong> he had a point. There are several tough database problems that the science-focused folks have taken the lead in thinking about, but which are soon going to matter to the commercial world as well. And that&#8217;s one of two big reasons why you should consider participating in <a href="http://www-conf.slac.stanford.edu/xldb10/" onclick="javascript:pageTracker._trackPageview('/www-conf.slac.stanford.edu');">XLDB4, October 6-7, at the SLAC facility in Menlo Park, CA</a>, as an attendee, sponsor, or both.</p>
<p style="margin-bottom: 0in;">The other big reason is that it is <strong>important for the world that XLDB succeed.</strong> <span id="more-2456"></span>Computer technology to analyze global warming is lacking; better database technology is one of the ways it could improve. Database technology also has important potential contributions to make in medical research and other worthy endeavors, and in a lot of purer science too (Jacek himself is an astronomy guy).</p>
<p style="margin-bottom: 0in;">Other reasons to get involved with XLDB4 include:</p>
<ul>
<li><strong>It doesn&#8217;t cost much.</strong> The 	whole thing is done in academic conference dollar amounts, with low 	attendance fees, the venue use probably donated by SLAC, and (I 	would guess) low sponsorship fees as well.</li>
<li><strong>It&#8217;s fun.</strong> I think I attended more sessions in two days at XLDB3 than at all the other conferences I went to last year combined.</li>
</ul>
<p style="margin-bottom: 0in;">All in all, I intend to spend three whole days attending conference sessions that week, which is something I almost never do. But it&#8217;s for my two favorite causes – scientific data management and liberty/privacy. Stay tuned for news on the latter front soon.</p>
<p style="margin-bottom: 0in;"><em><strong>Related links</strong></em></p>
<ul>
<li>The <a href="http://www-conf.slac.stanford.edu/xldb10/" onclick="javascript:pageTracker._trackPageview('/www-conf.slac.stanford.edu');">XLDB4 conference web site</a></li>
<li>My original post on <a href="http://www.dbms2.com/2009/10/03/issues-in-scientific-data-management/" >issues in scientific data management</a>, most of which will be discussed at XLDB4</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/07/01/why-you-should-go-to-xldb4/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Cloudera Enterprise and Hadoop evolution</title>
		<link>http://www.dbms2.com/2010/06/30/cloudera-enterprise-hadoop-evolution/</link>
		<comments>http://www.dbms2.com/2010/06/30/cloudera-enterprise-hadoop-evolution/#comments</comments>
		<pubDate>Wed, 30 Jun 2010 17:22:27 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Data integration and middleware]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Market share]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Pricing]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2440</guid>
		<description><![CDATA[I talked with Cloudera a couple of weeks ago in connection with the impending release of Cloudera Enterprise. I&#8217;d say:  

If you are or want to be a serious 	MapReduce user – and you&#8217;re past the “play around over the 	weekend” stage &#8212; you probably should have either:

A serious non-DBMS MapReduce 	distribution.
MapReduce integrated into your [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">I talked with Cloudera a couple of weeks ago in connection with the impending release of Cloudera Enterprise. I&#8217;d say:  <span id="more-2440"></span></p>
<ul>
<li>If you are or want to be a serious MapReduce user – and you&#8217;re past the “play around over the weekend” stage – you probably should have either:
<ul>
<li>A serious non-DBMS MapReduce 	distribution.</li>
<li>MapReduce integrated into your 	analytic DBMS.</li>
<li>Both.</li>
</ul>
</li>
<li>The obvious choice for non-DBMS 	MapReduce is Hadoop.</li>
<li>The obvious choice for a Hadoop 	distribution is <strong>Cloudera Enterprise.</strong></li>
<li>Cloudera Enterprise has three main 	aspects, in an inseparable bundle:
<ul>
<li>Distributions for a double-digit 	number of open source projects. It&#8217;s nice having all that in one 	package – unless, of course, you like playing with Tinkertoys.</li>
<li>Proprietary Cloudera code.</li>
<li>Cloudera support.</li>
</ul>
</li>
<li>Cloudera says its proprietary code is, and is planned to remain, concentrated (at least in large part) on integrating open source technology with closed source products. This has the virtue of being targeted directly at that segment of the market which has proven it&#8217;s actually willing to pay money for software.</li>
<li>Cloudera Enterprise areas of 	focus, now and in the presumed future, include:
<ul>
<li><strong>Core Hadoop engine,</strong> which Cloudera says is quite predictably and appropriately evolving more slowly than the tools around it.</li>
<li><strong>Development, management and 	administrative tools,</strong> including:
<ul>
<li><strong>Pig</strong> and <strong>Hive</strong>. Cloudera says &gt;70% 	of Facebook Hadoop jobs are initiated through Hive, and the same is 	true of Yahoo and Pig.</li>
<li>Connectivity to commercial tools.</li>
<li>The product formerly known as 	“Cloudera Desktop.”</li>
</ul>
</li>
<li><strong>Workflow</strong>, which in this context 	refers to letting you create a Hadoop application as a sequence of 	small steps, rather than forcing you to kluge it into being one 	unwieldy thing. At the moment, this is much less widely adopted than 	Pig and Hive, but Cloudera has high hopes for it, because of its 	obvious benefits in modularity and manageability.</li>
<li><strong>Quasi-DBMS technology.</strong> Besides Hive and Pig, this includes <strong>HBase.</strong> Cloudera says there has 	been considerable demand for HBase, and it is pleased that project 	is now mature enough to ship. Cloudera stresses that it intends 	HBase not for OLTP, but as an adjunct to analytic processing. E.g., 	Cloudera suggests HBase would be a fine vehicle for replicating 	dimension tables across each node of a cluster.</li>
<li><strong>Data connectivity, </strong><span style="font-weight: normal;">e.g. 	to MySQL or to sensor log files.</span></li>
</ul>
</li>
<li>Cloudera Enterprise pricing is 	well below DBMS prices – not by a full order of magnitude, if I&#8217;m 	right about everybody&#8217;s quantity discount policies, but even so by a 	lot. Details are NDA.</li>
</ul>
<p style="margin-bottom: 0in;">Cloudera sometimes sends confusing signals about its beliefs and strategies. For example, one can get different stories depending on whether one talks to:</p>
<ul>
<li>Somebody at Cloudera who comes 	primarily from the user and open source communities.</li>
<li>Somebody at Cloudera who has 	actually worked at a software company before.</li>
</ul>
<p style="margin-bottom: 0in;">But I predict that Cloudera will now stick for a while with more or less the strategy outlined above.</p>
<p style="margin-bottom: 0in;">Naturally, we also talked about Hadoop adoption. Highlights of that part – no doubt somewhat biased towards Cloudera&#8217;s own customer base – included:</p>
<ul>
<li>Notwithstanding <a href="http://www.dbms2.com/2009/04/14/ebay-thinks-mpp-dbms-clobber-mapreduce/" >eBay&#8217;s prior 	skepticism about MapReduce</a>, it is quoted saying nice things in a Cloudera press release, 	and has apparently become quite a large Hadoop user, starting out 	with a search-quality use case.</li>
<li>Typical Hadoop deployment sizes 	are 10 nodes or so when experimenting, 80-500+ in production.</li>
<li>10 terabytes/node – I&#8217;m pretty 	sure Cloudera meant of user data &#8212; is not inconceivable, so a 	cost-conscious 500-node user could have 5 petabytes of data managed 	by Hadoop.</li>
<li>Cloudera has half a dozen 	customers at the 75+ node production level.</li>
<li>Web and financial services are the 	two vertical markets moving most aggressively into Hadoop 	production. The government is also in significant Hadoop production, 	but the details of that are classified.</li>
<li>Web uses for Hadoop include:
<ul>
<li>Clickstream – sessionization, 	etc. – that&#8217;s a super-mainstream use.</li>
<li>Search – analyzing search 	attempts in conjunction with structured data.</li>
<li>Machine learning (for ad serving, 	etc.).</li>
</ul>
</li>
<li>Financial services uses for Hadoop 	include:
<ul>
<li>Internal trading rule 	enforcement/fraud detection.</li>
<li>Complex ETL.</li>
<li>Portfolio risk assessment 	(typically overnight).</li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;">None of this is inconsistent with previous surveys of <a href="http://www.dbms2.com/2009/10/10/enterprises-using-hadoo/" >Hadoop use cases</a>.</p>
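<p style="margin-bottom: 0in;">The 5-petabyte figure quoted above is simple multiplication, and it can be handy to redo it for other cluster sizes. A minimal sketch, assuming decimal units (1 PB = 1000 TB) and that the per-node figure already means user data, as I believe Cloudera meant; the function name <code>cluster_capacity_pb</code> is made up for illustration.</p>

```python
TB_PER_PB = 1000  # assumption: decimal units, 1 PB = 1000 TB


def cluster_capacity_pb(nodes, user_tb_per_node):
    """User data a cluster could hold, in petabytes.

    Assumes user_tb_per_node is already net of replication and other
    overhead, i.e. it is user data rather than raw disk.
    """
    return nodes * user_tb_per_node / TB_PER_PB


large = cluster_capacity_pb(500, 10)  # the cost-conscious 500-node case
small = cluster_capacity_pb(80, 10)   # low end of the production range
print(large, small)  # 5.0 0.8
```

<p style="margin-bottom: 0in;">So at the same 10 TB/node density, the quoted 80-500+ node production range would span roughly 0.8 to 5 petabytes of user data.</p>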
<p style="margin-bottom: 0in; font-style: normal;">Various users talked at the Hadoop Summit this week. I wasn&#8217;t there, and won&#8217;t write about their stories for now. That said, <a href="http://www.slideshare.net/kevinweil/hadoop-at-twitter-hadoop-summit-2010" onclick="javascript:pageTracker._trackPageview('/www.slideshare.net');">Twitter&#8217;s slide deck</a> from same has some interesting stuff, including:</p>
<ul>
<li><span style="font-style: normal;">7 	TB/day ETLed from MySQL.</span></li>
<li><span style="font-style: normal;">Petabytes-being-stored 	accordingly coming soon.</span></li>
<li><span style="font-style: normal;">Open 	sourcing their ETL tool Crane.</span></li>
<li><span style="font-style: normal;">3-4X 	LZO compression at little CPU cost.</span></li>
<li><span style="font-style: normal;">HBase is more usable for them than HDFS, which isn&#8217;t mutable enough.</span></li>
<li><span style="font-style: normal;">Pig takes about 5% of the code and coding effort of vanilla Hadoop, at a performance hit of 30% or less.</span></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/06/30/cloudera-enterprise-hadoop-evolution/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>The most important part of the “social graph” is neither social nor a graph</title>
		<link>http://www.dbms2.com/2010/06/08/profile-of-revealed-preferences/</link>
		<comments>http://www.dbms2.com/2010/06/08/profile-of-revealed-preferences/#comments</comments>
		<pubDate>Tue, 08 Jun 2010 05:18:36 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Games and virtual worlds]]></category>
		<category><![CDATA[Liberty and privacy]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2235</guid>
		<description><![CDATA[“Social graph” is a highly misleading term, and so is “social network analysis.” By this I mean:
There&#8217;s something akin to “social graphs” and “social network analysis” that is more or less worthy of all the current hype – but graphs and network analysis are only a minor part of the whole story.
In particular, the most [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">“Social graph” is a highly misleading term, and so is “social network analysis.” By this I mean:</p>
<p><strong>There&#8217;s something akin to “social graphs” and “social network analysis” that is more or less worthy of all the current hype – but graphs and network analysis are only a minor part of the whole story.</strong></p>
<p style="margin-bottom: 0in;">In particular, <strong>the most important parts of the Facebook “social graph” are neither social nor a graph. </strong><span style="font-weight: normal;">Rather, what&#8217;s really important is an aggregate</span><strong> Profile of Revealed Preferences</strong><span style="font-weight: normal;">, of which person-to-person connections or other things best modeled by a graph play only a small part.</span></p>
<p style="margin-bottom: 0in;"><span id="more-2235"></span>Let me hasten to note that – even when viewed narrowly – the ideas of “social graph” and “<a href="../2009/08/21/social-network-analysis-aka-relationship-analytics/">social network analysis</a>” do have significance. Nontrivial use cases to date for big data social network analysis include:</p>
<ul>
<li>Intelligence agencies identify and 	analyze terrorist networks. Corporations and civilian law 	enforcement do the same for fraud networks.</li>
<li>Telephone companies use calling 	data to figure out which of their customers are most likely to 	influence which other customers in the decision to keep or change 	service providers. (Frankly, I find that rather creepy.)</li>
<li>Social networks figure out which 	other members you&#8217;re likely to know, and encourage you to connect 	with them.</li>
</ul>
<p style="margin-bottom: 0in;">Epidemiologists aspire to add to that list, based on their success to date using much more micro forms of social network analysis. But after that, I&#8217;m running out of examples. Sure, graph analytics is good for a bunch of other things (e.g., biology at the genetic or molecular level), but those have little or nothing to do with “social graphs” or social network analysis as they are commonly understood.</p>
<p style="margin-bottom: 0in;"><em>Note: Of course, it is also the case that everything can be modeled by entity-attribute-value triples, and those can always be modeled by graphs. But so what?</em></p>
<p style="margin-bottom: 0in;">Let&#8217;s consider what, in a marketer&#8217;s ideal world, would go into your Profile of Revealed Preferences. Raw data might include:</p>
<ul>
<li><strong>Personally identifyING 	information. </strong>Duh. This is what makes everything else possible.</li>
<li><strong>Purchase transaction data.</strong> Lots of it. Like, everything on your credit card statements.</li>
<li><strong>Demographic and lifestyle 	information.</strong> Address, date of birth, educational history, race, 	household composition, and so on.</li>
<li><strong>Affiliations.</strong> Politics, 	religion, group membership of any kind. (OK, that&#8217;s partly social.)</li>
<li><strong>Explicitly stated opinions, 	preferences and desires,</strong><span style="font-weight: normal;"> including:</span>
<ul>
<li>Any surveys you have filled out.</li>
<li><strong>Any recommendations you have 	made</strong> (e.g., through the Facebook Like feature).</li>
<li>The text of anything you&#8217;ve 	written and posted – and, very ideally, of your private emails as 	well.</li>
<li>Any <strong>wish lists</strong> you&#8217;ve 	filled in.</li>
</ul>
</li>
<li><strong>Attention information.</strong> What 	you clicked on, what you looked at, and all that stuff website 	owners track.</li>
<li><strong>Your movements, </strong><span style="font-weight: normal;">to 	the extent they are tracked. (E.g., via Foursquare and the like.)</span></li>
<li><strong>Your gaming activities</strong><span style="font-weight: normal;"> and the like. (This is social mainly to the extent it overlaps with 	other categories I&#8217;ve already mentioned.)</span></li>
<li><strong>Your medical information.</strong><span style="font-weight: normal;"> </span></li>
<li><strong>Who you communicate with, and 	what you communicate with them about.</strong><span style="font-weight: normal;"> (Hey! There&#8217;s something else social!)</span></li>
<li><span style="font-weight: normal;">Similar </span><strong>information about the people you communicate with.</strong></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-weight: normal;">My core </span><strong>privacy</strong><span style="font-weight: normal;"> thoughts about that data include:</span></p>
<ul>
<li><strong>Individuals deserve the right 	to control all that information.</strong><span style="font-weight: normal;"> At a minimum, they deserve total control over how the data (raw or 	derived) is passed from the service – e.g., website – where it 	naturally resides (e.g., where it is originated) to any other place.</span></li>
<li>Given a chance, <strong>individuals would make fine-grained choices about what parts of their Profile of Revealed Preferences are available to which organizations.</strong> Reasons include:
<ul>
<li>Individuals have rather complex trust relationships with different kinds of merchants and marketers.</li>
<li>Consumers get different benefits from sharing information with different kinds of merchants and marketers. (Sometimes personalization is a large benefit. Sometimes it&#8217;s just creepy. And some companies actively bribe you to give them information they can use to sell to you.)</li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;">When one frames things this way, two rather difficult technological questions naturally arise.</p>
<ol>
<li>Suppose, implausibly, that a 	single entity were allowed to control and use (for marketing) all of 	your Profile of Revealed Preferences information. How would they 	store and analyze it?</li>
<li>How does the answer to #1 change 	because control over the information will, in fact, be fragmented?</li>
</ol>
<p style="margin-bottom: 0in;">It&#8217;s tough enough to answer these questions for data about one person. Trying to include all but the simplest information about other people is and will for years remain quite infeasible. So, for the most part, <strong>this is not “social” information.</strong></p>
<p style="margin-bottom: 0in;">It&#8217;s also <strong>not naturally a “graph.”</strong> Similarly, it is <strong>not a good candidate for network analysis.</strong> To see why, let me outline <strong>why I used the name “Profile of Revealed Preferences”:</strong></p>
<ul>
<li>The reason marketers want this 	data is, mainly, because they want to know what appeals to you, and 	how strongly you feel about it.</li>
<li>The analytic process often entails 	taking explicit choices you have made, and inferring other 	preferences from them.</li>
<li>The output of the analytic process 	is often one or more “scores” that then get plugged into various 	selection algorithms to determine what you should be shown or 	offered. At least implicitly, these algorithms are predicting what 	you will or won&#8217;t respond well to.</li>
</ul>
<p style="margin-bottom: 0in;">Not much graph-like there.</p>
<p style="margin-bottom: 0in;">This post has gotten pretty long, so I&#8217;ll stop here without spelling anything else out. But questions I still hope to address down the road include:</p>
<ul>
<li>How should Profile of Revealed Preferences data be stored?</li>
<li>Suppose we want to pass around 	derived results and not the raw data. How could we ever get to 	standards that would make such interchange realistic?</li>
<li>If we only have raw data to pass 	around, what are the implications for privacy, liberty, and the 	structure of the online industries?</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/06/08/profile-of-revealed-preferences/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Notes on SciDB and scientific data management</title>
		<link>http://www.dbms2.com/2010/05/22/scidb-and-scientific-database-management/</link>
		<comments>http://www.dbms2.com/2010/05/22/scidb-and-scientific-database-management/#comments</comments>
		<pubDate>Sat, 22 May 2010 08:04:24 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[GIS and geospatial]]></category>
		<category><![CDATA[Microsoft and SQL*Server]]></category>
		<category><![CDATA[SciDB]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2178</guid>
		<description><![CDATA[I firmly believe that, as a community, we should look for ways to support scientific data management and related analytics. That&#8217;s why, for example, I went to XLDB3 in Lyon, France at my own expense. Eight months ago, I wrote about issues in scientific data management. Here&#8217;s some of what has transpired since then.
The main [...]]]></description>
			<content:encoded><![CDATA[<p>I firmly believe that, as a community, we should look for ways to support scientific data management and related analytics. That&#8217;s why, for example, I went to XLDB3 in Lyon, France at my own expense. Eight months ago, I wrote about <a href="http://www.dbms2.com/2009/10/03/issues-in-scientific-data-management/" >issues in scientific data management</a>. Here&#8217;s some of what has transpired since then.</p>
<p>The main new activity I know of has been in the open source <a href="http://www.scidb.org/" onclick="javascript:pageTracker._trackPageview('/www.scidb.org');">SciDB</a> project.   <span id="more-2178"></span></p>
<ul>
<li>A company called Zetics has been started to commercialize SciDB. As of now, the entire staff seems to be CEO Marilyn Matz, techie Paul Brown, and part of Mike Stonebraker. Marilyn says Zetics has some venture capital, but even under NDA didn&#8217;t tell me who it was from. Zetics does not have its own web site.</li>
<li>Marilyn tells me there are 20-25 contributors to SciDB, led by Paul Brown and Mike Stonebraker. Brown is full-time. Persistent Systems has been donating the efforts of a few of its employees. Some <a href="http://www.lsst.org/lsst" onclick="javascript:pageTracker._trackPageview('/www.lsst.org');">LSST</a> folks have been doing SciDB work backed by grant money. Most or all of the rest seem to be purer volunteers. Some Russians have been particularly active.</li>
<li>Release 0.5 of SciDB is expected in June. Release 1.0 is expected in September. This is a rewrite; prior demo code has been scrapped. Perhaps not coincidentally, it&#8217;s also a small slip from prior project plans.</li>
<li>The array data model is an example of what&#8217;s being implemented first. (Duh &#8212; you can&#8217;t have a DBMS without a data model.) Support for uncertainty is an example of what&#8217;s been deferred until later.</li>
<li>As has been clear since XLDB3 last August, one major target market for SciDB is genomic research.</li>
<li>It&#8217;s obvious that the oil and gas industry, with all its geospatial data, should be interested in SciDB. But there&#8217;s not much activity in that regard; outreach is evidently needed. If you can think of somebody in that sector (or anywhere else) who should be alerted to SciDB, please ping them.</li>
<li>Interest from web analytics users in SciDB seems to have receded a bit from the days when eBay almost funded the project.</li>
</ul>
<p>In other scientific data management news,</p>
<ul>
<li>Microsoft put out a book called <a href="http://research.microsoft.com/en-us/collaboration/fourthparadigm/" onclick="javascript:pageTracker._trackPageview('/research.microsoft.com');">The Fourth Paradigm</a> on scientific database management. The whole thing can be downloaded, very officially, as a giant PDF. I think it&#8217;s worth skimming. I don&#8217;t think it&#8217;s worth actually reading. (I did read it.)</li>
<li><a href="http://www-conf.slac.stanford.edu/xldb/" onclick="javascript:pageTracker._trackPageview('/www-conf.slac.stanford.edu');">XLDB4</a> will be at Stanford October 5-7. Unlike prior XLDBs, it will have an open (i.e., no invitation required) part.</li>
</ul>
<p>Finally, you surely are aware of the whole &#8220;Climategate&#8221; mess, in which major climate researchers&#8217; email was hacked and many unkind conclusions were drawn. Well, one of the most technical parts of the disclosure was in a long series of Read Me files, in which an unfortunate programmer lamented <a href="http://di2.nu/foia/HARRY_READ_ME-20.html" onclick="javascript:pageTracker._trackPageview('/di2.nu');">the difficulty of reconstructing published results from files at hand</a>. These turned out to illustrate a classic problem that SciDB or alternatives are meant to solve:</p>
<ul>
<li>Raw data was impossible to use without various adjustments to regularize it (the word &#8220;regridding&#8221; comes up a lot, for example). Massaging was needed before analytics could be done on it.</li>
<li>The raw data was thrown out or lost, and could not be reconstructed. (Why they couldn&#8217;t simply have asked the suppliers of the data to provide it again is unclear, since it wasn&#8217;t original experimental data.)</li>
<li>It was thus impossible to massage the data in any new or improved way.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/05/22/scidb-and-scientific-database-management/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Truviso evidently reinvents itself</title>
		<link>http://www.dbms2.com/2010/05/04/truviso-evidently-reinvents-itself/</link>
		<comments>http://www.dbms2.com/2010/05/04/truviso-evidently-reinvents-itself/#comments</comments>
		<pubDate>Tue, 04 May 2010 19:26:03 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Complex event processing (CEP)]]></category>
		<category><![CDATA[Truviso]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2045</guid>
		<description><![CDATA[When Aleri bought Coral8 last year, I wrote that the independent CEP (Complex Event Processing) vendors were floundering. Aleri quickly threw in the towel and sold out to Sybase, which hardly changed my opinion. StreamBase actually is persevering, but not with any kind of breakout success. Big vendors, such as Microsoft and IBM, have at [...]]]></description>
			<content:encoded><![CDATA[<p>When Aleri bought Coral8 last year, I wrote that <a href="http://www.dbms2.com/2009/03/09/independent-cep-vendors-continue-to-flounder/" >the independent CEP (Complex Event Processing) vendors were floundering</a>. Aleri quickly threw in the towel and <a href="http://www.dbms2.com/2010/02/05/sybase-aleri-rap/" >sold out to Sybase</a>, which hardly changed my opinion. <a href="http://www.dbms2.com/2010/02/16/quick-thoughts-on-the-streambase-component-exchange/" >StreamBase actually is persevering</a>, but not with any kind of breakout success. Big vendors, such as <a href="http://www.dbms2.com/2009/05/13/microsoft-announced-cep-this-week-too/" >Microsoft</a> and <a href="http://www.dbms2.com/2009/05/18/followup-on-ibm-system-sinfosphere-streams/" >IBM</a>, have at least some aspirations of eventually filling the gap.</p>
<p>Meanwhile, Truviso &#8212; which never got much market traction in the first place &#8212; was in hiding; Roman Bukary never did keep his promise to brief me on the company&#8217;s new and improved strategy. Then Truviso had yet another management change, amidst rumors that it was repositioning away from CEP. As per a press release Truviso emailed today, that&#8217;s now official, with Truviso&#8217;s main business being something to do with web analytics.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/05/04/truviso-evidently-reinvents-itself/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Vertica update</title>
		<link>http://www.dbms2.com/2010/04/29/vertica-zynga/</link>
		<comments>http://www.dbms2.com/2010/04/29/vertica-zynga/#comments</comments>
		<pubDate>Fri, 30 Apr 2010 03:44:59 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Games and virtual worlds]]></category>
		<category><![CDATA[Market share]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1973</guid>
		<description><![CDATA[Last month, Vertica&#8217;s CEO Ralph Breslauer quit,* and Vertica made it sound like there would be a new CEO late in April. And indeed, as of April 29, there was. He&#8217;s a guy I&#8217;ve never heard of before named Chris Lynch, apparently quite the sales machine builder. The most substance I&#8217;ve found is a pair [...]]]></description>
			<content:encoded><![CDATA[<p>Last month, <a href="http://www.dbms2.com/2010/03/19/vertica-update-4/" >Vertica&#8217;s CEO Ralph Breslauer</a> quit,* and Vertica made it sound like there would be a new CEO late in April. And indeed, as of April 29, there was. He&#8217;s a guy I&#8217;ve never heard of before named <a href="http://www.vertica.com/company/news/Vertica-appoints-Christopher-Lynch-new-president-and-CEO" onclick="javascript:pageTracker._trackPageview('/www.vertica.com');">Chris Lynch</a>, apparently quite the sales machine builder. The most substance I&#8217;ve found is a pair of <a href="http://www.masshightech.com/stories/2010/04/26/daily40-Vertica-names-Acopia-vet-Lynch-to-CEO-post.html" onclick="javascript:pageTracker._trackPageview('/www.masshightech.com');">Mass High Tech</a> <a href="http://www.masshightech.com/stories/2010/04/26/daily42-New-Vertica-CEO-Lynch-talks-of-plans-to-hire.html" onclick="javascript:pageTracker._trackPageview('/www.masshightech.com');">articles</a> &#8212; the latter exceedingly typo-ridden &#8212; to the general effect that:</p>
<ul>
<li>Vertica plans to build a massive, world-conquering sales force.</li>
<li>If Vertica dips back into negative cash flow to do that and has to raise more venture capital, so be it.</li>
<li>&#8220;Triple-digit&#8221; revenue growth is expected for this year.</li>
</ul>
<p><em><span id="more-1973"></span>*I&#8217;ve since heard more both from Ralph and his former colleagues, and I&#8217;m comfortable taking the move more or less at face value &#8212; for some reasons he doesn&#8217;t want to spell out, Ralph really wanted to move back home to South Africa.</em></p>
<p>While they were at it, Vertica also put out a press release reporting very good <a href="http://www.vertica.com/company/news/worlds-top-social-gaming-companies-tap-Vertica" onclick="javascript:pageTracker._trackPageview('/www.vertica.com');">success in the social gaming market</a>. The biggest and best known of the bunch is Zynga. Three months ago, <a href="http://tdwi.org/Blogs/WayneEckerson/2010/02/Zynga.aspx" onclick="javascript:pageTracker._trackPageview('/tdwi.org');">Wayne Eckerson</a> had figures of 3 TB/day added to the database, 200 nodes, and &gt;40 million users. Now Zynga is using a figure of &gt;65 million daily users and 230 nodes. More precisely, at Zynga:</p>
<ul>
<li>There are two Vertica databases with identical data.</li>
<li>Each Zynga Vertica database runs on 115 nodes.</li>
<li>Zynga&#8217;s two Vertica database clusters are used for different applications.</li>
<li>It&#8217;s undisclosed exactly what Zynga runs on which Vertica cluster. But best practice would be to put mission-critical, fast-response work on one cluster and use the other for longer-running or less-critical queries, while also keeping it available as a hot standby. I say that because I don&#8217;t see much reason to put data geographically close to users around the world for reasons of latency or whatever.</li>
<li>The exact daily volume is undisclosed, but all of what Wayne earlier estimated at 3 TB per day is added to each of Zynga&#8217;s Vertica databases.</li>
</ul>
<p>In other news, Vertica now states its customer count as being &gt;130.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/04/29/vertica-zynga/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Examples of machine-generated data</title>
		<link>http://www.dbms2.com/2010/04/08/machine-generated-data-example/</link>
		<comments>http://www.dbms2.com/2010/04/08/machine-generated-data-example/#comments</comments>
		<pubDate>Thu, 08 Apr 2010 19:20:18 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Games and virtual worlds]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1868</guid>
		<description><![CDATA[Not long ago I pointed out that much future Big Data growth will be in the area of machine-generated data, examples of which include:

Computer, network, and other equipment logs
Satellite and similar telemetry (whether for espionage or science)
Location data such as RFID chip readings, GPS system output, etc.
Temperature and other environmental sensor readings
Sensor readings from factories, [...]]]></description>
			<content:encoded><![CDATA[<p>Not long ago I pointed out that much future Big Data growth will be in the area of <a href="http://www.dbms2.com/2010/01/17/three-broad-categories-of-data/" >machine-generated data</a>, examples of which include:<span id="more-1868"></span></p>
<ul>
<li>Computer, network, and other equipment logs</li>
<li>Satellite and similar telemetry (whether for espionage or science)</li>
<li>Location data such as RFID chip readings, GPS system output, etc.</li>
<li>Temperature and other environmental sensor readings</li>
<li>Sensor readings from factories, pipelines, etc.</li>
<li>Output from many kinds of medical devices, in hospitals and (increasingly) homes alike</li>
</ul>
<p>The core idea here is that human-generated data can grow only as fast as human data-generating activities allow it to, but machine-generated data is limited only by capital budgets and Moore&#8217;s Law.  So <strong>machines&#8217; ability to generate data is growing a lot faster than humans&#8217;.</strong></p>
<p>Up to this point, I think there&#8217;s broad agreement, at least on the part of anybody who&#8217;s thought about it this way for very long. But that still leaves open questions as to which kinds of <strong>machine-generated data will matter first.</strong> The big five that matter right now are:</p>
<ul>
<li><strong>Web logs</strong> (partially machine-generated, but tied to human actions)</li>
<li><strong>Call detail records</strong> (CDRs &#8212; ditto)</li>
<li><strong>Financial instrument trades</strong> (some purely machine-generated, some human-based)</li>
<li><strong>Network event logs</strong> (commonly associated with web logs)</li>
<li><strong>Telemetry</strong> collected by the government (especially for intelligence purposes)</li>
</ul>
<p>A large fraction of all the 100 TB+ or petabyte+ data warehouse activity I know of falls into those areas.</p>
<p>Following along quickly are:</p>
<ul>
<li><strong>Online game data</strong>. Since late last year, online game companies have come up over and over again as an important category of data warehousing/analytics users. Like most of the categories above, the gaming area actually features a hybrid between human- and machine-generated data.</li>
<li><strong>Genetic research data,</strong> although I don&#8217;t know to what extent the investment in data gathering is concentrated among the few obvious big pharmaceutical companies. Other health care data (research or clinical) will come along too, but doesn&#8217;t seem to be there yet.</li>
</ul>
<p>Until recently I would have added:</p>
<ul>
<li><strong>Energy exploration, energy production, energy refining, and/or utility network data</strong></li>
</ul>
<p>But while those areas seemed poised to get hot last year, I haven&#8217;t heard much about them the past few months, with a few exceptions:</p>
<ul>
<li>Accenture&#8217;s observation that new smart grids will generate <a href="http://www.b-eye-network.co.uk/view/13007" onclick="javascript:pageTracker._trackPageview('/www.b-eye-network.co.uk');">up to eight orders of magnitude more data</a> than old dumb grids do</li>
<li>The recent article about <a href="http://money.cnn.com/2010/03/26/news/companies/terralliance_tech_full.fortune/index.htm" onclick="javascript:pageTracker._trackPageview('/money.cnn.com');">the Terralliance fiasco</a> (new kinds of oil exploration analytics, going beyond seismological data)</li>
<li>Lots of concern about security flaws in utility smart grids</li>
</ul>
<p>Finally, I&#8217;ve been assuming that a big area going forward is <strong>location data,</strong> especially <strong>personal movement data.</strong> The data volumes involved could be similar to or even greater than those of CDRs. But <a href="http://www.dbms2.com/2010/04/04/privacy-liberty-continued/" >privacy</a> concerns with that are obviously immense. (Of course, in the case of Foursquare, this sort of overlaps with freely-shared game data.)</p>
<p>If you want to make all this more tangible in your mind, one area to look for ideas is in the huge amount of news about various kinds of innovative sensors. Sources include:</p>
<ul>
<li>Somebody named Landon Cox, who maintains a couple of feeds of <a href="http://webpartner.com/SensorStuff" onclick="javascript:pageTracker._trackPageview('/webpartner.com');">sensor</a> news.</li>
<li>A <a href="http://twitter.com/SensorsExpo" onclick="javascript:pageTracker._trackPageview('/twitter.com');">Twitter feed</a>, apparently associated with a Sensor Expo.</li>
<li>Another Twitter feed, this one from <a href="https://twitter.com/spaughts" onclick="javascript:pageTracker._trackPageview('/twitter.com');">Sun Labs</a>. (I have no idea what Oracle is or isn&#8217;t doing with the <a href="http://www.sunspotworld.com/" onclick="javascript:pageTracker._trackPageview('/www.sunspotworld.com');">Sun SPOT</a> project that links to.)</li>
<li>Yet another <a href="http://twitter.com/measurement" onclick="javascript:pageTracker._trackPageview('/twitter.com');">Twitter feed</a>.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/04/08/machine-generated-data-example/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Notes on the evolution of OLTP database management systems</title>
		<link>http://www.dbms2.com/2010/04/05/oltp-database-management-systems-2/</link>
		<comments>http://www.dbms2.com/2010/04/05/oltp-database-management-systems-2/#comments</comments>
		<pubDate>Mon, 05 Apr 2010 08:22:03 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Akiban]]></category>
		<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EnterpriseDB and Postgres Plus]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[Market share]]></category>
		<category><![CDATA[Memory-centric data management]]></category>
		<category><![CDATA[Mid-range]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[VoltDB and H-Store]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1841</guid>
		<description><![CDATA[The past few years have seen a spate of startups in the analytic DBMS business. Netezza, Vertica, Greenplum, Aster Data and others are all reasonably prosperous, alongside older specialty product vendors Teradata and Sybase (the Sybase IQ part).  OLTP (OnLine Transaction Processing) and general purpose DBMS startups, however, have not yet done as well, with [...]]]></description>
			<content:encoded><![CDATA[<p>The past few years have seen a spate of startups in the analytic DBMS business. Netezza, Vertica, Greenplum, Aster Data and others are all reasonably prosperous, alongside older specialty product vendors Teradata and Sybase (the Sybase IQ part).  OLTP <span style="font-weight: normal;">(OnLine Transaction Processing) </span>and general purpose DBMS startups, however, have not yet done as well, with such success as there has been (MySQL, Intersystems Cache&#8217;, solidDB&#8217;s exit, etc.) generally accruing to products that originated in the 20th Century.</p>
<p>Nonetheless, OLTP/general-purpose data management startup activity has recently picked up, targeting what I see as some very real opportunities and needs. So as a jumping-off point for further writing, I thought it might be interesting to collect a few observations about the market in one place.  These include:</p>
<ul>
<li><span style="font-weight: normal;">Big-brand OLTP/general-purpose DBMS have more “stickiness” than analytic DBMS.</span></li>
<li><span style="font-weight: normal;">By number, most of an enterprise&#8217;s OLTP/general-purpose databases are low-volume and low-value. </span></li>
<li>Most interesting new OLTP/general-purpose data management products are <span style="font-style: normal;">either MySQL-based or NoSQL.</span></li>
<li>It&#8217;s not yet clear whether MySQL will prevail over MySQL forks, or vice-versa, or whether they will co-exist.</li>
<li>The era of silicon-centric relational DBMS is coming.</li>
<li>The emphasis on scale-out and reducing the cost of joins spans the NoSQL and SQL-based worlds.<em> </em></li>
<li><span style="font-weight: normal;">Users&#8217; insistence on “free” could be a major problem for OLTP DBMS innovation. </span></li>
</ul>
<p style="margin-bottom: 0in;">I shall explain.<span id="more-1841"></span></p>
<p style="margin-bottom: 0in;"><strong>Big-brand OLTP/general-purpose DBMS have more “stickiness” than analytic DBMS.</strong></p>
<ul>
<li>OLTP applications are more complex than analytic ones, and hence more tightly wired into particular brands of DBMS. For example, third-party packaged OLTP applications are typically portable among only a few brands of DBMS. But third-party business intelligence tools, and the BI “applications” built in them, are more easily and widely portable.</li>
<li>Specific technical observations such as “OLTP apps tend to use stored procedures, which are DBMS-specific” or “OLTP apps tend to have lots and lots of tables” serve to underscore the first point.</li>
<li>An enterprise&#8217;s highest-value data is commonly the financial stuff handled by its core OLTP systems, so those are the last things they want to mess around with just to get some cost savings. Security, high availability, and so on are major considerations that can outweigh cost.</li>
</ul>
<p style="margin-bottom: 0in;"><strong>By number, most of an enterprise&#8217;s OLTP/general-purpose databases are low-volume and low-value. </strong>Indeed, “OLTP” is often a misnomer, which is why I tend to go with “general-purpose” or some similarly wishy-washy phrase instead.</p>
<ul>
<li>In theory, this is a ripe area for what I&#8217;ve called <a href="http://www.dbms2.com/category/database-management-system/mid-range/" >mid-range DBMS</a>.</li>
<li>The big brand vendors try hard to keep as many of those databases for themselves as they can. Enterprise-wide license pricing helps. Going forward, so will virtualization/consolidation strategies, such as <a href="http://www.dbms2.com/2010/01/22/oracle-database-hardware-strategy/" >Oracle&#8217;s Exadata-centric approach</a>.</li>
<li>A variety of mid-range DBMS alternatives beyond the big brands have technical merit, at least in some cases and configurations – MySQL, PostgreSQL, Intersystems Cache&#8217;, and so on.</li>
<li>The only such mid-range DBMS alternative with much large enterprise business momentum, however, appears to be MySQL.</li>
</ul>
<p style="margin-bottom: 0in;"><strong>&#8220;General-purpose&#8221; might be a better term than &#8220;OLTP&#8221; anyway.</strong></p>
<ul>
<li>I don&#8217;t have a link, but it&#8217;s widely agreed that over half of the processing on an &#8220;OLTP&#8221; enterprise app is commonly reporting and so on.</li>
<li>&#8220;Operational BI&#8221; is progressing by fits and starts, but it is progressing.</li>
<li>Anything customer-facing &#8212; web-based, call center, or otherwise &#8212; is likely to include a heavy dose of &#8220;real-time&#8221; analytic optimization.</li>
</ul>
<p style="margin-bottom: 0in;"><strong>Most interesting new OLTP/general-purpose data management products are <span style="font-style: normal;">either MySQL-based or NoSQL.</span></strong></p>
<ul>
<li><a href="http://www.dbms2.com/2009/06/22/h-store-horizontica-voltdb/" >VoltDB</a> is the main exception that jumps to mind.</li>
<li>This isn&#8217;t true in the analytic DBMS area, where Netezza, Greenplum, Aster, Vertica and others started from PostgreSQL&#8217;s code, APIs, or both.</li>
</ul>
<p style="margin-bottom: 0in;"><strong>It&#8217;s not yet clear whether MySQL will prevail over MySQL forks, or vice-versa, or whether they will co-exist.</strong></p>
<ul>
<li>MySQL is a limited product without all the third-party storage engines that are being developed.</li>
<li><a href="http://www.dbms2.com/2009/12/14/oracle-mysql-storage-engine/" >Oracle&#8217;s promise of MySQL good behavior</a> has an expiration date.</li>
<li>None of the MySQL front-end alternatives are remotely mature yet.</li>
</ul>
<p style="margin-bottom: 0in;"><strong>The era of silicon-centric relational DBMS is coming.</strong></p>
<ul>
<li>I think “silicon” means “solid-state memory” as much as or more than it means “RAM,” but that&#8217;s not yet certain.</li>
<li>What is pretty certain is that, thanks to Moore&#8217;s Law, some kind of silicon will increasingly replace disk.</li>
<li><a href="http://www.dbms2.com/2010/01/22/oracle-database-hardware-strategy/" >Oracle&#8217;s increasingly Flash-centric story</a> is a challenge to everybody.</li>
<li>RAM-centric VoltDB will launch fairly soon. (By the way, while VoltDB still has <a href="http://www.dbms2.com/2009/06/22/h-store-horizontica-voltdb/" >a lot in common with H-Store</a>, they&#8217;re not exactly the same thing. And <a href="http://bit.ly/9QxjV2." onclick="javascript:pageTracker._trackPageview('/bit.ly');">H-Store research</a> is progressing too.)</li>
<li><span style="font-style: normal;"><a href="http://rethinkdb.com/" onclick="javascript:pageTracker._trackPageview('/rethinkdb.com');">RethinkDB</a> is being de</span>veloped, focused directly on solid-state memory. Based on the sparse information available online, RethinkDB sounds somewhat like a dumbed-down H-Store.</li>
<li>New disk-based vendors may never optimize their use of disk, instead targeting a solid-state future. (E.g., I think Akiban should and quite well might follow this path.)</li>
</ul>
<p style="margin-bottom: 0in; font-weight: normal;"><strong>The emphasis on scale-out and reducing the cost of joins spans the NoSQL and SQL-based worlds.</strong> We hear that from the <a href="http://www.dbms2.com/2010/03/14/nosql-taxonomy/" >NoSQL</a> guys all the time. But I also just heard it from <a href="http://www.dbms2.com/2010/04/03/akiban-highlights/" >Akiban</a>.</p>
<p style="margin-bottom: 0in;"><strong>Users&#8217; instance on “free” could be a major problem for OLTP DBMS innovation.</strong> Vendors of new OLTP data management technologies often feel obligated to open source their products, notwithstanding the historical lack of revenue in the open source OLTP DBMS market. As just one of many examples,  <a href="http://www.novaspivack.com/uncategorized/evri-ties-the-knot-with-twine" onclick="javascript:pageTracker._trackPageview('/www.novaspivack.com');">Nova Spivack</a> wrote:</p>
<blockquote>
<p style="margin-bottom: 0in;">I have recently seen some new graph data storage products that may provide the levels of scale and performance needed, but pricing has not been determined yet. In short, storage and retrieval of semantic graph datasets is a big unsolved challenge that is holding back the entire industry. We need federated database systems that can handle hundreds of billions to trillions of triples under high load conditions, in the cloud, on commodity hardware and open source software. Only then will it be affordable to make semantic applications and services at Web-scale.</p>
</blockquote>
<p style="margin-bottom: 0in;">I hear similar things from other startups, who evidently believe they need and/or are entitled to enjoy sophisticated, high-performance, zero-cost, specialized database management technology.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/04/05/oltp-database-management-systems-2/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>The retention of everything</title>
		<link>http://www.dbms2.com/2010/04/04/the-retention-of-everything/</link>
		<comments>http://www.dbms2.com/2010/04/04/the-retention-of-everything/#comments</comments>
		<pubDate>Sun, 04 Apr 2010 07:25:37 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Archiving and information preservation]]></category>
		<category><![CDATA[Liberty and privacy]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1833</guid>
		<description><![CDATA[I&#8217;d like to reemphasize a point I&#8217;ve been making for a while about data retention:

As costs go down, the wisdom of keeping detailed data goes up. I’d go so far as to say that every piece of data generated by a human being should be preserved and kept online, legal and privacy considerations permitting.* Most [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;d like to reemphasize a point I&#8217;ve been making for a while about <a href="http://www.dbms2.com/2009/12/07/data-warehouse-volume-growth/" >data retention</a>:<span id="more-1833"></span></p>
<blockquote>
<p style="margin-bottom: 0in;">As costs go down, the wisdom of keeping <strong>detailed data</strong> goes up. I’d go so far as to say that <strong>every piece of data generated by a human being should be preserved and kept online,</strong> legal and privacy considerations permitting.* Most forms of capital-, labor-, and/or location-based competitive advantage are being commoditized and/or globalized away. But information remains a unique corporate asset. Don’t discard it lightly.</p>
<p style="margin-bottom: 0in;"><em>*Unless there’s an explicit law mandating data destruction, legal considerations </em>should <em>permit. The idea “Let’s destroy something of irreplaceable value today, against the possibility we might be brought to judgment tomorrow” is both morally and pragmatically weird. Privacy, however, may be a different matter.</em></p>
</blockquote>
<p style="margin-bottom: 0in;">That applies to the structured/tabular kinds of data I tend to focus on in this blog. It applies even more to anything that&#8217;s like a document (or email, instant message, whatever) somebody has taken the trouble to place into words.  A top document-oriented archiving analyst (and my good friend), David Ferris, <a href="http://www.ferris.com/2008/04/02/expect-to-archive-everything/" onclick="javascript:pageTracker._trackPageview('/www.ferris.com');">quite</a> <a href="http://www.ferris.com/2008/03/30/you-dear-reader-are-immortal/" onclick="javascript:pageTracker._trackPageview('/www.ferris.com');">agrees</a>. As David puts it:</p>
<blockquote><p>I think we’ll end up archiving everything, except egregious garbage like spam:</p>
<ul>
<li> It’s too hard to get users to conform to policy.</li>
<li> Automated methods of capturing a human-understandable policy, for example “tax records,” are too hard to implement through automatic filters. The filters are too inaccurate.</li>
<li> It’s impractical to get users to classify everything, and automatic classification is too crude.</li>
<li> You never know what you might want later. Stuff you think you won’t want now may end up being very useful.</li>
<li> The cost of storage is trivial when looked at on a per-user basis.</li>
</ul>
</blockquote>
<p>In particular, I think information destruction is a crude instrument for the protection of privacy, wasteful at best, and likely to be vigorously resisted by governments and large businesses.  For example:</p>
<ul>
<li>Businesses are increasingly subject to retention-oriented compliance regulation. Your lawyers may want you to destroy information that could be used to sue you, but governments won&#8217;t let you.</li>
<li>Information about individuals&#8217; web surfing is being retained, under law, so that they may be fingered later for pornography consumption or illegal file sharing. I deplore some of the ways <a href="http://www.monashreport.com/2006/06/06/freedom-even-without-data-privacy/" onclick="javascript:pageTracker._trackPageview('/www.monashreport.com');">web-surfing data can be and is being used</a>, and want <a href="http://www.dbms2.com/2010/04/04/privacy-liberty-continued/" >laws passed to rein them in</a>. But the retention will happen.</li>
<li>Marketers want all that data. Duh.</li>
<li><a href="http://blogs.computerworld.com/node/1002" onclick="javascript:pageTracker._trackPageview('/blogs.computerworld.com');">Electronic health records</a> are coming &#8212; slowly, but they&#8217;ll get here some day.</li>
</ul>
<p>Besides, <a href="http://www.dbms2.com/category/database-management-system/archiving-and-information-preservation/" >archiving technologies</a> are getting ever more cost-effective.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/04/04/the-retention-of-everything/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>
