<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; RDF and graphs</title>
	<atom:link href="http://www.dbms2.com/category/datatype/rdf-graph-database/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Thu, 02 Sep 2010 09:06:44 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Big Data is Watching You!</title>
		<link>http://www.dbms2.com/2010/08/11/big-data-is-watching-you/</link>
		<comments>http://www.dbms2.com/2010/08/11/big-data-is-watching-you/#comments</comments>
		<pubDate>Wed, 11 Aug 2010 05:30:22 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2760</guid>
		<description><![CDATA[There&#8217;s a boom in large-scale analytics. The subjects of this analysis may be categorized as:

People
Financial trades
Electronic networks
Everything else

The most varied, interesting, and valuable of those four categories is the first one.

That may change some day, with the growing importance of machine-generated data, and of big-data science in particular. But I think it&#8217;s a fair assessment [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">There&#8217;s a boom in large-scale analytics. The subjects of this analysis may be categorized as:</p>
<ul>
<li>People</li>
<li>Financial trades</li>
<li>Electronic networks</li>
<li>Everything else</li>
</ul>
<p style="margin-bottom: 0in;">The most varied, interesting, and valuable of those four categories is the first one.</p>
<p><span id="more-2760"></span></p>
<p style="margin-bottom: 0in;"><em>That may change some day, with the growing importance of <a href="http://www.dbms2.com/2010/04/08/machine-generated-data-example/" >machine-generated data</a>, and of <a href="http://www.dbms2.com/2009/10/03/issues-in-scientific-data-management/" >big-data science</a> in particular. But I think it&#8217;s a fair assessment at present, and for at least the next few years.</em></p>
<p style="margin-bottom: 0in;">Some of the most interesting use cases are concentrated in the areas of identifying individuals, groups of people, or behaviors of (groups of) people. For example:</p>
<ul>
<li>comScore works hard to <strong>identify individual web surfers</strong>, i.e. to <strong>deanonymize</strong> them, even though they may have given incomplete or false personal information.</li>
<li>Other companies at least try to 	figure out <strong>which information in a user&#8217;s profile is unreliable,</strong> so as to classify them better. (Yes, there are 62-year-old 	video-game-obsessed Lady Gaga fans, but that&#8217;s generally not the way 	to bet.)</li>
<li>Multiple telecom vendors try to 	identify who their <strong>most influential customers</strong> are (to a first 	approximation, they&#8217;re the ones most often called by the most 	people, but it surely gets more sophisticated than that). This 	information is then used to reduce churn, either by working hard to 	retain those users, or – if they do churn – to move very fast to 	retain the business from their friends.</li>
<li>Other kinds of companies do 	similar kinds of analysis, to the extent that they have enough of a 	social graph to do so. (This application is a case where the term 	“<a href="http://www.dbms2.com/2010/06/08/profile-of-revealed-preferences/" >social graph</a>” is not a misnomer.)</li>
<li><strong>Turing detectives</strong> (I just 	coined that phrase) try to determine whether users are humans or 	bots.</li>
<li>Central to detecting <strong>insurance 	fraud</strong> is identifying suspiciously close connections between 	claimants, service providers, and so on.</li>
<li>Identifying groups of people is 	also important in flagging <strong>insider trading.</strong><span style="font-weight: normal;"> Even more important are other kinds of analysis, along the lines of 	“is this normal innocent trading behavior?” </span></li>
<li><span style="font-weight: normal;">Intelligence 	agencies try to detect networks of </span><strong>terrorists</strong><span style="font-weight: normal;"> and their sympathizers. They further try to identify unusual 	patterns of communication or meetings along those networks that 	might indicate terrorist acts are being planned. (Civilian law 	enforcement agencies can use similar techniques.)</span></li>
</ul>
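<p>The &#8220;first approximation&#8221; for influential telecom customers can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout and the function name <code>rank_by_distinct_callers</code> are hypothetical, and real churn models are surely more sophisticated.</p>
<pre>
```python
from collections import defaultdict


def rank_by_distinct_callers(call_records):
    """Rank customers by how many distinct people called them."""
    callers_of = defaultdict(set)
    for caller, callee in call_records:
        if caller != callee:
            callers_of[callee].add(caller)
    # Sort by distinct-caller count (descending), then by name for stable output.
    return sorted(
        ((callee, len(callers)) for callee, callers in callers_of.items()),
        key=lambda pair: (-pair[1], pair[0]),
    )


# Hypothetical call-detail records: (caller, callee).
calls = [
    ("ann", "bob"), ("cat", "bob"), ("dan", "bob"),
    ("ann", "cat"), ("ann", "cat"), ("bob", "cat"),
    ("bob", "dan"),
]
ranking = rank_by_distinct_callers(calls)
print(ranking)
```
</pre>
<p>Counting distinct callers rather than raw calls keeps one chatty friend from making somebody look influential.</p>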
<p style="margin-bottom: 0in; font-weight: normal;">In most cases, the analysis and/or run-time execution of the relevant models is done with the help of analytic DBMS. Other technologies that come into play include non-DBMS MapReduce (Hadoop), graph engines, and CEP (Complex Event Processing). The vendor most heavily represented on that list is probably Aster Data, because:</p>
<ul>
<li>Aster Data is 	focused on hard-core analytics.</li>
<li>I talk a lot 	with Aster Data, and in particular had a long, detailed use-cases 	discussion with them last week.</li>
<li><span style="font-weight: normal;">The 	comScore example happens to come from a speaker at </span><a href="http://www.dbms2.com/2010/05/07/implications-onew-analytic-technology/" ><span style="font-weight: normal;">an 	Aster event</span></a><span style="font-weight: normal;"> I also 	participated in.</span></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-weight: normal;">And by the way, all this only scratches the surface of what will be possible down the road. It&#8217;s based mainly on where you live, what you purchase, how you behave on websites, and who you communicate with. </span><span style="color: #000080;"><span lang="zxx"><span style="text-decoration: underline;"><a href="../2010/07/04/fair-data-use/"><span style="font-weight: normal;">Other kinds of data, which could be used to be yet more intrusive</span></a></span></span></span><span style="font-weight: normal;">, generally aren&#8217;t involved.</span></p>
<p style="margin-bottom: 0in;"><span style="font-weight: normal;">I actually have two points in drawing up this list. One is golly-gee-whiz about how a lot of analytically sophisticated applications are actually getting into production. The other is to highlight the privacy and liberty threats If This Goes On Unchecked (which is why I didn&#8217;t include some other less-people-focused examples). There&#8217;s also a related danger that, to the extent we don&#8217;t get some smart regulations to keep us safe(r), we&#8217;ll get a bunch of stupid regulations instead. </span></p>
<p style="margin-bottom: 0in;"><span style="font-weight: normal;">The Analytic Era has only just begun.<br />
</span></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/08/11/big-data-is-watching-you/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Objectivity Infinite Graph</title>
		<link>http://www.dbms2.com/2010/06/19/objectivity-infinite-graph/</link>
		<comments>http://www.dbms2.com/2010/06/19/objectivity-infinite-graph/#comments</comments>
		<pubDate>Sat, 19 Jun 2010 12:05:45 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Liberty and privacy]]></category>
		<category><![CDATA[Object]]></category>
		<category><![CDATA[Objectivity and Infinite Graph]]></category>
		<category><![CDATA[RDF and graphs]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2306</guid>
		<description><![CDATA[I chatted Wednesday night with Darren Wood, the Australia-based lead developer of Objectivity&#8217;s Infinite Graph database product. Background includes:

Objectivity is a profitable, 	decades-old object-oriented DBMS vendor with about 50 employees.
Like some other object-oriented 	DBMS of its generation, Objectivity is as much a toolkit for 	building DBMS as it is a real finished DBMS product. Objectivity [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">I chatted Wednesday night with Darren Wood, the Australia-based lead developer of Objectivity&#8217;s Infinite Graph database product. Background includes:</p>
<ul>
<li>Objectivity is a profitable, 	decades-old object-oriented DBMS vendor with about 50 employees.</li>
<li>Like some other object-oriented 	DBMS of its generation, Objectivity is as much a toolkit for 	building DBMS as it is a real finished DBMS product. Objectivity 	sales are typically for custom deals, where Objectivity helps with 	the programming.</li>
<li>The way Objectivity works is 	basically:
<ul>
<li>You manage objects in memory, in 	the format of your choice.</li>
<li>Objectivity bangs them to disk, 	across a network.</li>
<li>Objectivity manages the 	(distributed) pointers to the objects.</li>
<li>You can, if you choose, hard code 	exactly which objects are banged to which node.</li>
<li>Objectivity&#8217;s DML for reading data 	is very different from Objectivity&#8217;s DML for writing data. (I think 	the latter is more like the program code itself, while the former is 	more like regular DML.)</li>
<li>The point of Objectivity is not so 	much to have fast I/O. Rather, it is to minimize the CPU cost of 	getting the data that comes across the wire into useful form.</li>
</ul>
</li>
<li>Darren got the idea of putting a 	generic graph DBMS front-end on Objectivity while doing a 	<a href="http://www.dbms2.com/2009/08/21/social-network-analysis-aka-relationship-analytics/" >relationship analytics</a> project for an Australian intelligence 	agency.</li>
<li>Darren redoubled his efforts to sell the project internally at Objectivity after reading what I wrote about relationship analytics back in 2006 or so.</li>
<li>There is now a 5 or so person team 	developing Infinite Graph.</li>
<li>Infinite Graph is just now going 	out to beta test.</li>
</ul>
<p style="margin-bottom: 0in;"><a href="http://www.infinitegraph.com/" onclick="javascript:pageTracker._trackPageview('/www.infinitegraph.com');">Infinite Graph</a> is an API or language binding on top of Objectivity that:</p>
<ul>
<li>Hides a lot of Objectivity&#8217;s 	complexity.</li>
<li>Is suitable for graph/relationship 	analytics.</li>
</ul>
<p style="margin-bottom: 0in;"><span id="more-2306"></span>The main point of the Infinite Graph beta test is to see whether Objectivity got the API right. By way of contrast, Objectivity is still just researching the DBMS optimization side of things. According to Darren, what makes that so hard is that if you partition the graph in some smart way, probably through some kind of costly algorithm to determine “least connectedness,” a bit more additional data can thoroughly invalidate your results. Thus, Darren is focused more on ensuring that performance is good even if data is distributed around the network in annoying ways.</p>
<p style="margin-bottom: 0in;">One performance win that Infinite Graph seems to get (almost?) for free from being built on top of Objectivity is lots of prefetching. Specifically, graph nodes and their edges are stored together, just like objects and their pointers are in traditional Objectivity &#8212; and if a node is retrieved, the nodes it&#8217;s connected to might also get retrieved as a background operation, before they&#8217;re even needed. More generally, Objectivity has always tried to be fast about traversing pointers, and that is a whole lot like traversing graph edges.</p>
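<p>A toy sketch of that prefetching idea follows. It is not Objectivity&#8217;s or Infinite Graph&#8217;s actual API; the class <code>PrefetchingGraphStore</code> and its record layout are assumptions made up for illustration, and the prefetch runs inline rather than as a true background operation.</p>
<pre>
```python
class PrefetchingGraphStore:
    """Toy store: each record holds a node's data plus its outgoing edges."""

    def __init__(self, records):
        self._disk = records   # stands in for persistent storage
        self.cache = {}        # nodes already loaded into memory
        self.disk_reads = 0

    def _load(self, node_id):
        if node_id not in self.cache:
            self.disk_reads += 1
            self.cache[node_id] = self._disk[node_id]
        return self.cache[node_id]

    def get(self, node_id):
        record = self._load(node_id)
        # Edges live in the same record, so neighbor ids are known at once.
        # A real system might prefetch in the background; here it is inline.
        for neighbor_id in record["edges"]:
            self._load(neighbor_id)
        return record


records = {
    "a": {"data": "Alice", "edges": ["b", "c"]},
    "b": {"data": "Bob", "edges": ["c"]},
    "c": {"data": "Carol", "edges": []},
}
store = PrefetchingGraphStore(records)
store.get("a")
reads_after_first_get = store.disk_reads
store.get("b")  # b and its neighbor c are already cached
reads_after_second_get = store.disk_reads
```
</pre>
<p>Because the edges travel with the node, the second lookup costs no further reads in this sketch.</p>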
<p style="margin-bottom: 0in;">Looking ahead, Infinite Graph is considering ideas from <a href="http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html" onclick="javascript:pageTracker._trackPageview('/googleresearch.blogspot.com');">Google&#8217;s Pregel</a>. As Darren characterizes it, in Pregel you wrap up information about a graph node and ship it off to another computing node if the next graph node you need is over there. Darren suspects that the extreme form of this strategy would not be ideal. (I gather from Darren that Google realized the same thing from the get-go.) Instead, he&#8217;s pinning his hopes more on smarts about when to do that (costly) shipping, and when to just fetch the information back to the compute node currently being used.</p>
<p style="margin-bottom: 0in;">The most interesting part of our discussion, in my opinion, was about applications and application functionality. In a nutshell, Darren seems to think that it&#8217;s all about the edges, rather than the nodes themselves. (My words, not his.) In particular:</p>
<ul>
<li><strong>Edges are first-class citizens</strong> in Infinite Graph, just as nodes are.</li>
<li><strong>Graphs typically are polluted 	with lots of insignificant edges.</strong> Examples include:
<ul>
<li>If you&#8217;re tracking people&#8217;s telephone traffic, lots of folks call the local pizza parlor. Indeed, it&#8217;s common to look for “star” nodes like that, which have very high connectivity, and excise them from the graph to reduce noise.</li>
<li>Many measures of relationship 	include minor relationships. Facebook friends? LinkedIn connections? 	Occasional phone calls? Next door neighbors? All of those can 	indicate very minor relationships.</li>
</ul>
</li>
<li><span style="font-weight: normal;">Therefore, 	in Infinite Graph, </span><strong>edges (can) have weights.</strong> Darren 	says this is a widely-used capability in graph applications. The 	core reason is to let you distinguish between significant and 	insignificant edges. Note that these weights can be calculated based 	on the raw data and stored back into the database.</li>
<li><span style="font-weight: normal;">In 	Infinite Graph, </span><strong>edges can also have effectiveness date 	intervals.</strong> E.g., if you live at an address for a certain period, 	that&#8217;s when the edge connecting you to it is valid.</li>
<li>In general in Infinite Graph, 	<strong>edges can carry</strong><span style="font-weight: normal;"> arbitrary 	or at least flexible </span><strong>“qualifier”/attribute 	information.</strong></li>
<li><strong>For many applications, the 	number of possible nodes is fundamentally limited. </strong><span style="font-weight: normal;">There 	are only so many people in the world, so many street addresses, so 	many telephone numbers, and so on. (There was a time this wasn&#8217;t 	believed to be the case, because timestamping was done at the node 	rather than edge level. But I find persuasive Darren&#8217;s argument that 	it works better on edges.) <em>Edit: Even so, <a href="http://www.theregister.co.uk/2010/05/19/darpa_smite/" onclick="javascript:pageTracker._trackPageview('/www.theregister.co.uk');">DARPA is thinking in the billions-of-nodes range</a>.</em><br />
</span></li>
<li><span style="font-weight: normal;">Darren 	is in general agreement with my observation that </span><a href="http://www.dbms2.com/2010/06/08/profile-of-revealed-preferences/" ><span style="font-weight: normal;">the 	“social graph” shouldn&#8217;t primarily be regarded as a graph</span></a><span style="font-weight: normal;">.</span></li>
<li><span style="font-weight: normal;">Yes, 	the paradigmatic examples of intelligence agency graph analytics are 	telephone or even IP traffic analysis. Nodes can wind up with lots 	of edges connecting them. Full analysis of the graphs exceeds even 	the computing capacity available to governments.</span></li>
<li><span style="font-weight: normal;">On a happy civil liberties note, Darren observed that Australian intelligence has a lot of red tape restricting them from getting this kind of information. Basically, they can only get chunks of information “on demand”. An awkward side effect of this is that when they do get it, it could be in any number of formats.</span></li>
</ul>
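<p>The edge-centric ideas above (weights, effectiveness date intervals, and excising “star” nodes) can be sketched together in Python. None of this is Infinite Graph&#8217;s API; the <code>Edge</code> class, the helper functions, and the thresholds are all hypothetical choices for illustration.</p>
<pre>
```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    weight: float
    valid_from: date
    valid_to: date

    def active_on(self, day):
        """True if the edge's effectiveness interval covers the given day."""
        return day >= self.valid_from and self.valid_to >= day


def prune_star_nodes(edges, max_degree):
    """Drop every edge touching a node whose degree exceeds max_degree."""
    degree = Counter()
    for edge in edges:
        degree[edge.source] += 1
        degree[edge.target] += 1
    stars = {node for node, count in degree.items() if count > max_degree}
    return [e for e in edges if e.source not in stars and e.target not in stars]


def significant_edges(edges, day, min_weight):
    """Keep edges that are valid on the given day and heavy enough to matter."""
    return [e for e in edges if e.active_on(day) and e.weight >= min_weight]


year = (date(2010, 1, 1), date(2010, 12, 31))
edges = [
    Edge("ann", "pizza", 0.1, *year),
    Edge("bob", "pizza", 0.1, *year),
    Edge("cat", "pizza", 0.1, *year),
    Edge("dan", "pizza", 0.1, *year),
    Edge("ann", "bob", 0.9, *year),
    Edge("bob", "cat", 0.2, *year),
    Edge("ann", "cat", 0.8, date(2009, 1, 1), date(2009, 6, 30)),
]
pruned = prune_star_nodes(edges, max_degree=3)
kept = significant_edges(pruned, date(2010, 6, 1), min_weight=0.5)
print([(e.source, e.target) for e in kept])
```
</pre>
<p>In this made-up data the pizza parlor is the only star node, and only the ann-bob edge is both current and significant.</p>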
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/06/19/objectivity-infinite-graph/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>The most important part of the “social graph” is neither social nor a graph</title>
		<link>http://www.dbms2.com/2010/06/08/profile-of-revealed-preferences/</link>
		<comments>http://www.dbms2.com/2010/06/08/profile-of-revealed-preferences/#comments</comments>
		<pubDate>Tue, 08 Jun 2010 05:18:36 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Games and virtual worlds]]></category>
		<category><![CDATA[Liberty and privacy]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2235</guid>
		<description><![CDATA[“Social graph” is a highly misleading term, and so is “social network analysis.” By this I mean:
There&#8217;s something akin to “social graphs” and “social network analysis” that is more or less worthy of all the current hype – but graphs and network analysis are only a minor part of the whole story.
In particular, the most [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">“Social graph” is a highly misleading term, and so is “social network analysis.” By this I mean:</p>
<p><strong>There&#8217;s something akin to “social graphs” and “social network analysis” that is more or less worthy of all the current hype – but graphs and network analysis are only a minor part of the whole story.</strong></p>
<p style="margin-bottom: 0in;">In particular, <strong>the most important parts of the Facebook “social graph” are neither social nor a graph. </strong><span style="font-weight: normal;">Rather, what&#8217;s really important is an aggregate</span><strong> Profile of Revealed Preferences</strong><span style="font-weight: normal;">, of which person-to-person connections or other things best modeled by a graph play only a small part.</span></p>
<p style="margin-bottom: 0in;"><span id="more-2235"></span>Let me hasten to note that – even when viewed narrowly – the ideas of “social graph” and “<a href="../2009/08/21/social-network-analysis-aka-relationship-analytics/">social network analysis</a>” do have significance. Nontrivial use cases to date for big data social network analysis include:</p>
<ul>
<li>Intelligence agencies identify and 	analyze terrorist networks. Corporations and civilian law 	enforcement do the same for fraud networks.</li>
<li>Telephone companies use calling 	data to figure out which of their customers are most likely to 	influence which other customers in the decision to keep or change 	service providers. (Frankly, I find that rather creepy.)</li>
<li>Social networks figure out which 	other members you&#8217;re likely to know, and encourage you to connect 	with them.</li>
</ul>
<p style="margin-bottom: 0in;">Epidemiologists aspire to add to that list, based on their success to date using much more micro forms of social network analysis. But after that, I&#8217;m running out of examples. Sure, graph analytics is good for a bunch of other things (e.g., biology at the genetic or molecular level), but those have little or nothing to do with “social graphs” or social network analysis as they are commonly understood.</p>
<p style="margin-bottom: 0in;"><em>Note: Of course, it is also the case that everything can be modeled by entity-attribute-value triples, and those can always be modeled by graphs. But so what?</em></p>
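<p>For what it&#8217;s worth, the triples-to-graph mapping mentioned in that note is mechanical, as this small Python sketch suggests. The sample triples and the function name <code>triples_to_graph</code> are made up for illustration.</p>
<pre>
```python
from collections import defaultdict


def triples_to_graph(triples):
    """Turn (entity, attribute, value) triples into labeled adjacency lists."""
    graph = defaultdict(list)
    for entity, attribute, value in triples:
        # Each triple becomes one edge, labeled by the attribute.
        graph[entity].append((attribute, value))
    return dict(graph)


triples = [
    ("alice", "lives_at", "12 Main St"),
    ("alice", "likes", "Lady Gaga"),
    ("bob", "lives_at", "12 Main St"),
]
graph = triples_to_graph(triples)
print(graph)
```
</pre>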
<p style="margin-bottom: 0in;">Let&#8217;s consider what, in a marketer&#8217;s ideal world, would go into yo<span style="font-weight: normal;">ur Profile of Revealed Preferences. Raw data might include:</span></p>
<ul>
<li><strong>Personally identifyING 	information. </strong>Duh. This is what makes everything else possible.</li>
<li><strong>Purchase transaction data.</strong> Lots of it. Like, everything on your credit card statements.</li>
<li><strong>Demographic and lifestyle 	information.</strong> Address, date of birth, educational history, race, 	household composition, and so on.</li>
<li><strong>Affiliations.</strong> Politics, 	religion, group membership of any kind. (OK, that&#8217;s partly social.)</li>
<li><strong>Explicitly stated opinions, 	preferences and desires,</strong><span style="font-weight: normal;"> including:</span>
<ul>
<li>Any surveys you have filled out.</li>
<li><strong>Any recommendations you have 	made</strong> (e.g., through the Facebook Like feature).</li>
<li>The text of anything you&#8217;ve 	written and posted – and, very ideally, of your private emails as 	well.</li>
<li>Any <strong>wish lists</strong> you&#8217;ve 	filled in.</li>
</ul>
</li>
<li><strong>Attention information.</strong> What 	you clicked on, what you looked at, and all that stuff website 	owners track.</li>
<li><strong>Your movements, </strong><span style="font-weight: normal;">to 	the extent they are tracked. (E.g., via Foursquare and the like.)</span></li>
<li><strong>Your gaming activities</strong><span style="font-weight: normal;"> and the like. (This is social mainly to the extent it overlaps with 	other categories I&#8217;ve already mentioned.)</span></li>
<li><strong>Your medical information.</strong><span style="font-weight: normal;"> </span></li>
<li><strong>Who you communicate with, and 	what you communicate with them about.</strong><span style="font-weight: normal;"> (Hey! There&#8217;s something else social!)</span></li>
<li><span style="font-weight: normal;">Similar </span><strong>information about the people you communicate with.</strong></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-weight: normal;">My core </span><strong>privacy</strong><span style="font-weight: normal;"> thoughts about that data include:</span></p>
<ul>
<li><strong>Individuals deserve the right 	to control all that information.</strong><span style="font-weight: normal;"> At a minimum, they deserve total control over how the data (raw or 	derived) is passed from the service – e.g., website – where it 	naturally resides (e.g., where it is originated) to any other place.</span></li>
<li>Given a chance, <strong>individuals would make fine-grained choices about what parts of their Profile of Revealed Preferences are available to which organizations.</strong> Reasons include:
<ul>
<li>Individuals have rather complex trust relationships with different kinds of merchants and marketers.</li>
<li>Consumers get different benefits from sharing information with different kinds of merchants and marketers. (Sometimes personalization is a large benefit. Sometimes it&#8217;s just creepy. And some companies actively bribe you to give them information they can use to sell to you.)</li>
</ul>
</li>
</ul>
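<p>One way to picture fine-grained control is a per-organization sharing policy over profile fields. The Python below is a minimal sketch only; the field names, organizations, and <code>visible_profile</code> helper are all hypothetical, and it ignores the much harder problem of derived data.</p>
<pre>
```python
def visible_profile(profile, sharing_policy, organization):
    """Return only the profile fields this organization may see."""
    allowed = sharing_policy.get(organization, set())
    return {field: value for field, value in profile.items() if field in allowed}


profile = {
    "purchases": ["running shoes"],
    "medical": ["asthma"],
    "movements": ["gym check-in"],
}
# The individual decides, field by field, who gets what.
sharing_policy = {
    "shoe_store": {"purchases", "movements"},
    "pharmacy": {"medical"},
}
shoe_store_view = visible_profile(profile, sharing_policy, "shoe_store")
stranger_view = visible_profile(profile, sharing_policy, "unknown_marketer")
print(shoe_store_view, stranger_view)
```
</pre>
<p>Organizations with no entry in the policy see nothing, which makes sharing opt-in by default in this sketch.</p>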
<p style="margin-bottom: 0in;">When one frames things this way, two rather difficult technological questions naturally arise.</p>
<ol>
<li>Suppose, implausibly, that a 	single entity were allowed to control and use (for marketing) all of 	your Profile of Revealed Preferences information. How would they 	store and analyze it?</li>
<li>How does the answer to #1 change 	because control over the information will, in fact, be fragmented?</li>
</ol>
<p style="margin-bottom: 0in;">It&#8217;s tough enough to answer these questions for data about one person. Trying to include all but the simplest information about other people is and will for years remain quite infeasible. So, for the most part, <strong>this is not “social” information.</strong></p>
<p style="margin-bottom: 0in;">It&#8217;s also <strong>not naturally a “graph.”</strong> Similarly, it is <strong>not a good candidate for network analysis.</strong> To see why, let me outline <strong>why I used the name “Profile of Revealed Preferences”:</strong></p>
<ul>
<li>The reason marketers want this 	data is, mainly, because they want to know what appeals to you, and 	how strongly you feel about it.</li>
<li>The analytic process often entails 	taking explicit choices you have made, and inferring other 	preferences from them.</li>
<li>The output of the analytic process 	is often one or more “scores” that then get plugged into various 	selection algorithms to determine what you should be shown or 	offered. At least implicitly, these algorithms are predicting what 	you will or won&#8217;t respond well to.</li>
</ul>
<p style="margin-bottom: 0in;">Not much graph-like there.</p>
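<p>To make the “scores” point concrete, here is a deliberately naive Python sketch that infers category-level scores from explicit choices and plugs them into a selection step. The scoring rule (share of choices per category), the data, and both function names are assumptions for illustration, not a description of any real marketing system.</p>
<pre>
```python
from collections import Counter


def preference_scores(explicit_choices, item_categories):
    """Infer per-category scores from the items a person explicitly chose."""
    counts = Counter(item_categories[item] for item in explicit_choices)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def select_offers(scores, offers, limit):
    """Rank candidate offers by the score of their category; keep the top few."""
    ranked = sorted(offers, key=lambda offer: -scores.get(offer[1], 0.0))
    return [name for name, _category in ranked[:limit]]


item_categories = {
    "album_a": "music",
    "album_b": "music",
    "game_x": "games",
    "novel_y": "books",
}
likes = ["album_a", "album_b", "game_x"]
offers = [("concert tickets", "music"), ("cookbook", "books"), ("console", "games")]

scores = preference_scores(likes, item_categories)
chosen = select_offers(scores, offers, limit=2)
print(chosen)
```
</pre>
<p>Everything here is tabular counting and sorting over one person&#8217;s data. No edges are traversed anywhere.</p>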
<p style="margin-bottom: 0in;">This post has gotten pretty long, so I&#8217;ll stop here without spelling anything else out. But questions I still hope to address down the road include:</p>
<ul>
<li>How should Profile of Revealed Preferences data be stored?</li>
<li>Suppose we want to pass around 	derived results and not the raw data. How could we ever get to 	standards that would make such interchange realistic?</li>
<li>If we only have raw data to pass 	around, what are the implications for privacy, liberty, and the 	structure of the online industries?</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/06/08/profile-of-revealed-preferences/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Information found in public-facing social networks</title>
		<link>http://www.dbms2.com/2010/04/08/social-networks-graph/</link>
		<comments>http://www.dbms2.com/2010/04/08/social-networks-graph/#comments</comments>
		<pubDate>Thu, 08 Apr 2010 14:52:55 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Liberty and privacy]]></category>
		<category><![CDATA[RDF and graphs]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1862</guid>
		<description><![CDATA[Here are some examples illustrating two recent themes of mine, namely:

Easily-available information reveals all sorts of things about us.
Graph-based analysis is on the rise.

Pete Warden scraped all of Facebook&#8217;s social graph (at least for the United States), and put up a really interesting-looking visualization of same. Facebook&#8217;s lawyers came down on him, and he quickly [...]]]></description>
			<content:encoded><![CDATA[<p>Here are some examples illustrating two recent themes of mine, namely:</p>
<ul>
<li><a href="http://www.dbms2.com/2010/04/04/privacy-liberty-continued/" >Easily-available information reveals all sorts of things about us</a>.</li>
<li><a href="http://www.dbms2.com/2009/08/21/social-network-analysis-aka-relationship-analytics/" >Graph-based analysis</a> is on the rise.</li>
</ul>
<p>Pete Warden scraped all of Facebook&#8217;s social graph (at least for the United States), and put up a really interesting-looking <a href="http://petewarden.typepad.com/searchbrowser/2010/02/how-to-split-up-the-us.html" onclick="javascript:pageTracker._trackPageview('/petewarden.typepad.com');">visualization</a> of same. Facebook&#8217;s lawyers came down on him, and he quickly agreed to <a href="http://petewarden.typepad.com/searchbrowser/2010/03/facebook-data-destruction.html" onclick="javascript:pageTracker._trackPageview('/petewarden.typepad.com');">destroy</a> the data he&#8217;d scraped, but also published ideas on how other people could <a href="http://petewarden.typepad.com/searchbrowser/2010/02/how-to-harvest-facebook-profiles-from-emails-without-logging-in.html" onclick="javascript:pageTracker._trackPageview('/petewarden.typepad.com');">duplicate</a> his work.</p>
<p>Warden has since given an <a href="http://www.fastcompany.com/1607273/exclusive-facebook-data-guru-speaks-about-why-facebook-threatened-to-sue-him" onclick="javascript:pageTracker._trackPageview('/www.fastcompany.com');">interview</a> in which he outlines some of the things researchers hoped to do with this data:<span id="more-1862"></span></p>
<blockquote><p>One request I got was someone hoping to study how social connectedness and social networking relates to finding jobs. &#8230;</p>
<p>Another request was art historians trying to figure out how artist popularity changes, spreads, and grows over time. &#8230;</p>
<p>One group wanted to see how socially connected different regions were and how that relates to disease transmission. Because places like New York and L.A. might be more closely connected than L.A. and a city somewhere else in California.</p></blockquote>
<p>I don&#8217;t have a clear sense whether anybody was proposing to do any serious graph analytics, or if the main interest was in drawing pretty pictures and hoping insight would emerge. At a guess, I&#8217;d say it was probably some of each. (Although another Pete Warden post underscores that <a href="http://petewarden.typepad.com/searchbrowser/2008/12/what-does-the-bouldertwits-graph-mean.html" onclick="javascript:pageTracker._trackPageview('/petewarden.typepad.com');">the act of drawing the visualization involves some analytics of its own</a>.)</p>
<p>There&#8217;s other great stuff in Warden&#8217;s blog too, such as this simple post on <a href="http://petewarden.typepad.com/searchbrowser/2010/03/the-unknown-marketing-databases-that-know-everything-about-you.html" onclick="javascript:pageTracker._trackPageview('/petewarden.typepad.com');">the personal data available about everybody, to everybody</a>, and &#8212; here I&#8217;m totally digressing from the main topics of this post &#8212; a wonderful illustration of <a href="http://petewarden.typepad.com/searchbrowser/2010/03/class-and-why-i-left-britain.html" onclick="javascript:pageTracker._trackPageview('/petewarden.typepad.com');">why the US gets the best immigrants</a>.</p>
<p>Meanwhile, back on topic, it seems the <a href="http://www.theregister.co.uk/2010/04/07/facebook_spying_gaza/" onclick="javascript:pageTracker._trackPageview('/www.theregister.co.uk');">Israeli military</a> is, quite reasonably, paying attention to what Palestinians in the occupied territories post about themselves.</p>
<p>The <a href="http://yro.slashdot.org/firehose.pl?op=view&amp;type=story&amp;sid=10/03/31/1430256" onclick="javascript:pageTracker._trackPageview('/yro.slashdot.org');">Slashdot</a> discussion of Warden&#8217;s efforts also pointed to something earlier and similar, namely <a href="http://lumberjaph.net/blog/index.php/2010/03/25/github-explorer/" onclick="javascript:pageTracker._trackPageview('/lumberjaph.net');">graphical visualization of participation in various open source communities</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/04/08/social-networks-graph/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Notes on the evolution of OLTP database management systems</title>
		<link>http://www.dbms2.com/2010/04/05/oltp-database-management-systems-2/</link>
		<comments>http://www.dbms2.com/2010/04/05/oltp-database-management-systems-2/#comments</comments>
		<pubDate>Mon, 05 Apr 2010 08:22:03 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Akiban]]></category>
		<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EnterpriseDB and Postgres Plus]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[Market share]]></category>
		<category><![CDATA[Memory-centric data management]]></category>
		<category><![CDATA[Mid-range]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[VoltDB and H-Store]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1841</guid>
		<description><![CDATA[The past few years have seen a spate of startups in the analytic DBMS business. Netezza, Vertica, Greenplum, Aster Data and others are all reasonably prosperous, alongside older specialty product vendors Teradata and Sybase (the Sybase IQ part).  OLTP (OnLine Transaction Processing) and general purpose DBMS startups, however, have not yet done as well, with [...]]]></description>
			<content:encoded><![CDATA[<p>The past few years have seen a spate of startups in the analytic DBMS business. Netezza, Vertica, Greenplum, Aster Data and others are all reasonably prosperous, alongside older specialty product vendors Teradata and Sybase (the Sybase IQ part).  OLTP <span style="font-weight: normal;">(OnLine Transaction Processing) </span>and general purpose DBMS startups, however, have not yet done as well, with such success as there has been (MySQL, Intersystems Cache&#8217;, solidDB&#8217;s exit, etc.) generally accruing to products that originated in the 20th Century.</p>
<p>Nonetheless, OLTP/general-purpose data management startup activity has recently picked up, targeting what I see as some very real opportunities and needs. So as a jumping-off point for further writing, I thought it might be interesting to collect a few observations about the market in one place.  These include:</p>
<ul>
<li><span style="font-weight: normal;">Big-brand 	OLTP/general-purpose DBMS have more “stickiness” 	than analytic DBMS.</span></li>
<li><span style="font-weight: normal;">By 	number, most of an enterprise&#8217;s OLTP/general-purpose databases are low-volume and 	low-value. </span></li>
<li>Most 	interesting new OLTP/general-purpose data management products are <span style="font-style: normal;">either 	MySQL-based or NoSQL.</span></li>
<li>It&#8217;s not yet 	clear whether MySQL will prevail over MySQL forks, or vice-versa, or 	whether they will co-exist.</li>
<li>The era of 	silicon-centric relational DBMS is coming.</li>
<li>The emphasis 	on scale-out and reducing the cost of joins spans the NoSQL and 	SQL-based worlds.<em> </em></li>
<li><span style="font-weight: normal;">Users&#8217; insistence on “free” could be a major problem for OLTP DBMS innovation. </span></li>
</ul>
<p style="margin-bottom: 0in;">I shall explain.<span id="more-1841"></span></p>
<p style="margin-bottom: 0in;"><strong>Big-brand OLTP/general-purpose DBMS have more “stickiness” than analytic DBMS.</strong></p>
<ul>
<li>OLTP 	applications are more complex than analytic ones, and hence more 	tightly wired into particular brands of DBMS. For example, 	third-party packaged OLTP applications are typically portable among 	only a few brands of DBMS. But third-party business intelligence 	tools, and the BI “applications” built in them, are more easily 	and widely portable.</li>
<li>Specific technical observations 	such as “OLTP apps tend to use stored procedures, which are 	DBMS-specific” or “OLTP apps tend to have lots and lots of 	tables” serve to underscore the first point.</li>
<li>An enterprise&#8217;s highest-value data 	is commonly the financial stuff handled by its core OLTP systems, so 	those are the last things they want to mess around with just to get 	some cost savings. Security, high availability, and so on are major 	considerations that can outweigh cost.</li>
</ul>
<p style="margin-bottom: 0in;"><strong>By number, most of an enterprise&#8217;s OLTP/general-purpose databases are low-volume and low-value. </strong>Indeed, “OLTP” is often a misnomer, which is why I tend to go with “general-purpose” or some similarly wishy-washy phrase instead.</p>
<ul>
<li>In theory, this is a ripe area for 	what I&#8217;ve called <a href="http://www.dbms2.com/category/database-management-system/mid-range/" >mid-range DBMS</a>.</li>
<li>The big brand vendors try hard to 	keep as many of those databases for themselves as they can. 	Enterprise-wide license pricing helps. Going forward, so will 	virtualization/consolidation strategies, such as <a href="http://www.dbms2.com/2010/01/22/oracle-database-hardware-strategy/" >Oracle&#8217;s 	Exadata-centric approach</a>.</li>
<li>A variety of mid-range DBMS 	alternatives beyond the big brands have technical merit, at least in 	some cases and configurations – MySQL, PostgreSQL, Intersystems 	Cache&#8217;, and so on.</li>
<li>The only such mid-range DBMS 	alternative with much large enterprise business momentum, however, 	appears to be MySQL.</li>
</ul>
<p style="margin-bottom: 0in;"><strong>&#8220;General-purpose&#8221; might be a better term than &#8220;OLTP&#8221; anyway.</strong></p>
<ul>
<li>I don&#8217;t have a link, but it&#8217;s widely agreed that over half of the processing on an &#8220;OLTP&#8221; enterprise app is commonly reporting and so on.</li>
<li>&#8220;Operational BI&#8221; is progressing by fits and starts, but it is progressing.</li>
<li>Anything customer-facing &#8212; web-based, call center, or otherwise &#8212; is likely to include a heavy dose of &#8220;real-time&#8221; analytic optimization.</li>
</ul>
<p style="margin-bottom: 0in;"><strong>Most interesting new OLTP/general-purpose data management products are <span style="font-style: normal;">either MySQL-based or NoSQL.</span></strong></p>
<ul>
<li><a href="http://www.dbms2.com/2009/06/22/h-store-horizontica-voltdb/" >VoltDB</a> is the main 	exception that jumps to mind.</li>
<li>This isn&#8217;t true in the analytic 	DBMS area, where Netezza, Greenplum, Aster, Vertica and others 	started from PostgreSQL&#8217;s code, APIs, or both.</li>
</ul>
<p style="margin-bottom: 0in;"><strong>It&#8217;s not yet clear whether MySQL will prevail over MySQL forks, or vice-versa, or whether they will co-exist.</strong></p>
<ul>
<li>MySQL is a limited product without 	all the third-party storage engines that are being developed.</li>
<li><a href="http://www.dbms2.com/2009/12/14/oracle-mysql-storage-engine/" >Oracle&#8217;s promise of MySQL good 	behavior</a> has an expiration date.</li>
<li>None of the MySQL front-end 	alternatives are remotely mature yet.</li>
</ul>
<p style="margin-bottom: 0in;"><strong>The era of silicon-centric relational DBMS is coming.</strong></p>
<ul>
<li>I think “silicon” means 	“solid-state memory” as much as or more than it means “RAM,” 	but that&#8217;s not yet certain.</li>
<li>What is pretty certain is that, 	thanks to Moore&#8217;s Law, some kind of silicon will increasingly 	replace disk.</li>
<li><a href="http://www.dbms2.com/2010/01/22/oracle-database-hardware-strategy/" >Oracle&#8217;s increasingly 	Flash-centric story</a> is a challenge to everybody.</li>
<li>RAM-centric VoltDB will launch 	fairly soon. (By the way, while VoltDB still has <a href="http://www.dbms2.com/2009/06/22/h-store-horizontica-voltdb/" >a lot in common 	with H-Store</a>, they&#8217;re not exactly the same thing. And <a href="http://bit.ly/9QxjV2." onclick="javascript:pageTracker._trackPageview('/bit.ly');">H-Store 	research</a> is progressing too.)</li>
<li><span style="font-style: normal;"><a href="http://rethinkdb.com/" onclick="javascript:pageTracker._trackPageview('/rethinkdb.com');">RethinkDB</a> is being de</span>veloped, focused directly on solid-state memory. 	Based on the sparse information available online, RethinkDB sounds 	somewhat like a dumbed-down H-Store.</li>
<li>New disk-based vendors may never 	optimize their use of disk, instead targeting a solid-state future. 	(E.g., I think Akiban should and quite well might follow this path.)</li>
</ul>
<p style="margin-bottom: 0in; font-weight: normal;"><strong>The emphasis on scale-out and reducing the cost of joins spans the NoSQL and SQL-based worlds.</strong> We hear that from the <a href="http://www.dbms2.com/2010/03/14/nosql-taxonomy/" >NoSQL</a> guys all the time. But I also just heard it from <a href="http://www.dbms2.com/2010/04/03/akiban-highlights/" >Akiban</a>.</p>
<p style="margin-bottom: 0in;"><strong>Users&#8217; insistence on “free” could be a major problem for OLTP DBMS innovation.</strong> Vendors of new OLTP data management technologies often feel obligated to open source their products, notwithstanding the historical lack of revenue in the open source OLTP DBMS market. As just one of many examples, <a href="http://www.novaspivack.com/uncategorized/evri-ties-the-knot-with-twine" onclick="javascript:pageTracker._trackPageview('/www.novaspivack.com');">Nova Spivack</a> wrote:</p>
<blockquote>
<p style="margin-bottom: 0in;">I have recently seen some new graph data storage products that may provide the levels of scale and performance needed, but pricing has not been determined yet. In short, storage and retrieval of semantic graph datasets is a big unsolved challenge that is holding back the entire industry. We need federated database systems that can handle hundreds of billions to trillions of triples under high load conditions, in the cloud, on commodity hardware and open source software. Only then will it be affordable to make semantic applications and services at Web-scale.</p>
</blockquote>
<p style="margin-bottom: 0in;">I hear similar things from other startups, who evidently believe they need and/or are entitled to enjoy sophisticated, high-performance, zero-cost, specialized database management technology.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/04/05/oltp-database-management-systems-2/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Toward a NoSQL taxonomy</title>
		<link>http://www.dbms2.com/2010/03/14/nosql-taxonomy/</link>
		<comments>http://www.dbms2.com/2010/03/14/nosql-taxonomy/#comments</comments>
		<pubDate>Sun, 14 Mar 2010 23:24:45 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Structured documents]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1708</guid>
		<description><![CDATA[I talked Friday with Dwight Merriman, founder of 10gen (the MongoDB company). He more or less convinced me of his definition of NoSQL systems, which in my adaptation goes:
NoSQL = HVSP (High Volume Simple Processing) without joins or explicit transactions
Within that realm, Dwight offered a two-part taxonomy of NoSQL systems, according to their data model [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">I talked Friday with Dwight Merriman, founder of 10gen (the MongoDB company). He more or less convinced me of his definition of NoSQL systems, which in my adaptation goes:</p>
<p style="margin-bottom: 0in;"><strong>NoSQL = <a href="http://www.dbms2.com/2010/03/13/the-naming-of-the-foo/" >HVSP (High Volume Simple Processing)</a> without joins or explicit transactions</strong></p>
<p style="margin-bottom: 0in;">Within that realm, Dwight offered a two-part taxonomy of NoSQL systems, according to their data model and replication/sharding strategy. I&#8217;d be happier, however, with at least three parts to the taxonomy:</p>
<ul>
<li>How data looks logically on a 	single node</li>
<li>How data is stored physically on a 	single node</li>
<li>How data is distributed, 	replicated, and reconciled across multiple nodes, and whether 	applications have to be aware of how the data is partitioned among 	nodes/shards.<span id="more-1708"></span></li>
</ul>
<p style="margin-bottom: 0in;">After talking with Dwight, and also with Cassandra project chair Jonathan Ellis, I feel I&#8217;m doing decently in understanding the first of those three areas. But there&#8217;s a long way yet to go on the other two.</p>
<p style="margin-bottom: 0in;">In Dwight&#8217;s opinion, as I understand it, NoSQL data models come in four general kinds.</p>
<ul>
<li><em><strong>Key-value stores,</strong></em><em> more or less pure.</em> I.e., they store keys+BLOBs (Binary Large 	OBjects), except that the “Large” part of “BLOB” may not 	come into play.</li>
<li><em><strong>Table-oriented,</strong></em><em> more or less. </em>The major examples here are Google&#8217;s BigTable, and 	Cassandra.</li>
<li><em><strong>Document-oriented,</strong></em><em> where a “document” is more like XML than free text. </em>MongoDB 	and CouchDB are the big examples here.</li>
<li><strong><em>Graph-oriented.</em> </strong><span style="font-weight: normal;">To 	date, this is the smallest area of the four. I&#8217;m reserving judgment 	as to whether I agree it&#8217;s properly included in HVSP and NoSQL.</span></li>
</ul>
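<p style="margin-bottom: 0in;">To make the four kinds concrete, here is a rough sketch of one user record in each shape, using plain Python structures. Everything in it (the key <code>user:42</code>, the field names, the values) is invented for illustration, and no real product stores data exactly this way.</p>

```python
import json

# Key-value: the store sees only an opaque blob under a key.
key_value = {"user:42": b"\x00\x01opaque-bytes"}

# Table-oriented: a row key maps to a set of named columns.
table = {"user:42": {"name": "Ada", "city": "Boston"}}

# Document-oriented: a nested, self-describing document (JSON here).
document = json.loads('{"_id": 42, "name": "Ada", "follows": [{"id": 7}]}')

# Graph-oriented: nodes plus explicit, typed edges between them.
edges = [(42, "follows", 7)]
```

<p style="margin-bottom: 0in;">The practical difference is how much of the structure the store can see: nothing in the first case, columns in the second, nested fields in the third, and relationships in the fourth.</p>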
<p style="margin-bottom: 0in;">As Dwight sees it, JSON (JavaScript Object Notation) is the emerging markup standard for the document-oriented data models, and to some extent the BLOB part of key-value models as well. Reasons seem to include:</p>
<ul>
<li>JSON is something web developers 	are likely to know anyway.</li>
<li>JSON, unlike XML, is schema-less. 	In the NoSQL world, that&#8217;s perceived as a good thing.</li>
<li>Perhaps for both these reasons, 	JSON is perceived as easier to use than XML.</li>
</ul>
<p style="margin-bottom: 0in;">Except as noted, I&#8217;m not aware of anything that solidly contradicts the above.</p>
<p style="margin-bottom: 0in;">Dwight went on to say that there are two main NoSQL replication/sharding models, in line with the seminal papers to which I <a href="http://www.dbms2.com/2010/03/12/some-nosql-links/" >previously linked</a>:</p>
<ul>
<li><em>Based on or resembling </em><em><strong>Dynamo.</strong></em> The core idea here is accepting <strong>eventual consistency</strong> among 	nodes as being good enough, even if that means you sometimes read 	dirty data. The benefit is that <strong>you never are blocked from 	writing.</strong> By way of contrast, systems that enforce true 	inter-node consistency (think of a two-phase commit) can shut you 	down from writing if consistency guarantees aren&#8217;t being confirmed 	in a timely manner. Thus, in a Dynamo-like scheme you write data to 	multiple nodes, via <strong>consistent hashing;</strong> then when the time 	comes you read one or more nodes, and hope that what you&#8217;re getting 	back is a correct result.</li>
<li><em>Based on or resembling </em><em><strong>BigTable.</strong></em> In this model you&#8217;re trying to keep the 	nodes fully consistent in the usual way, e.g. by synchronous 	replication. Indeed, what&#8217;s being kept consistent is both data 	itself, and metadata about the data&#8217;s location. Details surely vary 	a lot from implementation to implementation.</li>
</ul>
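<p style="margin-bottom: 0in;">For readers who haven&#8217;t met consistent hashing, a minimal sketch may help. It is a toy under stated assumptions, not how Dynamo or Cassandra actually implement it: the class <code>HashRing</code>, the method <code>nodes_for</code>, the <code>vnodes</code> parameter, and the use of MD5 are all my own illustrative choices.</p>

```python
import bisect
import hashlib


def ring_position(key):
    """Map a string to a position on the hash ring (a large integer)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring; names and defaults are hypothetical."""

    def __init__(self, nodes, vnodes=64):
        # Each node is placed at several pseudo-random points on the ring.
        self._points = sorted(
            (ring_position(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._positions = [pos for pos, _ in self._points]

    def nodes_for(self, key, replicas=3):
        """Walk clockwise from the key's position, collecting distinct nodes."""
        start = bisect.bisect(self._positions, ring_position(key))
        found = []
        for step in range(len(self._points)):
            node = self._points[(start + step) % len(self._points)][1]
            if node not in found:
                found.append(node)
            if len(found) == replicas:
                break
        return found
```

<p style="margin-bottom: 0in;">A write for a key would go to every node returned by <code>nodes_for</code>. The appeal of the scheme is that adding a node only takes over the keys that now land on the new node&#8217;s points; the rest stay where they were.</p>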
<p style="margin-bottom: 0in;">I&#8217;m fuzzier on this stuff than on the data models, because to date nobody has ever explained to me how an actual live system (MongoDB, Cassandra, whatever) implements its replication strategy. Also, while I think that in both these models applications are allowed to be ignorant of the replication/sharding strategy, I&#8217;m not as sure of that as I&#8217;d like to be.</p>
<p style="margin-bottom: 0in;">If we stop here, we already have something useful. MongoDB has a document data model, and is in the BigTable-like replication camp, at least at first. Cassandra has a table-like data model, and is on the Dynamo-like eventual consistency side. But to say those are the only differences that matter would be like saying that all shared-disk RDBMS (e.g., Oracle and Sybase IQ) are essentially alike. That, of course, would be nonsense.</p>
<p style="margin-bottom: 0in;">So a third dimension needed in this taxonomy is how the systems actually bang data on and off of disk (or silicon, as the case may be). I don&#8217;t yet have an overview of that. I know something of how Cassandra does it, and will write about same in a future post, but that&#8217;s about it. So please stay tuned.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/03/14/nosql-taxonomy/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Some NoSQL links</title>
		<link>http://www.dbms2.com/2010/03/12/some-nosql-links/</link>
		<comments>http://www.dbms2.com/2010/03/12/some-nosql-links/#comments</comments>
		<pubDate>Fri, 12 Mar 2010 23:51:42 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Amazon and its cloud]]></category>
		<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Continuent]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Tokutek]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1692</guid>
		<description><![CDATA[I plan to post a few things soon about MongoDB, Cassandra, and NoSQL in general. So I&#8217;m poking around a bit reading stuff on the subjects. Here are some links I found.

A little over a year ago, Julian Browne put up a great post on Eric Brewer&#8217;s CAP conjecture/theorem, which provides much of the impetus [...]]]></description>
			<content:encoded><![CDATA[<p>I plan to post a few things soon about MongoDB, Cassandra, and NoSQL in general. So I&#8217;m poking around a bit reading stuff on the subjects. Here are some links I found.<span id="more-1692"></span></p>
<ul>
<li>A little over a year ago, Julian Browne put up a great post on <a href="http://www.julianbrowne.com/article/viewer/brewers-cap-theorem" onclick="javascript:pageTracker._trackPageview('/www.julianbrowne.com');">Eric Brewer&#8217;s CAP conjecture/theorem</a>, which provides much of the impetus to relax the traditional requirement for atomicity/consistency.</li>
<li>Even more directly inspirational to NoSQL technology development were two seminal papers: Google&#8217;s on <a href="http://labs.google.com/papers/bigtable.html" onclick="javascript:pageTracker._trackPageview('/labs.google.com');">BigTable</a> and Amazon&#8217;s on <a href="http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf" onclick="javascript:pageTracker._trackPageview('/s3.amazonaws.com');">Dynamo</a>. (That said, I&#8217;m having trouble getting myself to actually read them from start to finish, especially since they&#8217;ve been superseded by subsequent technology development.)</li>
<li>10gen (the MongoDB guys) hosted a NoSQL conference yesterday. Much blogging has ensued. The best post I&#8217;ve seen so far was by <a href="http://blog.marcua.net/post/442594842/notes-from-nosql-live-boston-2010" onclick="javascript:pageTracker._trackPageview('/blog.marcua.net');">Adam Marcus</a>. I find the graph database notes near the bottom particularly interesting.</li>
<li>Mark Callaghan hit back against the <a href="http://mysqlha.blogspot.com/2010/03/plays-well-with-others.html" onclick="javascript:pageTracker._trackPageview('/mysqlha.blogspot.com');">NoSQL <span style="text-decoration: line-through;">movement</span> hype</a>, and in particular against the <a href="http://www.dbms2.com/2010/03/02/cassandra-nosql-scalable-oltp/" >MySQL/memcached is passe</a>&#8216; meme. On the other hand, he also bemoaned many failings of MySQL. On the third hand, he praised or at least expressed hope for a variety of MySQL-related technologies, including <a href="http://www.dbms2.com/2009/04/16/introduction-to-tokutek/" >Tokutek&#8217;s TokuDB</a> and <a href="http://www.dbms2.com/2009/09/03/continuent-on-clustering/" >Continuent&#8217;s Tungsten</a>.</li>
<li>In connection with that debate, Mark Rendle offered a <a href="http://blog.markrendle.net/2010/03/do-you-need-relational-database.html" onclick="javascript:pageTracker._trackPageview('/blog.markrendle.net');">funny rant</a>, mainly pro-NoSQL, in the style of a Socratic dialogue.</li>
<li>John Quinn of Digg recently described <a href="http://www.stumbleupon.com/su/5099Ti/about.digg.com/node/564" onclick="javascript:pageTracker._trackPageview('/www.stumbleupon.com');">Digg&#8217;s move from MySQL to Cassandra</a>, and outlined a lot of features Digg was adding to Cassandra, all of which it is open-sourcing.</li>
<li>The NoSQL guys maintain their own long <a href="http://nosql-database.org/links.html" onclick="javascript:pageTracker._trackPageview('/nosql-database.org');">list of NoSQL-related links</a>.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/03/12/some-nosql-links/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Aster Data nCluster 4.5</title>
		<link>http://www.dbms2.com/2010/02/22/aster-data-ncluster-4-5/</link>
		<comments>http://www.dbms2.com/2010/02/22/aster-data-ncluster-4-5/#comments</comments>
		<pubDate>Mon, 22 Feb 2010 08:20:13 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[SAS Institute]]></category>
		<category><![CDATA[Teradata]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1617</guid>
		<description><![CDATA[Like Vertica, Netezza, and Teradata, Aster is using this week to pre-announce a forthcoming product release, Aster Data nCluster 4.5. Aster is really hanging its identity on “Big Data Analytics” or some variant of that concept, and so the two major named parts of Aster nCluster 4.5 are:

Aster Data Analytic Foundation, a set of analytic [...]]]></description>
			<content:encoded><![CDATA[<p>Like <a href="http://www.dbms2.com/2010/02/22/vertica-4/" >Vertica</a>, <a href="http://www.dbms2.com/2010/02/22/netezza-twinfin/" >Netezza</a>, and Teradata, Aster is using this week to pre-announce a forthcoming product release, Aster Data nCluster 4.5. Aster is really hanging its identity on “Big Data Analytics” or some variant of that concept, and so the two major named parts of Aster nCluster 4.5 are:</p>
<ul>
<li><strong>Aster Data Analytic Foundation,</strong> a set of analytic packages prebuilt in <a href="../2009/06/09/aster-data-nclustersql-mapreduce/">Aster&#8217;s SQL-MapReduce</a><strong></strong></li>
<li><strong>Aster Data Developer Express,</strong> an Eclipse-based IDE (Integrated Development Environment) for developing and testing applications built on Aster nCluster, Aster SQL-MapReduce, and Aster Data Analytic Foundation</li>
</ul>
<p>And in other Aster news:</p>
<ul>
<li>Along with the development GUI in Aster nCluster 4.5, there is also a new administrative GUI.</li>
<li>Aster has certified that nCluster works with Fusion I/O boards, because at least one retail industry prospect cares. However, that in no way means that arm&#8217;s-length Fusion I/O certification is Aster&#8217;s ultimate <a href="../2010/01/31/flash-pcmsolid-state-memory-disk/">solid-state memory</a> strategy.</li>
<li>I had the wrong impression about how far Aster/SAS integration has gotten. So far, it&#8217;s just at the connector level.</li>
</ul>
<p>Aster Data Developer Express evidently does some cool stuff, like providing some sort of parallelism testing right on your desktop. It also generates lots of stub code, saving humans from the tedium of doing that. Useful, obviously.</p>
<p>But mainly, I want to write about the analytic packages.<span id="more-1617"></span> I&#8217;m not convinced that they&#8217;re a big deal in themselves yet, or that a whole lot of person-months have gone into their combined development. Still, I think they provide a great indication of one direction in which analytic functionality is going. And by the way, Aster promises to release a lot more of that kind of thing over the next 12 months.</p>
<p>Aster&#8217;s flagship analytic package is <a href="../2009/02/10/aster-data-npath/">nPath</a>, which is like a <strong>regular expression matcher,</strong> but <strong>for (time) series of data</strong> rather than for character strings. The main use for nPath is in pulling specific kinds of event sequences out of web or network event logs. However, one could imagine uses in other sectors that focus on temporal or sequential data (e.g., trading, intelligence, other sensor analysis), should existing SQL- and/or CEP-based technologies not prove sufficiently flexible. Aster 4.5 adds some new aggregation capabilities around nPath.</p>
<p>Other not-wholly-new packages in the Aster Data Analytic Foundation announcement are for <strong>sessionization</strong> (of clickstream data and the like) and <strong>tokenization </strong>(of text/character string data). While sessionization can be done in SQL, Aster thinks its MapReduce-based version is faster, since it doesn&#8217;t require self-joins. Makes sense. Aster&#8217;s tokenization sounds lame, however – text analytics in MapReduce tends to reinvent simplistic wheels for no clear reason, and Aster doesn&#8217;t seem to be an exception. (Aster would argue, however, that anything it does in SQL-MapReduce is more flexible than pure SQL or pure MapReduce alternatives.)</p>
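<p>To illustrate why sessionization doesn&#8217;t need self-joins, here is a minimal single-pass sketch in Python. It is not Aster&#8217;s implementation; the function name <code>sessionize</code>, the <code>(user, timestamp)</code> input shape, and the 30-minute timeout are assumptions made for the example.</p>

```python
from itertools import groupby
from operator import itemgetter


def sessionize(clicks, timeout=1800):
    """Assign a session number to each (user, timestamp) click.

    A new session starts whenever the gap since the user's previous
    click exceeds `timeout` seconds. One sorted pass, no self-join.
    """
    labeled = []
    ordered = sorted(clicks, key=itemgetter(0, 1))
    for user, rows in groupby(ordered, key=itemgetter(0)):
        session, previous = 0, None
        for _, ts in rows:
            if previous is not None and ts - previous > timeout:
                session += 1
            previous = ts
            labeled.append((user, ts, session))
    return labeled
```

<p>Each click is compared only with the one immediately before it for the same user, which is the step a pure SQL formulation would typically express by joining the click table to itself.</p>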
<p>Another example of better-living-without-self-joins is Aster&#8217;s new <strong>market basket</strong> package. This lets you look at a set of point-of-sale data, pick a small integer N, and pull out all the sets of N things that were bought by the same person at the same time. I haven&#8217;t probed the claim in detail, but Aster implies there&#8217;s less combinatorial explosion in its approach than there is in the self-join alternative.</p>
<p><em>Note: Gartner highlighted self joins as a performance challenge in its recent </em><a href="../2010/02/10/gartner-magic-quadrant-data-warehouse-2009-2010/">Data Warehouse Magic Quadrant</a><em>.</em></p>
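<p>As a rough sketch of the market basket idea (and only that; I don&#8217;t know how Aster&#8217;s package works internally), the N-item sets can be enumerated one basket at a time. The function name <code>itemsets</code> and the list-of-lists input are assumptions for illustration.</p>

```python
from collections import Counter
from itertools import combinations


def itemsets(baskets, n):
    """Count every n-item set that appears together in a single basket."""
    counts = Counter()
    for items in baskets:
        # Deduplicate and sort so each set is generated exactly once.
        for combo in combinations(sorted(set(items)), n):
            counts[combo] += 1
    return counts
```

<p>Because combinations are generated within each basket, items that were never bought together are never paired, whereas an N-way self-join has to form candidate pairs and then filter them.</p>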
<p>Aster is also releasing a few <strong>statistical and general analytic functions</strong> &#8212; specifically (and I quote a slide):</p>
<ul>
<li>exponential moving average</li>
<li>weighted moving average</li>
<li>simple moving average</li>
<li>volume-weighted average price</li>
<li>correlation</li>
<li>linear regression</li>
<li>logistic regression</li>
<li>approximate_percentile</li>
<li>approximate_count_distinct</li>
</ul>
<p>The point of the last two items on the list is that if you set a non-zero tolerance for error, you can count things or order them into bins very efficiently &#8212; especially in terms of RAM &#8212; while being guaranteed not to exceed your error tolerance.</p>
<p><em>Note: One obvious inference from this list &#8212; which Aster gladly confirms &#8212; is that Aster has high hopes of selling to the financial services industry. </em></p>
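<p>For reference, three of the functions on that list are simple enough to sketch in a few lines of Python. These are textbook definitions, not Aster&#8217;s code; the function names are mine, and seeding the exponential average with the first value is one common convention among several.</p>

```python
def simple_moving_average(values, window):
    """Mean of each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def exponential_moving_average(values, alpha):
    """Each output blends the new value with the previous average."""
    averages = []
    for value in values:
        if not averages:
            averages.append(value)  # assumption: seed with the first value
        else:
            averages.append(alpha * value + (1 - alpha) * averages[-1])
    return averages


def vwap(trades):
    """Volume-weighted average price over (price, volume) pairs."""
    total_volume = sum(volume for _, volume in trades)
    return sum(price * volume for price, volume in trades) / total_volume
```

<p>The interesting part in a parallel DBMS is not the arithmetic but running it over ordered data that is spread across many nodes.</p>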
<p>Finally, Aster is releasing its first pure <strong>graph-analytic</strong> function, for finding the shortest path between a given pair of nodes.</p>
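<p>Aster didn&#8217;t tell me which algorithm it uses, so take the following only as a sketch of the problem being solved: a breadth-first search for the shortest path in an unweighted graph. The adjacency-dictionary representation and the name <code>shortest_path</code> are assumptions for the example.</p>

```python
from collections import deque


def shortest_path(graph, source, target):
    """Breadth-first search; returns a list of nodes, or None if unreachable."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the source, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None
```

<p>On a single machine this is trivial; the hard part at scale is that each hop may land on a different node of the cluster.</p>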
<p>While I had the Aster folks on the phone anyway, I also took the opportunity to ask about the Aster nCluster 4.0 capability to create fairly persistent non-relational in-memory data structures. Specifically, I asked whether different users could access the same in-memory structure, and was told that this is a little klugey but not too horrendous. That suggests Aster&#8217;s capability may be a strict superset of UDF-based (User-Defined Function) approaches to meeting the same need, at least from a functionality standpoint. However, ease of creating those in-memory structures may still be better in the more SQL/UDF-centric approach favored by Teradata.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/02/22/aster-data-ncluster-4-5/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Open issues in database and analytic technology</title>
		<link>http://www.dbms2.com/2010/02/01/open-issues-in-database-and-analytic-technology/</link>
		<comments>http://www.dbms2.com/2010/02/01/open-issues-in-database-and-analytic-technology/#comments</comments>
		<pubDate>Mon, 01 Feb 2010 22:04:31 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Presentations]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1507</guid>
		<description><![CDATA[The last part of my New England Database Summit talk was on open issues in database and analytic technology. This was closely intertwined with the previous section, and also relied on a lot that I&#8217;ve posted here. So I&#8217;ll just put up a few notes on that part, with lots of linkage to prior discussion [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">The last part of my <a href="http://www.dbms2.com/2009/11/25/new-england-database-summit-january-28-2010/" >New England Database Summit</a> talk was on open issues in database and analytic technology. This was closely intertwined with the <a href="http://www.dbms2.com/2010/01/31/trends-database-aanalytic-technology/" >previous section</a>, and also relied on a lot that I&#8217;ve posted here. So I&#8217;ll just put up a few notes on that part, with lots of linkage to prior discussion of the same points.<span id="more-1507"></span></p>
<ul>
<li>The most important issue in 	database and analytic technology, in my opinion, isn&#8217;t technological 	at all – rather, it&#8217;s the legal and political steps needed to <a href="http://www.dbms2.com/2010/01/31/data-based-snooping-threat-libert/" > preserve liberty</a> in the face of advancing, intrusive 	technology.</li>
<li>Another important issue for 	society – and this one does involve a lot of technology – is 	scientific number crunching. In particular, <a href="http://www.dbms2.com/2009/10/03/issues-in-scientific-data-management/" >database technology for 	scientific computing</a> needs to be developed much further. I&#8217;ll have 	more to say on all this soon.</li>
<li>More generally, technology needs 	to keep advancing for parallel analytics. Fortunately, it is. Watch 	this space over the next few weeks.</li>
<li>Oracle has said, in effect, that <a href="http://www.dbms2.com/2010/01/22/oracle-database-hardware-strategy/" > its most important technological challenge of the decade</a> is getting 	<a href="http://www.dbms2.com/2010/01/31/flash-pcmsolid-state-memory-disk/" >solid-state memory</a> right. I agree.</li>
<li>Data volumes will keep going up, 	up, up. Technology needs to keep evolving accordingly. Much of what 	I write is on that subject.</li>
<li>Data needs to be processed and analyzed at <a href="http://www.dbms2.com/2009/09/10/analytic-speed-latency/" >very 	different latencies</a>. And there&#8217;s much further to go in integrating 	disparate latencies.</li>
<li>Analytic database management in 	the cloud hasn&#8217;t been solved yet, especially for Big Data. Among the 	reasons are the difficulty of moving data into the cloud (unless it 	originated there), the slowness of moving it from node to node in 	shared-nothing architectures (which reduces the elasticity benefit), 	and above all the long and unpredictable latencies of interprocessor 	communication while queries are running (a key subject of discussion 	at the <a href="http://www.dbms2.com/2009/11/23/boston-big-data-summit-keynote-outline/" >Boston Big Data Summit</a>).</li>
<li>Better business intelligence user interfaces are increasingly available. I&#8217;m thinking particularly of approaches with buzzwords like <a href="http://www.dbms2.com/2008/08/04/qliktech-qlikview-update/" >visualization/interactive exploration</a> or <a href="http://www.texttechnologies.com/2007/08/03/the-case-for-inxight-awareness-server/" onclick="javascript:pageTracker._trackPageview('/www.texttechnologies.com');">faceted</a>. But they aren&#8217;t well integrated into the overall analytic stack, as big BI vendors are trailing the smaller ones in this regard. (Part of the problem relates to my previous point.)</li>
<li>Application development over text 	search isn&#8217;t in the same league as application development over 	relational DBMS. The choices are mainly XML (e.g., <a href="http://www.texttechnologies.com/2008/04/29/mark-logic-viewed-as-a-different-kind-of-text-search-technology-vendor/" onclick="javascript:pageTracker._trackPageview('/www.texttechnologies.com');">MarkLogic</a>), SQL 	for text integrated into RDBMS (limited by the weakness of those 	integrations), and something like <a href="http://www.texttechnologies.com/2008/09/20/attivio-update/" onclick="javascript:pageTracker._trackPageview('/www.texttechnologies.com');">Attivio&#8217;s Java SDK</a>. There&#8217;s a 	major conceptual barrier in building those apps, namely the 	unpredictability of query results. Still, it should be possible to 	do better.</li>
<li>Similarly, text analytics and 	conventional analytics exist well side by side. They can even be in 	the same database and/or dashboard, although in practice that is 	limited by the strong <a href="http://www.texttechnologies.com/2008/10/24/attensity-update-2/" onclick="javascript:pageTracker._trackPageview('/www.texttechnologies.com');">SaaS focus of text mining vendors and users</a>. But analytic 	integration of them is really hard. Linguistic imprecision is, in my 	opinion, only the #2 reason for this difficulty. The #1 reason is 	that trends detected by text analytics are much less precise than 	trends on tabular data – e.g., a 50% increase in a certain kind of 	complaint may be no more significant than a 5% change in a revenue 	variable.</li>
<li>I&#8217;m increasingly persuaded that <a href="http://www.dbms2.com/2009/08/21/social-network-analysis-aka-relationship-analytics/" > graph analytics</a> can be handled without a graph-centric data model. 	But right now, it isn&#8217;t being handled well at all. Lots more needs 	to be done – although when it is, it will just exacerbate the 	privacy/liberty dangers that so concern me.</li>
</ul>
<p><em><strong>Other posts based on my January 2010 New England Database Summit keynote address</strong></em></p>
<ul>
<li><a title="Data-based snooping — a huge threat to liberty that we’re all helping make worse" href="../2010/01/31/data-based-snooping-threat-libert/">Data-based snooping — a huge threat to liberty that we’re all helping make worse</a></li>
<li><a title="Flash, other solid-state memory, and disk" href="../2010/01/31/flash-pcmsolid-state-memory-disk/">Flash, other solid-state memory, and disk</a></li>
<li><a title="Interesting trends in database and analytic technology" href="../2010/01/31/trends-database-aanalytic-technology/">Interesting trends in database and analytic technology</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/02/01/open-issues-in-database-and-analytic-technology/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Webinar on MapReduce for complex analytics (Thursday, December 3, 10 am and 2 pm Eastern)</title>
		<link>http://www.dbms2.com/2009/12/02/mapreduce-for-complex-analytics-webina/</link>
		<comments>http://www.dbms2.com/2009/12/02/mapreduce-for-complex-analytics-webina/#comments</comments>
		<pubDate>Wed, 02 Dec 2009 20:57:50 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Data integration and middleware]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1267</guid>
		<description><![CDATA[The second in my two-webinar series for Aster Data will occur tomorrow, twice (both live), at 10 am and 2 pm Eastern time. The other presenters will be Jonathan Goldman, who was a Principal Scientist at LinkedIn but now has joined Aster himself, and Steve Wooledge of Aster (playing host). Key links are:

Registration for tomorrow&#8217;s [...]]]></description>
			<content:encoded><![CDATA[<p>The second in my two-webinar series for Aster Data will occur tomorrow, twice (both live), at 10 am and 2 pm Eastern time. The other presenters will be Jonathan Goldman, who was a Principal Scientist at LinkedIn but has now joined Aster himself, and Steve Wooledge of Aster (playing host). Key links are:</p>
<ul>
<li>Registration for <a href="http://www.asterdata.com/wc_091203_masteringmapreduce/" onclick="javascript:pageTracker._trackPageview('/www.asterdata.com');">tomorrow&#8217;s webinars</a></li>
<li>Replay of the <a href="http://www.asterdata.com/masteringmapreduce2/" onclick="javascript:pageTracker._trackPageview('/www.asterdata.com');"> first webinar</a></li>
<li>My slides from the <a href="http://www.dbms2.com/2009/10/15/mapreduce-webinar-slides/" >first webinar</a></li>
</ul>
<p>The main subjects of the webinar will be:</p>
<ul>
<li>Some review of material from the first webinar (all three presenters)</li>
<li>Discussion of how MapReduce can help with three kinds of analytics:
<ul>
<li>Pattern matching (Jonathan will give detail)</li>
<li>Number-crunching (I&#8217;ll cover that, and it will be short)</li>
<li>Graph analytics (I haven&#8217;t written the slides yet, but my starting point will be some of the <a href="http://www.dbms2.com/2009/08/21/social-network-analysis-aka-relationship-analytics/" >relationship analytics</a> ideas we discussed in August)</li>
</ul>
</li>
</ul>
<p>Arguably, aspects of data transformation fit into each of those three categories, which may help explain why data transformation has been so prominent among the early applications of MapReduce.</p>
<p>As you can see from Aster&#8217;s title for the webinar (which they picked while I was on vacation), at least their portion will be focused on customer analytics, e.g. web analytics.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/12/02/mapreduce-for-complex-analytics-webina/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>
