<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; RDF and graphs</title>
	<atom:link href="http://www.dbms2.com/category/datatype/rdf-graph-database/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Wed, 08 Feb 2012 12:22:57 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>Aster Data business trends</title>
		<link>http://www.dbms2.com/2011/09/08/aster-data-business-trends/</link>
		<comments>http://www.dbms2.com/2011/09/08/aster-data-business-trends/#comments</comments>
		<pubDate>Thu, 08 Sep 2011 05:33:56 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Application areas]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[DataStax]]></category>
		<category><![CDATA[Liberty and privacy]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5204</guid>
		<description><![CDATA[Last month, I reviewed with the Aster Data folks which markets they were targeting and selling into, subsequent to acquisition by their new orange overlords. The answers aren&#8217;t what they used to be. Aster no longer focuses much on what it used to call frontline (i.e., low-latency, operational) applications; those are of course a key [...]]]></description>
			<content:encoded><![CDATA[<p>Last month, I reviewed with the Aster Data folks which markets they were targeting and selling into, subsequent to <a href="../../../../../2011/03/04/teradata-aster-data-ncluster/">acquisition</a> by their new orange overlords. The answers aren&#8217;t what they used to be. Aster no longer focuses much on what it used to call <a href="../../../../../2008/10/22/aster-data-systems-ncluster/">frontline</a> (i.e., low-latency, operational) applications; those are of course a key strength for Teradata. Rather, Aster focuses on <a href="../../../../../2011/03/03/investigative-analytics/">investigative analytics</a> &#8212; they&#8217;ve long <a href="../../../../../2011/02/12/upcoming-webinar-on-investigative-analytics/">endorsed</a> my use of the term &#8212; and on the batch run/scoring kinds of applications that inform operational systems.</p>
<p><span id="more-5204"></span>Also, Aster no longer focuses much on the general internet industry where it got its earliest sales, its <a href="../../../../../2011/09/05/zynga-linkedin-data-warehous/">continued success at LinkedIn</a> and a recent win at <span style="text-decoration: line-through;">an (NDA) fairly-big-name internet new account</span> <em>Razorfish</em> notwithstanding. That said, the first target market Aster did share with me was &#8220;digital marketing optimization,&#8221; which includes &#8220;marketing optimization&#8221; (duh), search engine optimization (SEO), clickstream analysis, and the like. Also, Aster is going after &#8220;data scientists&#8221; in general, and that&#8217;s a term I&#8217;m still seeing used most frequently in the internet area.</p>
<p><em>I&#8217;m seeing ever more granularity as companies break down internet-related market segments. DataStax showed me a chart last week of 15 different market segments it had sold into, and at least 14 were in some way internet-related.</em></p>
<p>Rather, if Aster is to name three industries in which it has pleasingly strong sales traction, it would say manufacturing (which in Teradata lingo includes resource extraction), financial services (including insurance), and retail. A cynic might note that that breakdown, like many similar ones, adds up to fairly large swaths of the economy and the computer market, but never mind that part. (Other firms might have thrown in telecommunications and health care as well, to get even more coverage.)</p>
<p>Two of Aster&#8217;s other favorite application areas are social network analysis/influencer identification and &#8212; which is analytically very similar &#8212; fraud detection/prevention. Taken together, that&#8217;s a whole lot of graph analysis. And I note with interest that the influencer identification stuff does NOT seem to be concentrated in telecom, which is the traditional sector one would imagine it being used in; all those call records are a lovely source of graph edges. Rather, the influencers seem to be identified from sources such as social media and credit card data.</p>
<p><em>Once again, this kind of thing gives me privacy jitters.</em></p>
<p>The match between Aster&#8217;s favorite industries and application areas is pretty much as you might expect &#8212; fraud in financial services, influencer analysis in retailing (and probably consumer financial services too), and digital marketing in both. As for manufacturing, the opportunities there seem to be focused on machine-generated data. That would be at least in high-tech manufacturing (I bet especially in flow-oriented stuff such as semiconductor fab) and oil/gas. Smart grid opportunities don&#8217;t seem to have arisen yet for Aster the way they have for a couple of other vendors.</p>
<p>As for general Aster business trends, I think they&#8217;re good, while Aster would perhaps want to portray them as very good. Aster named a couple of impressive joint Teradata/Aster wins under NDA, but only a couple. Ramping up sales headcount is proving challenging, and some sales leadership turnover probably hasn&#8217;t helped. I do believe Aster&#8217;s spin that this is a matter of somebody being promoted quickly to a bigger job, and am optimistic about the current team &#8212; still, such moves tend to have at least short-term cost.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/09/08/aster-data-business-trends/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Vertica as an analytic platform</title>
		<link>http://www.dbms2.com/2011/06/20/vertica-as-an-analytic-platform/</link>
		<comments>http://www.dbms2.com/2011/06/20/vertica-as-an-analytic-platform/#comments</comments>
		<pubDate>Mon, 20 Jun 2011 06:13:18 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[GIS and geospatial]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Workload management]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4775</guid>
		<description><![CDATA[Vertica 5.0 is coming out today, and delivering the down payment on Vertica&#8217;s analytic platform strategy. In Vertica lingo, there&#8217;s now a Vertica SDK (Software Development Kit), featuring Vertica UDT(F)s* (User-Defined Transform Functions). Vertica UDT syntax basics start:  In this release, Vertica UDTFs can only be written in C++. Other UDTF languages are promised. Otherwise, [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.dbms2.com/2011/06/20/vertica-release-5/">Vertica 5.0</a> is coming out today, and delivering the down payment on Vertica&#8217;s <a href="../../../../../2011/02/24/analytic-platforms/">analytic platform</a> strategy. In Vertica lingo, there&#8217;s now a Vertica SDK (Software Development Kit), featuring Vertica UDT(F)s* (User-Defined Transform Functions). Vertica UDT syntax basics start:  <span id="more-4775"></span></p>
<ul>
<li>In this release, Vertica UDTFs can only be written in C++. Other UDTF languages are promised.</li>
<li>Otherwise, Vertica UDTFs sound pretty flexible; in particular:
<ul>
<li>They can ingest and emit any number of rows.</li>
<li>Their assumed schemas can be defined programmatically (both input and output).</li>
</ul>
</li>
<li>Vertica UDTFs go in the SELECT clause, not the FROM clause. I must confess to not grasping Vertica&#8217;s argument as to why this provides great and important flexibility.</li>
<li>UDTF syntax mirrors SQL 99 Analytics pretty closely.</li>
</ul>
<p><em>*It looks like the &#8220;F&#8221; is in the official name, but will often be dropped colloquially.</em></p>
<p>Other Vertica analytic platform highlights include:</p>
<ul>
<li>Proper integrated UDT workload management is promised, and there&#8217;s a little bit of UDT workload management already.*</li>
<li>Vertica is delivering some prebuilt functions for aggregation, statistics, etc.</li>
<li>Vertica has cool <a href="http://www.dbms2.com/2011/06/20/temporal-data-time-series-and-imprecise-predicates/">temporal and time series features</a>.</li>
<li>Vertica&#8217;s geospatial support seems pretty basic (circles and rectangles).</li>
<li>Vertica&#8217;s NDA plans moving forward are pretty much as one would hope.</li>
</ul>
<p><em>*Vertica&#8217;s UDT workload management is RAM-only, and &#8220;honor system&#8221; &#8212; i.e., it assumes that the UDTFs declare their resource usage correctly, which Vertica says is the right way to handle in-process C++ routines.</em></p>
<p>Vertica also argues that fast-performing SQL in and of itself can amount to analytic functionality. For example, Vertica has tried to ensure that it offers great performance in the kinds of self-joins that are used in graph analysis. Since Vertica has plenty of customers among the kinds of Web and telco companies that use graph analysis today, I&#8217;m inclined to grant some benefit of the doubt here. That said, Vertica thinks 3 hops is plenty for most kinds of graph analysis people want to do, and I can think of applications (e.g. anti-terrorism) where that&#8217;s surely not the case.</p>
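<p>To make the self-join point concrete, here is a minimal sketch of a three-hop reachability query. It uses Python's built-in <code>sqlite3</code> module purely as a stand-in SQL engine; the <code>edges</code> table, its columns, and the sample call data are hypothetical, and nothing here reflects Vertica-specific syntax or performance.</p>

```python
import sqlite3


def reachable_within_three_hops(edges, start):
    """Return the nodes reachable from `start` in one to three hops.

    Each extra hop is one more self-join of the edge table, which is
    why self-join performance matters for this kind of graph analysis.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
    conn.executemany("INSERT INTO edges VALUES (?, ?)", edges)
    query = """
        SELECT e1.dst FROM edges e1
            WHERE e1.src = ?
        UNION
        SELECT e2.dst FROM edges e1
            JOIN edges e2 ON e2.src = e1.dst
            WHERE e1.src = ?
        UNION
        SELECT e3.dst FROM edges e1
            JOIN edges e2 ON e2.src = e1.dst
            JOIN edges e3 ON e3.src = e2.dst
            WHERE e1.src = ?
    """
    rows = conn.execute(query, (start, start, start)).fetchall()
    conn.close()
    return sorted(row[0] for row in rows)


# Hypothetical call records: who called whom.
calls = [("ann", "bob"), ("bob", "cat"), ("cat", "dan"), ("dan", "eve")]
print(reachable_within_three_hops(calls, "ann"))
```

<p>In this toy data set "eve" sits four hops from "ann", so a three-hop query never sees her. Going further means adding yet another self-join, which is where the cost starts to climb.</p>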
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/06/20/vertica-as-an-analytic-platform/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Investigative analytics and derived data: Enzee Universe 2011 talk</title>
		<link>http://www.dbms2.com/2011/06/19/investigative-analytics-derived-data/</link>
		<comments>http://www.dbms2.com/2011/06/19/investigative-analytics-derived-data/#comments</comments>
		<pubDate>Sun, 19 Jun 2011 12:13:04 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[GIS and geospatial]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4747</guid>
		<description><![CDATA[I&#8217;ll be speaking Monday, June 20 at IBM Netezza&#8217;s Enzee Universe conference. Thus, as is my custom: I&#8217;m posting draft slides. I&#8217;m encouraging comment (especially in the short time window before I have to actually give the talk). I&#8217;m offering links below to more detail on various subjects covered in the talk. The talk concept [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ll be speaking Monday, June 20 at IBM Netezza&#8217;s <a href="http://www.netezza.com/userconference/abstract.html#620_1200">Enzee Universe</a> conference. Thus, as is my custom:</p>
<ul>
<li>I&#8217;m posting draft <a href="http://www.monash.com/uploads/Enzee-Universe-2011-Monash.ppt">slides</a>.</li>
<li>I&#8217;m encouraging comment (especially in the short time window before I have to actually give the talk).</li>
<li>I&#8217;m offering links below to more detail on various subjects covered in the talk.</li>
</ul>
<p>The talk concept started out as &#8220;advanced analytics&#8221; (as opposed to fast query, a subject amply covered in the rest of any Netezza event), as a lunch break in what is otherwise a detailed &#8220;best practices&#8221; session. So I suggested we constrain the subject by focusing on a specific application area &#8212; customer acquisition and retention, something of importance to almost any enterprise, and which exploits most areas of analytic technology. Then I actually prepared the slides &#8212; and guess what? The mix of subjects will be skewed somewhat more toward generalities than I first intended, specifically in the areas of <strong>investigative analytics </strong>and<strong> derived data. </strong>And, as always when I speak, I&#8217;ll try to raise consciousness about the issues of <a href="../../../../../2011/01/10/privacy-dangers-an-overview/">liberty and privacy</a>, our <a href="../../../../../2010/07/04/fair-data-use/">options as a society for addressing them</a>, and the crucial role we play as an industry in <a href="../../../../../2010/04/04/privacy-liberty-continued/">helping policymakers deal with these technologically-intense subjects</a>.</p>
<p>Slide 3 refers back to a post I made last December, saying there are <a href="../../../../../2011/01/03/the-six-useful-things-you-can-do-with-analytic-technology/">six useful things you can do with analytic technology</a>:</p>
<ul>
<li><strong>Operational BI/Analytically-infused operational apps:</strong> You can make an immediate decision.</li>
<li><strong>Planning and budgeting:</strong> You can plan in support of future decisions.</li>
<li><strong>Investigative analytics (multiple disciplines):</strong> You can research, investigate, and analyze in support of future decisions.</li>
<li><strong>Business intelligence:</strong> You can monitor what’s going on, to see when it is necessary to decide, plan, or investigate.</li>
<li><strong>More BI:</strong> You can communicate, to help other people and organizations do these same things.</li>
<li><strong>DBMS, ETL, and other &#8220;platform&#8221; technologies:</strong> You can provide support, in technology or data gathering, for one of the other functions.</li>
</ul>
<p>Slide 4 observes that <a href="http://www.dbms2.com/2011/03/03/investigative-analytics/">investigative analytics</a>:</p>
<ul>
<li>Is the most rapidly advancing of the six areas &#8230;</li>
<li>&#8230; because it most directly exploits performance &amp; scalability.</li>
</ul>
<p>Slide 5 gives my simplest overview of investigative analytics technology to date:  <span id="more-4747"></span></p>
<ul>
<li>Fast query
<ul>
<li>Persistent storage (any data volume)</li>
<li>RAM (10s-100s of gigabytes, or more)</li>
</ul>
</li>
<li>Fast analytics
<ul>
<li>Predictive modeling</li>
<li>Transformation/tagging</li>
<li>Graph</li>
</ul>
</li>
</ul>
<p>Slide 6 points out that this is all supported by cheap data creation and acquisition, specifically in the area of <a href="http://www.dbms2.com/2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a>, which gets the full benefit of Moore&#8217;s Law.</p>
<p>Slides 7-13 point out how the example problem domain involves lots of analytic tasks performed on lots of kinds of data. Specific examples cited include <a href="http://www.dbms2.com/2011/04/14/attensity-update/">text analytics</a> and <a href="http://www.dbms2.com/2009/08/21/social-network-analysis-aka-relationship-analytics/">graph/relationship analytics</a>.</p>
<p>Slide 14 contains the punch line, so I&#8217;ll quote it in full:</p>
<blockquote><p>Derived data</p>
<ul>
<li>You can’t keep re-analyzing all that in raw form …</li>
<li>&#8230; so don’t.</li>
</ul>
<p><em>If you have one takeaway from this session, let it be the utter importance of derived data. </em></p></blockquote>
<p>Slide 16 lists kinds of <a href="http://www.dbms2.com/2011/05/30/another-category-of-derived-data/">derived data</a> that are important in the single application of reducing telco churn:</p>
<ul>
<li>Normalized data
<ul>
<li>Parsed/sessionized logs</li>
<li>Text/sentiment highlights</li>
<li>Social network graph(s)</li>
<li>Web de-anonymization</li>
<li>Household matching</li>
</ul>
</li>
<li>Scores and buckets
<ul>
<li>Demographic</li>
<li>Psychographic</li>
<li>Offer hot buttons</li>
<li>(Dis)satisfaction</li>
<li>Credit/fraud risk</li>
<li>Lifetime customer value</li>
<li>Influence on others!</li>
</ul>
</li>
</ul>
<p>And finally, Slide 17 is my first pass at best practices for dealing with derived data:</p>
<ul>
<li>Evolving data warehouse schema</li>
<li>Data marts
<ul>
<li>Physical or virtual</li>
<li>Inputs/outputs to “EDW”</li>
</ul>
</li>
<li>“Data science”
<ul>
<li>Research != production</li>
</ul>
</li>
<li>Multiple processing pipelines
<ul>
<li>Log parsing</li>
<li>Text</li>
<li>Predictive analytics</li>
<li>Generic ETL</li>
<li>Streaming “ETL”</li>
</ul>
</li>
</ul>
<p>That last list looks like a starting point for a whole set of interesting future posts.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/06/19/investigative-analytics-derived-data/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Whither MarkLogic?</title>
		<link>http://www.dbms2.com/2011/04/05/whither-marklogic/</link>
		<comments>http://www.dbms2.com/2011/04/05/whither-marklogic/#comments</comments>
		<pubDate>Wed, 06 Apr 2011 02:06:51 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[Object]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Structured documents]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4168</guid>
		<description><![CDATA[My clients at MarkLogic have a new CEO, Ken Bado, even though former CEO Dave Kellogg was quite successful. If you cut through all the happy talk and side issues, the reason for the change is surely that the board wants to see MarkLogic grow faster, and specifically to move beyond its traditional niches of [...]]]></description>
			<content:encoded><![CDATA[<p>My clients at MarkLogic have a new CEO, Ken Bado, even though former CEO Dave Kellogg was quite successful. If you cut through all the happy talk and side issues, the reason for the change is surely that the board wants to see MarkLogic grow faster, and specifically to move beyond its traditional niches of publishing (especially technical publishing) and national intelligence.</p>
<p>So what other markets could MarkLogic pursue? Before Ken even started work, I sent over some thoughts. They included (but were not limited to):  <span id="more-4168"></span></p>
<ul>
<li>Everybody now knows that not all problems require a relational DBMS.  The NoSQL movement has seen to that.</li>
<li>Not everybody agrees that &#8220;heavyweight&#8221; enterprise DBMS are  needed for everything. The NoSQL movement has seen to that too.</li>
<li><a href="http://www.dbms2.com/2011/02/07/notes-on-document-oriented-nosql/">The &#8220;document&#8221;/&#8221;object&#8221; DBMS distinction has long been blurry</a>. XML is  full of documents, but they&#8217;re really objects. The same goes for the  JSON/quasi-JSON objects of CouchDB/Couchbase and MongoDB.  Object-oriented DBMS vendors have dabbled in XML on and off over the  years because of technical similarity. Etc.</li>
<li>MarkLogic has always focused on markets  where the database truly was about documents in the conventional sense &#8212; especially long text documents &#8212;  aka &#8220;content&#8221;. I always thought that focus was over-narrow.</li>
<li>There are various cases where law, regulation, compliance etc.  mandate the production of long text documents. I&#8217;m not sure MarkLogic has  penetrated those as well as it could have.</li>
<li>Graph DBMS  technology is going nowhere fast, largely because nobody has solved the  data distribution problem in cases big enough to need scale-out, and the  technology isn&#8217;t obviously needed in single-server cases. (But see my post on <a href="http://www.dbms2.com/2010/06/19/objectivity-infinite-graph/">Objectivity&#8217;s Infinite Graph</a>.) Even so,  graph-oriented apps are exploding, and MarkLogic should think about playing in the graph area, even if by acquisition.</li>
<li> I think what I  described in <a href="../../../../../2010/06/08/profile-of-revealed-preferences/" target="_blank">http://www.dbms2.com/2010/06/08/profile-of-revealed-preferences/</a> is non-relational and a very big market. Playing there with a  &#8220;heavyweight&#8221; DBMS is of course a challenge.</li>
<li>Coming over from Autodesk, Ken Bado hopefully knows more about the product  data management business than I do.</li>
</ul>
<p>It will be interesting to see what MarkLogic actually does.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/04/05/whither-marklogic/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Analytic performance &#8212; the persistent need for speed</title>
		<link>http://www.dbms2.com/2011/03/24/analytic-performance-the-persistent-need-for-speed/</link>
		<comments>http://www.dbms2.com/2011/03/24/analytic-performance-the-persistent-need-for-speed/#comments</comments>
		<pubDate>Thu, 24 Mar 2011 14:41:32 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[RDF and graphs]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4061</guid>
		<description><![CDATA[Analytic DBMS and other analytic platform technologies are much faster than they used to be, both in absolute and price/performance terms. So the question naturally arises, &#8220;When is the performance enough?&#8221; My answer, to a first approximation, is &#8220;Never.&#8221; Obviously, your budget limits what you can spend on analytics, and anyhow the benefit of incremental [...]]]></description>
			<content:encoded><![CDATA[<p>Analytic DBMS and other <a href="../../../../../2011/02/24/analytic-platforms/">analytic platform</a> technologies are much faster than they used to be, both in absolute and price/performance terms. So the question naturally arises, &#8220;When is the performance enough?&#8221; My answer, to a first approximation, is &#8220;Never.&#8221; Obviously, your budget limits what you can spend on analytics, and anyhow the benefit of incremental expenditure at some point can grow quite small. But if analytic processing capabilities were infinite and free, we&#8217;d do a lot more with analytics than anybody would consider today.</p>
<p>I have two lines of argument supporting this view. One is application-oriented. <a href="../../../../../2010/12/30/examples-and-definition-of-machine-generated-data/">Machine-generated data</a> will keep growing rapidly. So using that data requires ever more processing resources as well. Analytic growth, rah-rah-rah; company valuation, sis-boom-bah. Application areas include but are not at all limited to marketing, law enforcement, investing, logistics, resource extraction, health care, and science.</p>
<p>The other approach is to point out some computational areas where vastly more analytic processing resources could be used than are available today. Consider, if you will, <strong>statistical modeling, graph analytics, optimization, </strong>and <strong>stochastic planning.  <span id="more-4061"></span><br />
</strong></p>
<p><strong>Statistical</strong> practice, in many cases, still goes something like this:</p>
<ul>
<li>A data set has, for example, <a href="../../../../../2011/03/13/so-how-many-columns-can-a-single-table-have-anyway/">a thousand columns</a>.</li>
<li>Statisticians carefully choose a few dozen columns to model on.</li>
<li>They then also decide how to modify data in some of the columns for better modeling (binning, filling in nulls, whatever).</li>
<li> A linear regression ensues.</li>
</ul>
<p>That all makes sense. Sometimes using fewer variables gives better results than using more of them (because of over-fitting), and you have to pick: You can&#8217;t realistically try all 2^1000-1 variable combinations; if you allowed quadratic terms too, you&#8217;d approach 2^500,000 combinations; and the possibilities expand from there. But suppose you actually did have unlimited computational resources? Then, if nothing else, you could do a whole lot of regressions, followed by some kind of meta-analysis on the results. That&#8217;s so beyond the realm of computational reality I doubt the mathematics of same has even been carefully worked out &#8212; which is exactly my point.</p>
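<p>The arithmetic behind those exponents can be checked with a short sketch. It assumes that "quadratic terms" means every squared column plus every pairwise product, which is one common reading; the function names are illustrative only.</p>

```python
from math import comb, floor, log10


def candidate_terms(columns, quadratic=False):
    """Count the terms a regression could draw on."""
    terms = columns
    if quadratic:
        # Pairwise products C(columns, 2) plus one square per column.
        terms += comb(columns, 2) + columns
    return terms


def subset_count_digits(terms):
    """Decimal digits in 2**terms - 1, the number of non-empty subsets.

    2**terms is never a power of ten, so subtracting 1 does not change
    the digit count, and a logarithm avoids building the huge integer.
    """
    return floor(terms * log10(2)) + 1


linear = candidate_terms(1000)
quadratic = candidate_terms(1000, quadratic=True)
print(linear, "terms, subset count has", subset_count_digits(linear), "digits")
print(quadratic, "terms, subset count has", subset_count_digits(quadratic), "digits")
```

<p>Under that assumption, 1,000 columns yield 501,500 candidate terms, which is where an exponent in the neighborhood of 500,000 comes from.</p>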
<p><strong>Graph analytics,</strong> to a first approximation, takes <em>order</em>(N*(E^H)) steps, where N is the number of nodes, H is the number of hops you want to go out, and E is the average number of edges per node. And that&#8217;s only for the simple stuff, which might produce inputs into further analytic steps. Such numbers get forbiddingly big, really fast.</p>
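<p>A few lines of code show how quickly that estimate blows up. The node count and average degree below are made-up round numbers, chosen only to illustrate the N*(E^H) formula from the paragraph above.</p>

```python
def traversal_steps(nodes, avg_edges, hops):
    """Rough step count N * E**H for exploring every node out to H hops."""
    return nodes * avg_edges ** hops


# Hypothetical graph: one million nodes, 100 edges per node on average.
for hops in range(1, 7):
    steps = traversal_steps(nodes=1_000_000, avg_edges=100, hops=hops)
    print(f"{hops} hops: {steps:.1e} steps")
```

<p>Each additional hop multiplies the work by the average degree, so in this example every extra hop costs a hundred times more than the one before it.</p>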
<p>Last year I talked with an LTL (Less than TruckLoad) shipping company. They had to decide which freight to put into which trucks, and then where to send the trucks. I thought for a moment, and said &#8220;In other words, the traveling salesman has to decide how to pack his knapsack?&#8221;* <strong>Optimization</strong> is computationally hard.</p>
<p><em>*Wikipedia has a wonderful <a href="http://en.wikipedia.org/wiki/List_of_NP-complete_problems">list of NP-complete problems</a>.</em></p>
<p>I was a stock analyst at the dawn of electronic spreadsheets. I actually did my first training spreadsheet using a calculator and green paper, and had one older colleague who still used a slide rule. The wonderful thing about making projections via electronic spreadsheets was that you could vary assumptions, then have the conclusions automatically recalculated. Bliss! (At least when compared to the alternatives.)</p>
<p>After a couple of months on the job, I circulated a memo about how one SHOULD do projections. I was told they almost fired me. And they had a point, because the computational power needed was ridiculous. The first part of the idea was to pull in every variable that seemed to make sense, and postulate (or test if possible) relationships among them. The second was to look at the outcomes under many different values of the independent variables.</p>
<p>In other words, I was advocating <strong>Monte Carlo stochastic planning</strong>. Well, guess what! Monte Carlo analysis is getting more widely productized, due to its usefulness in Basel 3 risk analysis. Traditional business planning still stinks. The time either has come or else is coming soon when traditional business planning should invoke Monte Carlo methods.</p>
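<p>For readers who have not seen the technique, a Monte Carlo projection can be sketched in a few lines: draw the input assumptions from distributions, compute the outcome, and repeat many times. Every variable, distribution, and number below is a hypothetical placeholder, not a recommendation for how any particular business should be modeled.</p>

```python
import random
import statistics


def simulate_profit(rng):
    """One scenario: draw hypothetical inputs, return next-year profit."""
    revenue_growth = rng.gauss(0.08, 0.05)   # assumed mean 8%, std dev 5%
    gross_margin = rng.uniform(0.35, 0.45)   # assumed margin range
    fixed_costs = rng.gauss(30.0, 2.0)       # assumed cost level
    revenue = 100.0 * (1.0 + revenue_growth)  # assumed base revenue of 100
    return revenue * gross_margin - fixed_costs


def monte_carlo_plan(trials=10_000, seed=42):
    """Run many scenarios and summarize the spread of outcomes."""
    rng = random.Random(seed)
    outcomes = sorted(simulate_profit(rng) for _ in range(trials))
    return {
        "mean": statistics.fmean(outcomes),
        "p05": outcomes[int(0.05 * trials)],
        "p95": outcomes[int(0.95 * trials)],
    }


print(monte_carlo_plan())
```

<p>The output is a range of outcomes rather than a single projected number, which is the whole point. The cost is that every added variable and every added trial consumes more computation.</p>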
<p><strong>Bottom line: The analytic need for speed will remain with us through the foreseeable future</strong> &#8212; and I didn&#8217;t even need to do a probabilistic analysis to figure that out. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/03/24/analytic-performance-the-persistent-need-for-speed/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Teradata, Aster Data, and Teradata/Aster</title>
		<link>http://www.dbms2.com/2011/03/04/teradata-aster-data-ncluster/</link>
		<comments>http://www.dbms2.com/2011/03/04/teradata-aster-data-ncluster/#comments</comments>
		<pubDate>Fri, 04 Mar 2011 10:44:04 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Teradata]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=3963</guid>
		<description><![CDATA[Teradata is acquiring Aster Data. Naturally, the deal is being presented with a Treaty of Tordesillas kind of positioning &#8212; Teradata does X, Aster Data does Y, and everybody looks forward to having X and Y in the same product portfolio. That said, my initial positioning and product strategy thoughts on the Teradata/Aster combination go [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.teradata.com/t/News-Releases/2011/Teradata-to-Acquire-Aster-Data/">Teradata is acquiring Aster Data</a>. Naturally, the deal is being presented with a <a href="http://www.strategicmessaging.com/dont-try-to-emulate-the-treaty-of-torsedillas/2010/10/03/">Treaty of Tordesillas</a> kind of positioning &#8212; Teradata does X, Aster Data does Y, and everybody looks forward to having X and Y in the same product portfolio. That said, my initial positioning and product strategy thoughts on the Teradata/Aster combination go something like this.  <span id="more-3963"></span></p>
<ul>
<li>The Teradata/Aster combination genuinely is complementary.</li>
<li>Teradata focuses on and excels at multi-thousand-user, often very operational data warehousing use cases.*
<ul>
<li>Teradata&#8217;s products also do other analytic DBMS tasks, but in most cases aren&#8217;t particularly differentiated at them.</li>
<li>The use cases Teradata excels at probably can support higher pricing than generic fast-query use cases anyway.</li>
</ul>
</li>
<li>Aster is focused on <a href="../../../../../2011/02/24/analytic-platforms/">analytic platform</a> use cases.
<ul>
<li>Aster indeed introduced the first/best <a href="../../../../../2010/02/22/aster-data-ncluster-4-5/">vision for analytic platforms</a>.</li>
<li>Aster Data nCluster isn&#8217;t particularly differentiated in, say, general fast-query use cases.</li>
<li>Aster can command higher prices in analytic platform use cases than generic fast-query ones anyway.</li>
<li>Recently, Aster has been particularly focused on mapping influence graphs, and not just in the telecom industry where the idea of graph-based churn analysis started.**</li>
<li>Aster&#8217;s old <a href="../../../../../2008/10/22/aster-data-systems-ncluster/">frontline</a> &#8212; aka operational &#8212; positioning seems like a distant memory.</li>
</ul>
</li>
<li>Teradata&#8217;s and Aster&#8217;s core code lines would be very hard to integrate, and hence will not be integrated for the foreseeable future. However:
<ul>
<li>Teradata and the Aster Data nCluster DBMS can and should each be enhanced to run exactly the same SQL.</li>
<li>Aster Data nCluster runs on generic commodity hardware, and that&#8217;s great. But the temptation to have Teradata&#8217;s hardware engineers design Aster-optimized hardware will eventually become irresistible.</li>
<li>The <a href="../../../../../2010/07/31/teradata-xkoto-gridscale-rip-and-active-active-clustering/">Xkoto-based replication</a> Teradata is working on should be extended to Aster Data nCluster.</li>
<li>There surely will be knowledge transfer between the Teradata and Aster development teams in various areas. (Workload management? Security? Compression, as they both improve in that area?)</li>
</ul>
</li>
</ul>
<p><em>*At one point I noticed Teradata reposition the classical <a href="../../../../../2010/04/12/enterprise-data-warehouse-edw-myt/">EDW</a> as the place for &#8220;trusted&#8221; analytic data. That&#8217;s actually a pretty good way of putting it, sweetening (and watering down) the stifling-EDW-bureaucracy lemons, so as to make them into an appealing lemonade.</em></p>
<p><em>**At least one analyst firm has gotten all excited about that, and is positioning Aster pretty much as a social media play. One might call that a case of not seeing the Forrest for the trees &#8212; but from a graph-theoretic standpoint, that would be wrong &#8230;</em></p>
<p>Let me say some more about ensuring that the Teradata and Aster product lines run the same SQL. It would probably be a Small Matter of Programming to give Aster nCluster the Teradata temporal SQL extensions; to make Teradata run <a href="../../../../../2009/06/09/aster-data-nclustersql-mapreduce/">SQL/MapReduce</a>, for some subset of the programming languages that Aster nCluster supports; and generally to make it so that most new things you&#8217;d want to develop on one platform would also run on the other. And since Teradata and Aster seem to be SAS&#8217;s two favorite DBMS partners &#8212; Teradata for market share and Aster for technology &#8212; <a href="../../../../../2010/05/07/in-database-sas-teradata-netezza-aster/">SAS-support</a> compatibility seems like a reasonable goal as well.</p>
<p>Benefits of such compatibility would include:</p>
<ul>
<li>Users could start developing on one platform, then acquire the other one only after they&#8217;d dipped their toes in the water.</li>
<li>In particular, a small-but-noisy group of analysts could be given a virtual slice of a Teradata system to play on until their Aster cluster made it through the budget approval cycle.</li>
<li>Disaster recovery, archiving, and so on could straddle product lines.</li>
</ul>
<p>Performance might be very different between the two product lines, of course. And certain programming techniques might not carry over easily, such as user-defined functions (UDF), stored procedures, or some non-SQL analytic processes. Still, all focused and differentiated product positioning notwithstanding, it would be a Very Good Idea to converge the Teradata and Aster platform capabilities faster than is normal in similar merger situations.</p>
<p>Finally, perhaps I should also say more about Teradata&#8217;s and Aster&#8217;s relative weakness in generic fast-analytic-query use cases.</p>
<ul>
<li>There are many use cases for which <a href="../../../../../2011/02/06/columnar-compression-database-storage/">columnar storage, columnar compression, or both</a> give winning performance. Teradata has <a href="http://www.teradataforum.com/l020829a.htm">a very limited form of columnar compression</a>, and no column storage. Aster has <a href="../../../../../2010/09/15/aster-data-ncluster-version-4-6/">a fairly basic form of column storage</a>, and no columnar compression.</li>
<li>Teradata is architected more for random than sequential I/O. In row-based systems, sequential I/O is often better for &#8220;big&#8221; queries.</li>
<li>While Teradata isn&#8217;t as expensive as it used to be, it&#8217;s also not particularly cheap if you&#8217;re just running some analytic queries.</li>
<li>Aster nCluster is just a younger product than many of its competitors. Perhaps there are some use cases for which <a href="../../../../../2009/08/21/bottleneck-whack-a-mole/">Bottleneck Whack-A-Mole</a> hasn&#8217;t yet gone far enough.</li>
</ul>
<p>And yes, the rumors about Aster&#8217;s customer comScore not getting into production seem to be true, even though comScore talked at the same <a href="../../../../../2010/04/18/washington-dc-may-2010-big-data-summi/">Aster-sponsored event</a> I did in May, 2010.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/03/04/teradata-aster-data-ncluster/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>The six useful things you can do with analytic technology</title>
		<link>http://www.dbms2.com/2011/01/03/the-six-useful-things-you-can-do-with-analytic-technology/</link>
		<comments>http://www.dbms2.com/2011/01/03/the-six-useful-things-you-can-do-with-analytic-technology/#comments</comments>
		<pubDate>Mon, 03 Jan 2011 12:12:05 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Cognos]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=3486</guid>
		<description><![CDATA[I seem to be in the mode of sharing some of my frameworks for thinking about analytic technology. Here’s another one. Ultimately, there are six useful things you can do with analytic technology: You can make an immediate decision. You can plan in support of future decisions. You can research, investigate, and analyze in support [...]]]></description>
			<content:encoded><![CDATA[<p><em>I seem to be in the mode of sharing some of my frameworks for thinking about analytic technology. Here’s another one. </em></p>
<p>Ultimately, there are six useful things you can do with analytic technology:</p>
<ul>
<li>You can make an immediate decision.</li>
<li>You can plan in support of future decisions.</li>
<li>You can research, investigate, and analyze in support of future decisions.</li>
<li>You can monitor what&#8217;s going on, to see when it is necessary to decide, plan, or investigate.</li>
<li>You can communicate, to help other people and organizations do these same things.</li>
<li>You can provide support, in technology or data gathering, for one of the other functions.</li>
</ul>
<p>Technology vendors often cite similar taxonomies, claiming to have all the categories (as they conceive them) nicely represented, in slickly integrated fashion. They exaggerate. Most of these categories are in rapid flux, and the rest should be. Analytic technology still has a long way to go.</p>
<p>In more detail:  <span id="more-3486"></span></p>
<ul>
<li>You can make an <strong>immediate      decision.</strong>
<ul>
<li>The decision can be made by:
<ul>
<li>A machine, and then also        executed by a machine, for example in:
<ul>
<li>Web site personalization.</li>
<li>Algorithmic trading.</li>
<li>Network security and/or load         balancing.</li>
</ul>
</li>
<li>A machine, and then executed by        a human, for example in a call center.</li>
<li>A human looking at a machine.
<ul>
<li>This case can be a lot slower         than the others.</li>
<li>Analytic-operational app         integration can be a special case of this, but progress is slow —         today’s reality resembles what I proposed in <a href="http://www.monash.com/whitepapers.html">a         2004 white paper</a>.</li>
</ul>
</li>
</ul>
</li>
<li>Technologies supporting immediate       decision making include:
<ul>
<li>SQL, most commonly.</li>
<li>Extensions that are getting        added into SQL DBMS, such as <a href="../../../../../2010/05/15/further-clarifying-in-database-mpp-sas/">in-database model scoring</a>.</li>
<li>Other search and query        languages.</li>
<li>Complex event processing (CEP),        whether rules-based or SQL-like.</li>
<li>Other rules engines, rarely.</li>
</ul>
</li>
</ul>
</li>
<li>You can <strong>plan </strong>in support of      future decisions.
<ul>
<li>Technologies that support       planning include:
<ul>
<li>Microsoft Excel, first and        foremost.</li>
<li>Budget-centric tools.</li>
<li>Forecasting.</li>
</ul>
</li>
<li>Planning       hasn’t advanced as well as one       would have hoped.
<ul>
<li>The integration of planning and        business intelligence has been uninspiring, a couple years of aggressive        marketing last decade — especially by Cognos — notwithstanding.</li>
<li>Specialty planning languages        always seem to disappoint. River Logic is a small vendor with a great        idea — which it hasn’t advanced rapidly since the 1990s. This is sadly        typical.</li>
</ul>
</li>
<li>One bright spot, however, has       been demand forecasting.</li>
</ul>
</li>
<li>You can <strong>research, investigate, and      analyze</strong> in support of future decisions.
<ul>
<li>I’ve just started calling this       area <strong>investigative analytics.</strong></li>
<li>In doing so, I am conflating       several disciplines:
<ul>
<li>Statistics, data mining, machine        learning, and/or predictive analytics. <em>(Note: I can’t get excited        about the distinctions between those closely overlapping technology        categories — apologies to <a href="../../../../../2010/10/10/it-can-be-hard-to-analyze-analytics/#comment-187073">Sam Madden</a> and others who do seem to        care.)</em></li>
<li>The more research-oriented aspects        of business intelligence tools:
<ul>
<li>Ad-hoc query.</li>
<li>Drilldown.</li>
<li>Most things done by BI-using         “business analysts.”</li>
<li>Most things within BI called         “data exploration.”</li>
</ul>
</li>
<li>Analogous technologies as        applied to non-tabular data types such as <a href="http://www.texttechnologies.com/2010/12/01/state-of-the-art-text-analytics-mining-applications/">text</a> or <a href="../../../../../2009/08/21/social-network-analysis-aka-relationship-analytics/">graph</a>.</li>
</ul>
</li>
<li>There’s a lot further to go.
<ul>
<li>It’s still very early days for        in-database analytic technology.</li>
<li>Any two of statistics, business        intelligence, and text analytics could be much better integrated with        each other than they are.</li>
</ul>
</li>
</ul>
</li>
<li>You can <strong>monitor</strong> what’s going on, to see when it is necessary to decide, plan, or investigate.
<ul>
<li>The guts of business intelligence       — reports and dashboards — are really monitoring tools.</li>
<li>Monitoring is the jumping-off       point for a lot of decision making, planning, and investigation. First       you notice the anomaly or need, then you set out to do something about       it.</li>
<li>I think this technology could use       <a href="../../../../../2010/07/25/alerts-metrics-dashboards/">a lot of improvement</a>.</li>
</ul>
</li>
<li>You can <strong>communicate,</strong> to      help other people and organizations do these same things.
<ul>
<li>Since the dawn of reporting, reports have been used as much to communicate among colleagues as they have to truly support personal decision-making.</li>
<li>BI vendors have done decent jobs       in recent years of advancing the communication aspects of BI, in two       respects:
<ul>
<li>General share-ability of reports        and the like.</li>
<li><a href="../../../../../2010/05/15/stakeholder-facing-analytics/">Stakeholder-facing BI</a>.</li>
</ul>
</li>
<li>But more profound BI-centric       collaboration is advancing too slowly.</li>
</ul>
</li>
<li>You can provide <strong>support,</strong> in      technology or <a href="http://www.monashreport.com/2006/10/04/data-mining-requires-data/">data gathering</a>, for one of the other      functions.
<ul>
<li>Well, duh. That’s most of what I       write about in this blog, especially in the areas of DBMS and       ETL/ELT/ETLT.</li>
</ul>
</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/01/03/the-six-useful-things-you-can-do-with-analytic-technology/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Big Data is Watching You!</title>
		<link>http://www.dbms2.com/2010/08/11/big-data-is-watching-you/</link>
		<comments>http://www.dbms2.com/2010/08/11/big-data-is-watching-you/#comments</comments>
		<pubDate>Wed, 11 Aug 2010 05:30:22 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Liberty and privacy]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2760</guid>
		<description><![CDATA[There&#8217;s a boom in large-scale analytics. The subjects of this analysis may be categorized as: People Financial trades Electronic networks Everything else The most varied, interesting, and valuable of those four categories is the first one. That may change some day, with the growing importance of machine-generated data, and of big-data science in particular. But [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">There&#8217;s a boom in large-scale analytics. The subjects of this analysis may be categorized as:</p>
<ul>
<li>People</li>
<li>Financial trades</li>
<li>Electronic networks</li>
<li>Everything else</li>
</ul>
<p style="margin-bottom: 0in;">The most varied, interesting, and valuable of those four categories is the first one.</p>
<p><span id="more-2760"></span></p>
<p style="margin-bottom: 0in;"><em>That may change some day, with the growing importance of <a href="http://www.dbms2.com/2010/04/08/machine-generated-data-example/">machine-generated data</a>, and of <a href="http://www.dbms2.com/2009/10/03/issues-in-scientific-data-management/">big-data science</a> in particular. But I think it&#8217;s a fair assessment at present, and for at least the next few years.</em></p>
<p style="margin-bottom: 0in;">Some of the most interesting use cases are concentrated in the areas of identifying individuals, groups of people, or behaviors of (groups of) people. For example:</p>
<ul>
<li>comScore works hard to <strong>identify 	individual web surfers </strong><span style="font-weight: normal;">– 	i.e. to </span><strong>deanonymize</strong><span style="font-weight: normal;"> them &#8212; even</span> though they may have given incomplete or false 	personal information.</li>
<li>Other companies at least try to 	figure out <strong>which information in a user&#8217;s profile is unreliable,</strong> so as to classify them better. (Yes, there are 62-year-old 	video-game-obsessed Lady Gaga fans, but that&#8217;s generally not the way 	to bet.)</li>
<li>Multiple telecom vendors try to 	identify who their <strong>most influential customers</strong> are (to a first 	approximation, they&#8217;re the ones most often called by the most 	people, but it surely gets more sophisticated than that). This 	information is then used to reduce churn, either by working hard to 	retain those users, or – if they do churn – to move very fast to 	retain the business from their friends.</li>
<li>Other kinds of companies do 	similar kinds of analysis, to the extent that they have enough of a 	social graph to do so. (This application is a case where the term 	“<a href="http://www.dbms2.com/2010/06/08/profile-of-revealed-preferences/">social graph</a>” is not a misnomer.)</li>
<li><strong>Turing detectives</strong> (I just 	coined that phrase) try to determine whether users are humans or 	bots.</li>
<li>Central to detecting <strong>insurance 	fraud</strong> is identifying suspiciously close connections between 	claimants, service providers, and so on.</li>
<li>Identifying groups of people is 	also important in flagging <strong>insider trading.</strong><span style="font-weight: normal;"> Even more important are other kinds of analysis, along the lines of 	“is this normal innocent trading behavior?” </span></li>
<li><span style="font-weight: normal;">Intelligence 	agencies try to detect networks of </span><strong>terrorists</strong><span style="font-weight: normal;"> and their sympathizers. They further try to identify unusual 	patterns of communication or meetings along those networks that 	might indicate terrorist acts are being planned. (Civilian law 	enforcement agencies can use similar techniques.)</span></li>
</ul>
<p style="margin-bottom: 0in; font-weight: normal;">In most cases, the analysis and/or run-time execution of the relevant models is done with the help of analytic DBMS. Other technologies that come into play include non-DBMS MapReduce (Hadoop), graph engines, and CEP (Complex Event Processing). The vendor most heavily represented on that list is probably Aster Data, because:</p>
<ul>
<li>Aster Data is 	focused on hard-core analytics.</li>
<li>I talk a lot 	with Aster Data, and in particular had a long, detailed use-cases 	discussion with them last week.</li>
<li><span style="font-weight: normal;">The 	comScore example happens to come from a speaker at </span><a href="http://www.dbms2.com/2010/05/07/implications-onew-analytic-technology/"><span style="font-weight: normal;">an 	Aster event</span></a><span style="font-weight: normal;"> I also 	participated in.</span></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-weight: normal;">And by the way, all this only scratches the surface of what will be possible down the road. It&#8217;s based mainly on where you live, what you purchase, how you behave on websites, and who you communicate with. </span><span style="color: #000080;"><span lang="zxx"><span style="text-decoration: underline;"><a href="../2010/07/04/fair-data-use/"><span style="font-weight: normal;">Other kinds of data, which could be used to be yet more intrusive</span></a></span></span></span><span style="font-weight: normal;">, generally aren&#8217;t involved.</span></p>
<p style="margin-bottom: 0in;"><span style="font-weight: normal;">I actually have two points in drawing up this list. One is golly-gee-whiz about how a lot of analytically sophisticated applications are actually getting into production. The other is to highlight the privacy and liberty threats If This Goes On Unchecked (which is why I didn&#8217;t include some other less-people-focused examples). There&#8217;s also a related danger that, to the extent we don&#8217;t get some smart regulations to keep us safe(r), we&#8217;ll get a bunch of stupid regulations instead. </span></p>
<p style="margin-bottom: 0in;"><span style="font-weight: normal;">The Analytic Era has only just begun.<br />
</span></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/08/11/big-data-is-watching-you/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Objectivity Infinite Graph</title>
		<link>http://www.dbms2.com/2010/06/19/objectivity-infinite-graph/</link>
		<comments>http://www.dbms2.com/2010/06/19/objectivity-infinite-graph/#comments</comments>
		<pubDate>Sat, 19 Jun 2010 12:05:45 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Liberty and privacy]]></category>
		<category><![CDATA[Object]]></category>
		<category><![CDATA[Objectivity and Infinite Graph]]></category>
		<category><![CDATA[RDF and graphs]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2306</guid>
		<description><![CDATA[I chatted Wednesday night with Darren Wood, the Australia-based lead developer of Objectivity&#8217;s Infinite Graph database product. Background includes: Objectivity is a profitable, decades-old object-oriented DBMS vendor with about 50 employees. Like some other object-oriented DBMS of its generation, Objectivity is as much a toolkit for building DBMS as it is a real finished DBMS [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">I chatted Wednesday night with Darren Wood, the Australia-based lead developer of Objectivity&#8217;s Infinite Graph database product. Background includes:</p>
<ul>
<li>Objectivity is a profitable, 	decades-old object-oriented DBMS vendor with about 50 employees.</li>
<li>Like some other object-oriented 	DBMS of its generation, Objectivity is as much a toolkit for 	building DBMS as it is a real finished DBMS product. Objectivity 	sales are typically for custom deals, where Objectivity helps with 	the programming.</li>
<li>The way Objectivity works is 	basically:
<ul>
<li>You manage objects in memory, in 	the format of your choice.</li>
<li>Objectivity bangs them to disk, 	across a network.</li>
<li>Objectivity manages the 	(distributed) pointers to the objects.</li>
<li>You can, if you choose, hard code 	exactly which objects are banged to which node.</li>
<li>Objectivity&#8217;s DML for reading data 	is very different from Objectivity&#8217;s DML for writing data. (I think 	the latter is more like the program code itself, while the former is 	more like regular DML.)</li>
<li>The point of Objectivity is not so 	much to have fast I/O. Rather, it is to minimize the CPU cost of 	getting the data that comes across the wire into useful form.</li>
</ul>
</li>
<li>Darren got the idea of putting a 	generic graph DBMS front-end on Objectivity while doing a 	<a href="http://www.dbms2.com/2009/08/21/social-network-analysis-aka-relationship-analytics/">relationship analytics</a> project for an Australian intelligence 	agency.</li>
<li>Darren redoubled his efforts to sell the project internally at Objectivity after reading what I wrote about relationship analytics back in 2006 or so.</li>
<li>There is now a team of 5 or so people developing Infinite Graph.</li>
<li>Infinite Graph is just now going 	out to beta test.</li>
</ul>
<p style="margin-bottom: 0in;"><a href="http://www.infinitegraph.com/">Infinite Graph</a> is an API or language binding on top of Objectivity that:</p>
<ul>
<li>Hides a lot of Objectivity&#8217;s 	complexity.</li>
<li>Is suitable for graph/relationship 	analytics.</li>
</ul>
<p style="margin-bottom: 0in;"><span id="more-2306"></span>The main point of the Infinite Graph beta test is to see whether Objectivity got the API right. By way of contrast, Objectivity is still just researching the DBMS optimization side of things. According to Darren, what makes that so hard is that if you partition the graph in some smart way, probably through some kind of costly algorithm to determine “least connectedness,” a little additional data can thoroughly invalidate your results. Thus, Darren is focused more on ensuring that performance is good even if data is distributed around the network in annoying ways.</p>
<p style="margin-bottom: 0in;">One performance win that Infinite Graph seems to get (almost?) for free from being built on top of Objectivity is lots of prefetching. Specifically, graph nodes and their edges are stored together, just like objects and their pointers are in traditional Objectivity &#8212; and if a node is retrieved, the nodes it&#8217;s connected to might also get retrieved as a background operation, before they&#8217;re even needed. More generally, Objectivity has always tried to be fast about traversing pointers, and that is a whole lot like traversing graph edges.</p>
<p style="margin-bottom: 0in;">As for the future, Infinite Graph is looking at ideas from <a href="http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html">Google&#8217;s Pregel</a>. As Darren characterizes it, in Pregel you wrap up information about a graph node and ship it off to another computing node if the next graph node you need is over there. Darren suspects that the extreme form of this strategy would not be ideal. (I gather from Darren that Google has realized the same thing from the get-go.) Instead, he&#8217;s pinning his hopes more on smarts about when to do that (costly) shipping, and when to just fetch the information back to the compute node currently being used.</p>
<p style="margin-bottom: 0in;">The most interesting part of our discussion, in my opinion, was about applications and application functionality. In a nutshell, Darren seems to think that it&#8217;s all about the edges, rather than the nodes themselves. (My words, not his.) In particular:</p>
<ul>
<li><strong>Edges are first-class citizens</strong> in Infinite Graph, just as nodes are.</li>
<li><strong>Graphs typically are polluted 	with lots of insignificant edges.</strong> Examples include:
<ul>
<li>If you&#8217;re tracking people&#8217;s telephone traffic, lots of folks call the local pizza parlor. Indeed, it&#8217;s common to look for “star” nodes like that, which have very high connectivity, and excise them from the graph to reduce noise.</li>
<li>Many measures of relationship 	include minor relationships. Facebook friends? LinkedIn connections? 	Occasional phone calls? Next door neighbors? All of those can 	indicate very minor relationships.</li>
</ul>
</li>
<li><span style="font-weight: normal;">Therefore, 	in Infinite Graph, </span><strong>edges (can) have weights.</strong> Darren 	says this is a widely-used capability in graph applications. The 	core reason is to let you distinguish between significant and 	insignificant edges. Note that these weights can be calculated based 	on the raw data and stored back into the database.</li>
<li><span style="font-weight: normal;">In 	Infinite Graph, </span><strong>edges can also have effectiveness date 	intervals.</strong> E.g., if you live at an address for a certain period, 	that&#8217;s when the edge connecting you to it is valid.</li>
<li>In general in Infinite Graph, 	<strong>edges can carry</strong><span style="font-weight: normal;"> arbitrary 	or at least flexible </span><strong>“qualifier”/attribute 	information.</strong></li>
<li><strong>For many applications, the 	number of possible nodes is fundamentally limited. </strong><span style="font-weight: normal;">There 	are only so many people in the world, so many street addresses, so 	many telephone numbers, and so on. (There was a time this wasn&#8217;t 	believed to be the case, because timestamping was done at the node 	rather than edge level. But I find persuasive Darren&#8217;s argument that 	it works better on edges.) <em>Edit: Even so, <a href="http://www.theregister.co.uk/2010/05/19/darpa_smite/">DARPA is thinking in the billions-of-nodes range</a>.</em><br />
</span></li>
<li><span style="font-weight: normal;">Darren 	is in general agreement with my observation that </span><a href="http://www.dbms2.com/2010/06/08/profile-of-revealed-preferences/"><span style="font-weight: normal;">the 	“social graph” shouldn&#8217;t primarily be regarded as a graph</span></a><span style="font-weight: normal;">.</span></li>
<li><span style="font-weight: normal;">Yes, 	the paradigmatic examples of intelligence agency graph analytics are 	telephone or even IP traffic analysis. Nodes can wind up with lots 	of edges connecting them. Full analysis of the graphs exceeds even 	the computing capacity available to governments.</span></li>
<li><span style="font-weight: normal;">On a happy civil liberties note, Darren observed that Australian intelligence has a lot of red tape restricting them from getting this kind of information. Basically, they can only get chunks of information “on demand”. An awkward side effect of this is that when they do get it, it could be in any number of formats.</span></li>
</ul>
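<p style="margin-bottom: 0in;">To make the edge-centric points above concrete, here is a minimal Python sketch of edges that carry a weight, an effectiveness date interval, and free-form qualifiers, plus a noise filter that drops “star” nodes and insignificant edges. It is purely illustrative: the names <code>Edge</code>, <code>degree_counts</code>, and <code>prune_edges</code>, and the thresholds used, are my own assumptions, and nothing here reflects Infinite Graph&#8217;s actual API.</p>

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Edge:
    """Hypothetical edge record: the edge, not the node, carries the metadata."""
    source: str
    target: str
    weight: float = 1.0                 # significance of the relationship
    valid_from: date = date.min         # start of the effectiveness interval
    valid_to: date = date.max           # end of the effectiveness interval
    qualifiers: dict = field(default_factory=dict)  # arbitrary attributes

    def is_valid_on(self, day):
        """True if `day` falls inside the edge's effectiveness interval."""
        return self.valid_from <= day <= self.valid_to


def degree_counts(edges):
    """Count how many edges touch each node."""
    counts = Counter()
    for edge in edges:
        counts[edge.source] += 1
        counts[edge.target] += 1
    return counts


def prune_edges(edges, day, min_weight, max_degree):
    """Keep edges valid on `day`, at least `min_weight`, and not touching a star node."""
    live = [e for e in edges if e.is_valid_on(day)]
    counts = degree_counts(live)
    # A "star" node (the local pizza parlor) has more than max_degree live edges.
    stars = {node for node, n in counts.items() if n > max_degree}
    return [
        e for e in live
        if e.weight >= min_weight
        and e.source not in stars
        and e.target not in stars
    ]


if __name__ == "__main__":
    calls = [
        Edge("alice", "pizza", 0.9),
        Edge("bob", "pizza", 0.9),
        Edge("carol", "pizza", 0.9),
        Edge("dave", "pizza", 0.9),
        Edge("alice", "bob", 0.8, qualifiers={"kind": "phone"}),
        Edge("alice", "carol", 0.05),
        Edge("bob", "old_address", 1.0, date(2001, 1, 1), date(2005, 12, 31)),
    ]
    kept = prune_edges(calls, date(2010, 6, 19), min_weight=0.5, max_degree=3)
    for e in kept:
        print(e.source, e.target, e.weight)
```

<p style="margin-bottom: 0in;">In this toy data set the pizza parlor is excised as a star node, the expired address edge is ignored, and the weak alice-to-carol edge is filtered out by weight, so only the alice-to-bob edge survives. As noted above, the weights themselves would typically be calculated from the raw data and stored back into the database rather than hand-assigned.</p>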
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/06/19/objectivity-infinite-graph/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>The most important part of the “social graph” is neither social nor a graph</title>
		<link>http://www.dbms2.com/2010/06/08/profile-of-revealed-preferences/</link>
		<comments>http://www.dbms2.com/2010/06/08/profile-of-revealed-preferences/#comments</comments>
		<pubDate>Tue, 08 Jun 2010 05:18:36 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Games and virtual worlds]]></category>
		<category><![CDATA[Liberty and privacy]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2235</guid>
		<description><![CDATA[“Social graph” is a highly misleading term, and so is “social network analysis.” By this I mean: There&#8217;s something akin to “social graphs” and “social network analysis” that is more or less worthy of all the current hype – but graphs and network analysis are only a minor part of the whole story. In particular, [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">“Social graph” is a highly misleading term, and so is “social network analysis.” By this I mean:</p>
<p><strong>There&#8217;s something akin to “social graphs” and “social network analysis” that is more or less worthy of all the current hype – but graphs and network analysis are only a minor part of the whole story.</strong></p>
<p style="margin-bottom: 0in;">In particular, <strong>the most important parts of the Facebook “social graph” are neither social nor a graph. </strong><span style="font-weight: normal;">Rather, what&#8217;s really important is an aggregate</span><strong> Profile of Revealed Preferences</strong><span style="font-weight: normal;">, of which person-to-person connections or other things best modeled by a graph play only a small part.</span></p>
<p style="margin-bottom: 0in;"><span id="more-2235"></span>Let me hasten to note that – even when viewed narrowly – the ideas of “social graph” and “<a href="../2009/08/21/social-network-analysis-aka-relationship-analytics/">social network analysis</a>” do have significance. Nontrivial use cases to date for big data social network analysis include:</p>
<ul>
<li>Intelligence agencies identify and 	analyze terrorist networks. Corporations and civilian law 	enforcement do the same for fraud networks.</li>
<li>Telephone companies use calling 	data to figure out which of their customers are most likely to 	influence which other customers in the decision to keep or change 	service providers. (Frankly, I find that rather creepy.)</li>
<li>Social networks figure out which 	other members you&#8217;re likely to know, and encourage you to connect 	with them.</li>
</ul>
<p style="margin-bottom: 0in;">Epidemiologists aspire to add to that list, based on their success to date using much more micro forms of social network analysis. But after that, I&#8217;m running out of examples. Sure, graph analytics is good for a bunch of other things (e.g., biology at the genetic or molecular level), but those have little or nothing to do with “social graphs” or social network analysis as they are commonly understood.</p>
<p style="margin-bottom: 0in;"><em>Note: Of course, it is also the case that everything can be modeled by entity-attribute-value triples, and those can always be modeled by graphs. But so what?</em></p>
<p style="margin-bottom: 0in;">Let&#8217;s consider what, in a marketer&#8217;s ideal world, would go into yo<span style="font-weight: normal;">ur Profile of Revealed Preferences. Raw data might include:</span></p>
<ul>
<li><strong>Personally identifying information. </strong>Duh. This is what makes everything else possible.</li>
<li><strong>Purchase transaction data.</strong> Lots of it. Like, everything on your credit card statements.</li>
<li><strong>Demographic and lifestyle 	information.</strong> Address, date of birth, educational history, race, 	household composition, and so on.</li>
<li><strong>Affiliations.</strong> Politics, 	religion, group membership of any kind. (OK, that&#8217;s partly social.)</li>
<li><strong>Explicitly stated opinions, 	preferences and desires,</strong><span style="font-weight: normal;"> including:</span>
<ul>
<li>Any surveys you have filled out.</li>
<li><strong>Any recommendations you have 	made</strong> (e.g., through the Facebook Like feature).</li>
<li>The text of anything you&#8217;ve 	written and posted – and, very ideally, of your private emails as 	well.</li>
<li>Any <strong>wish lists</strong> you&#8217;ve 	filled in.</li>
</ul>
</li>
<li><strong>Attention information.</strong> What 	you clicked on, what you looked at, and all that stuff website 	owners track.</li>
<li><strong>Your movements, </strong><span style="font-weight: normal;">to 	the extent they are tracked. (E.g., via Foursquare and the like.)</span></li>
<li><strong>Your gaming activities</strong><span style="font-weight: normal;"> and the like. (This is social mainly to the extent it overlaps with 	other categories I&#8217;ve already mentioned.)</span></li>
<li><strong>Your medical information.</strong></li>
<li><strong>Who you communicate with, and 	what you communicate with them about.</strong><span style="font-weight: normal;"> (Hey! There&#8217;s something else social!)</span></li>
<li><span style="font-weight: normal;">Similar </span><strong>information about the people you communicate with.</strong></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-weight: normal;">My core </span><strong>privacy</strong><span style="font-weight: normal;"> thoughts about that data include:</span></p>
<ul>
<li><strong>Individuals deserve the right to control all that information.</strong><span style="font-weight: normal;"> At a minimum, they deserve total control over how the data (raw or derived) is passed from the service – e.g., website – where it naturally resides (e.g., where it originated) to any other place.</span></li>
<li>Given a chance, <strong>individuals 	would make fine-grained choices about what parts of their Profile of 	Revealed Preferences are available to which organizations.</strong> Reasons include:
<ul>
<li>Individuals have rather complex 	trust relationships with different kinds of merchants and marketers.</li>
<li>Consumers get different benefits 	from sharing information with different kinds of merchants and 	marketers. (Sometimes personalization is a large benefit. Sometimes 	it&#8217;s just creepy. And some companies actively bribe you to give them 	information they can use to sell to you.)</li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;">When one frames things this way, two rather difficult technological questions naturally arise.</p>
<ol>
<li>Suppose, implausibly, that a 	single entity were allowed to control and use (for marketing) all of 	your Profile of Revealed Preferences information. How would they 	store and analyze it?</li>
<li>How does the answer to #1 change 	because control over the information will, in fact, be fragmented?</li>
</ol>
<p style="margin-bottom: 0in;">It&#8217;s tough enough to answer these questions for data about one person. Trying to include all but the simplest information about other people is and will for years remain quite infeasible. So, for the most part, <strong>this is not “social” information.</strong></p>
<p style="margin-bottom: 0in;">It&#8217;s also <strong>not naturally a “graph.”</strong> Similarly, it is <strong>not a good candidate for network analysis.</strong> To see why, let me outline <strong>why I used the name “Profile of Revealed Preferences”:</strong></p>
<ul>
<li>The reason marketers want this 	data is, mainly, because they want to know what appeals to you, and 	how strongly you feel about it.</li>
<li>The analytic process often entails 	taking explicit choices you have made, and inferring other 	preferences from them.</li>
<li>The output of the analytic process 	is often one or more “scores” that then get plugged into various 	selection algorithms to determine what you should be shown or 	offered. At least implicitly, these algorithms are predicting what 	you will or won&#8217;t respond well to.</li>
</ul>
<p style="margin-bottom: 0in;">Not much graph-like there.</p>
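<p style="margin-bottom: 0in;">To make that concrete, here is a minimal sketch of the kind of scoring process outlined above. It is an illustration only, not anybody&#8217;s actual system: the signal types, the weights, the category names, and the functions <code>score_preferences</code> and <code>select_offers</code> are all assumptions invented for the example.</p>

```python
from collections import defaultdict

# Hypothetical weights: how strongly each kind of explicit choice counts.
# These numbers are made up for illustration.
SIGNAL_WEIGHTS = {"purchase": 3.0, "like": 2.0, "wish_list": 1.5, "click": 0.5}


def score_preferences(events):
    """Turn (signal_type, category) events into one score per category."""
    scores = defaultdict(float)
    for signal_type, category in events:
        # Unknown signal types contribute nothing.
        scores[category] += SIGNAL_WEIGHTS.get(signal_type, 0.0)
    return dict(scores)


def select_offers(scores, offers, limit=2):
    """Rank (name, category) offers by the score of their category."""
    ranked = sorted(
        offers,
        key=lambda offer: scores.get(offer[1], 0.0),
        reverse=True,
    )
    return [name for name, _category in ranked[:limit]]


# Explicit choices one person has made (hypothetical data).
events = [
    ("purchase", "camping"),
    ("click", "camping"),
    ("like", "jazz"),
    ("click", "golf"),
]
scores = score_preferences(events)

# Candidate offers, each tagged with the category it appeals to.
offers = [
    ("tent sale", "camping"),
    ("golf clubs", "golf"),
    ("concert tickets", "jazz"),
]
top_offers = select_offers(scores, offers)
print(scores)
print(top_offers)
```

<p style="margin-bottom: 0in;">Everything in the sketch is a flat list of one person&#8217;s events and a dictionary of scores. No edges between people appear anywhere, which is the sense in which the data is not naturally a graph.</p>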
<p style="margin-bottom: 0in;">This post has gotten pretty long, so I&#8217;ll stop here without spelling anything else out. But questions I still hope to address down the road include:</p>
<ul>
<li>How should Profile of Revealed Preferences data be stored?</li>
<li>Suppose we want to pass around 	derived results and not the raw data. How could we ever get to 	standards that would make such interchange realistic?</li>
<li>If we only have raw data to pass 	around, what are the implications for privacy, liberty, and the 	structure of the online industries?</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/06/08/profile-of-revealed-preferences/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
	</channel>
</rss>

