<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; GIS and geospatial</title>
	<atom:link href="http://www.dbms2.com/category/datatype/gis-geographic-geospatial/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Thu, 09 Feb 2012 09:21:51 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>Vertica as an analytic platform</title>
		<link>http://www.dbms2.com/2011/06/20/vertica-as-an-analytic-platform/</link>
		<comments>http://www.dbms2.com/2011/06/20/vertica-as-an-analytic-platform/#comments</comments>
		<pubDate>Mon, 20 Jun 2011 06:13:18 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[GIS and geospatial]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Workload management]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4775</guid>
		<description><![CDATA[Vertica 5.0 is coming out today, and delivering the down payment on Vertica&#8217;s analytic platform strategy. In Vertica lingo, there&#8217;s now a Vertica SDK (Software Development Kit), featuring Vertica UDT(F)s* (User-Defined Transform Functions). Vertica UDT syntax basics start:  In this release, Vertica UDTFs can only be written in C++. Other UDTF languages are promised. Otherwise, [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.dbms2.com/2011/06/20/vertica-release-5/">Vertica 5.0</a> is coming out today, and delivering the down payment on Vertica&#8217;s <a href="../../../../../2011/02/24/analytic-platforms/">analytic platform</a> strategy. In Vertica lingo, there&#8217;s now a Vertica SDK (Software Development Kit), featuring Vertica UDT(F)s* (User-Defined Transform Functions). Vertica UDT syntax basics start:  <span id="more-4775"></span></p>
<ul>
<li>In this release, Vertica UDTFs can only be      written in C++. Other UDTF languages are promised.</li>
<li>Otherwise, Vertica UDTFs sound pretty      flexible; in particular:
<ul>
<li>They can ingest and emit any number of       rows.</li>
<li>Their assumed schemas can be defined       programmatically (both input and output).</li>
</ul>
</li>
<li>Vertica UDTFs go in the SELECT clause,      not the FROM clause. I must confess to not grasping Vertica&#8217;s argument as      to why this provides great and important flexibility.</li>
<li>UDTF syntax mirrors SQL 99 Analytics      pretty closely.</li>
</ul>
<p><em>*It looks like the &#8220;F&#8221; is in the official name, but will often be dropped colloquially.</em></p>
<p>Other Vertica analytic platform highlights include:</p>
<ul>
<li>Proper integrated UDT workload      management is promised, and there&#8217;s a little bit of UDT workload      management already.*</li>
<li>Vertica is delivering some prebuilt functions      for aggregation, statistics, etc.</li>
<li>Vertica has cool <a href="http://www.dbms2.com/2011/06/20/temporal-data-time-series-and-imprecise-predicates/">temporal and time series features</a>.</li>
<li>Vertica&#8217;s geospatial support seems      pretty basic (circles and rectangles).</li>
<li>Vertica&#8217;s NDA plans moving forward are      pretty much as one would hope.</li>
</ul>
<p><em>*Vertica&#8217;s UDT workload management is RAM-only, and &#8220;honor system&#8221; &#8212; i.e., it assumes that the UDTFs declare their resource usage correctly, which Vertica says is the right way to handle in-process C++ routines.</em></p>
<p>Vertica also argues that fast-performing SQL in and of itself can amount to analytic functionality. For example, Vertica has tried to ensure that it offers great performance in the kinds of self-joins that are used in graph analysis. Since Vertica has plenty of customers among the kinds of Web and telco companies that use graph analysis today, I&#8217;m inclined to grant some benefit of the doubt here. That said, Vertica thinks 3 hops is plenty for most kinds of graph analysis people want to do, and I can think of applications (e.g. anti-terrorism) where that&#8217;s surely not the case.</p>
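<p>To make the self-join point concrete, here is a minimal sketch of a 3-hop query written as plain SQL self-joins. It runs on Python&#8217;s built-in sqlite3 module rather than on Vertica, and the <code>edges</code> table, its column names, and the sample data are all made up for illustration.</p>

```python
import sqlite3

# Hypothetical edge list: (src, dst) pairs in a small directed graph.
EDGES = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("a", "x")]

# Each extra hop is one more self-join of the edges table.
THREE_HOP_SQL = """
SELECT DISTINCT e1.src AS start_node, e3.dst AS end_node
FROM edges AS e1
JOIN edges AS e2 ON e2.src = e1.dst
JOIN edges AS e3 ON e3.src = e2.dst
ORDER BY start_node, end_node
"""


def three_hop_pairs(edges):
    """Return (start, end) pairs connected by a path of exactly three edges."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
        conn.executemany("INSERT INTO edges VALUES (?, ?)", edges)
        return conn.execute(THREE_HOP_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(three_hop_pairs(EDGES))
```

<p>Going past three hops in this style means adding one join per hop, or switching to a recursive query, which is where deeper graph analysis starts to get expensive.</p>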
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/06/20/vertica-as-an-analytic-platform/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Investigative analytics and derived data: Enzee Universe 2011 talk</title>
		<link>http://www.dbms2.com/2011/06/19/investigative-analytics-derived-data/</link>
		<comments>http://www.dbms2.com/2011/06/19/investigative-analytics-derived-data/#comments</comments>
		<pubDate>Sun, 19 Jun 2011 12:13:04 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[GIS and geospatial]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4747</guid>
		<description><![CDATA[I&#8217;ll be speaking Monday, June 20 at IBM Netezza&#8217;s Enzee Universe conference. Thus, as is my custom: I&#8217;m posting draft slides. I&#8217;m encouraging comment (especially in the short time window before I have to actually give the talk). I&#8217;m offering links below to more detail on various subjects covered in the talk. The talk concept [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ll be speaking Monday, June 20 at IBM Netezza&#8217;s <a href="http://www.netezza.com/userconference/abstract.html#620_1200">Enzee Universe</a> conference. Thus, as is my custom:</p>
<ul>
<li>I&#8217;m posting draft <a href="http://www.monash.com/uploads/Enzee-Universe-2011-Monash.ppt">slides</a>.</li>
<li>I&#8217;m encouraging comment (especially in the short time window before I have to actually give the talk).</li>
<li>I&#8217;m offering links below to more detail on various subjects covered in the talk.</li>
</ul>
<p>The talk concept started out as &#8220;advanced analytics&#8221; (as opposed to fast query, a subject amply covered in the rest of any Netezza event), as a lunch break in what is otherwise a detailed &#8220;best practices&#8221; session. So I suggested we constrain the subject by focusing on a specific application area &#8212; customer acquisition and retention, something of importance to almost any enterprise, and which exploits most areas of analytic technology. Then I actually prepared the slides &#8212; and guess what? The mix of subjects will be skewed somewhat more toward generalities than I first intended, specifically in the areas of <strong>investigative analytics </strong>and<strong> derived data. </strong>And, as always when I speak, I&#8217;ll try to raise consciousness about the issues of <a href="../../../../../2011/01/10/privacy-dangers-an-overview/">liberty and privacy</a>, our <a href="../../../../../2010/07/04/fair-data-use/">options as a society for addressing them</a>, and the crucial role we play as an industry in <a href="../../../../../2010/04/04/privacy-liberty-continued/">helping policymakers deal with these technologically-intense subjects</a>.</p>
<p>Slide 3 refers back to a post I made last December, saying there are <a href="../../../../../2011/01/03/the-six-useful-things-you-can-do-with-analytic-technology/">six useful things you can do with analytic technology</a>:</p>
<ul>
<li><strong>Operational      BI/Analytically-infused operational apps:</strong> You can make an immediate      decision.</li>
<li><strong>Planning      and budgeting:</strong> You can plan in      support of future decisions.</li>
<li><strong>Investigative      analytics (multiple disciplines):</strong> You can research, investigate, and analyze in support of future decisions.</li>
<li><strong>Business intelligence:</strong> You can monitor what’s going on, to see when it is necessary to decide, plan, or investigate.</li>
<li><strong>More      BI:</strong> You can communicate, to help      other people and organizations do these same things.</li>
<li><strong>DBMS,      ETL, and other &#8220;platform&#8221; technologies:</strong> You can provide support, in      technology or data gathering, for one of the other functions.</li>
</ul>
<p>Slide 4 observes that <a href="http://www.dbms2.com/2011/03/03/investigative-analytics/">investigative analytics</a>:</p>
<ul>
<li>Is the most rapidly advancing of the six areas &#8230;</li>
<li>&#8230; because it most directly exploits performance &amp; scalability.</li>
</ul>
<p>Slide 5 gives my simplest overview of investigative analytics technology to date:  <span id="more-4747"></span></p>
<ul>
<li>Fast query
<ul>
<li>Persistent storage (any data volume)</li>
<li>RAM (10s-100s of gigabytes, or more)</li>
</ul>
</li>
<li>Fast analytics
<ul>
<li>Predictive modeling</li>
<li>Transformation/tagging</li>
<li>Graph</li>
</ul>
</li>
</ul>
<p>Slide 6 points out that this is all supported by cheap data creation and acquisition, specifically in the area of <a href="http://www.dbms2.com/2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a>, which gets the full benefit of Moore&#8217;s Law.</p>
<p>Slides 7-13 point out how the example problem domain involves lots of analytic tasks performed on lots of kinds of data. Specific examples cited include <a href="http://www.dbms2.com/2011/04/14/attensity-update/">text analytics</a> and <a href="http://www.dbms2.com/2009/08/21/social-network-analysis-aka-relationship-analytics/">graph/relationship analytics</a>.</p>
<p>Slide 14 contains the punch line, so I&#8217;ll quote it in full:</p>
<blockquote><p>Derived data</p>
<ul>
<li>You can’t keep re-analyzing all that in raw form …</li>
<li>&#8230; so don’t.</li>
</ul>
<p><em>If you have one takeaway from this session, let it be the utter importance of derived data. </em></p></blockquote>
<p>Slide 16 lists kinds of <a href="http://www.dbms2.com/2011/05/30/another-category-of-derived-data/">derived data</a> that are important in the single application of reducing telco churn:</p>
<ul>
<li>Normalized data
<ul>
<li>Parsed/sessionized logs</li>
<li>Text/sentiment highlights</li>
<li>Social network graph(s)</li>
<li>Web de-anonymization</li>
<li>Household matching</li>
</ul>
</li>
<li>Scores and buckets
<ul>
<li>Demographic</li>
<li>Psychographic</li>
<li>Offer hot buttons</li>
<li>(Dis)satisfaction</li>
<li>Credit/fraud risk</li>
<li>Lifetime customer value</li>
<li>Influence on others!</li>
</ul>
</li>
</ul>
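<p>As one illustration of derived data, the sketch below sessionizes a raw click log once and keeps the result, so that later analyses can read sessions instead of re-parsing raw events. The 30-minute inactivity cutoff, the (user, timestamp) event format, and the function name are assumptions made for the example, not anything taken from the slides.</p>

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity cutoff, not from the slides


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Group (user, unix_timestamp) events into per-user sessions.

    Returns a dict mapping each user to a list of sessions. Each session is
    a list of timestamps; a new session starts after `gap` idle seconds.
    """
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)

    sessions = {}
    for user, stamps in by_user.items():
        stamps.sort()
        current = [stamps[0]]
        user_sessions = [current]
        for ts in stamps[1:]:
            if ts - current[-1] > gap:
                # Idle too long: close the current session, open a new one.
                current = [ts]
                user_sessions.append(current)
            else:
                current.append(ts)
        sessions[user] = user_sessions
    return sessions


if __name__ == "__main__":
    raw_log = [("u1", 0), ("u1", 100), ("u2", 50), ("u1", 5000)]
    print(sessionize(raw_log))
```

<p>The output is the kind of thing you would store as its own table and join against, rather than recompute for every churn model.</p>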
<p>And finally, Slide 17 is my first pass at best practices for dealing with derived data:</p>
<ul>
<li>Evolving data warehouse schema</li>
<li>Data marts
<ul>
<li>Physical or virtual</li>
<li>Inputs/outputs to “EDW”</li>
</ul>
</li>
<li>“Data science”
<ul>
<li>Research != production</li>
</ul>
</li>
<li>Multiple processing pipelines
<ul>
<li>Log parsing</li>
<li>Text</li>
<li>Predictive analytics</li>
<li>Generic ETL</li>
<li>Streaming “ETL”</li>
</ul>
</li>
</ul>
<p>That last list looks like a starting point for a whole set of interesting future posts.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/06/19/investigative-analytics-derived-data/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Netezza TwinFin i-Class overview</title>
		<link>http://www.dbms2.com/2011/04/17/netezza-twinfin-i-class-overview/</link>
		<comments>http://www.dbms2.com/2011/04/17/netezza-twinfin-i-class-overview/#comments</comments>
		<pubDate>Sun, 17 Apr 2011 13:59:59 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[GIS and geospatial]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4315</guid>
		<description><![CDATA[I have long complained about difficulties in discussing Netezza&#8217;s TwinFin i-Class analytic platform. But I&#8217;m ready now, and in the grand sweep of the product&#8217;s history I&#8217;m not even all that late. The Netezza i-Class timing story goes something like this: Netezza i-Class was first foreshadowed in February, 2010. Netezza i-Class customer testing started in [...]]]></description>
			<content:encoded><![CDATA[<p>I have long complained about <a href="../../../../../2010/10/10/it-can-be-hard-to-analyze-analytics/">difficulties</a> in discussing Netezza&#8217;s TwinFin i-Class analytic platform. But I&#8217;m ready now, and in the grand sweep of the product&#8217;s history I&#8217;m not even all that late. The Netezza i-Class timing story goes something like this:</p>
<ul>
<li><a href="../../../../../2010/02/22/netezza-twinfin/">Netezza      i-Class was first foreshadowed in February, 2010</a>.</li>
<li>Netezza i-Class customer testing started in October, 2010 or so. Netezza      i-Class evidently has been shipped to 4-5 partners and a single-digit      number of end-user organizations, spread across some usual-suspect      industries (financial services, telecom, and so on).</li>
<li>Netezza i-Class 1.0 general availability is still in the (near)      future.</li>
</ul>
<p>My advice to Netezza as to how it should describe TwinFin i-Class boils down to:  <span id="more-4315"></span></p>
<blockquote><p>1.  The Netezza platform has been enhanced in two major ways:</p>
<ul>
<li>There&#8217;s a good way to run all kinds of analytic processes. This is very flexible and powerful, but tightly integrated with the SQL engine even so.</li>
<li>You are supplying some specific high-performing, highly parallel, big-data analytic process building blocks. More precisely, you have greatly extended the set of such building blocks; you had some cool building blocks (notably Spatial) even before this.</li>
</ul>
<p>2.   There are four main ways to get at this:</p>
<ul>
<li>Extended SQL.</li>
<li>Programming, in a bunch of languages and paradigms, integrated into the SQL.</li>
<li>Partner code, with them doing the programming for you.</li>
</ul>
</blockquote>
<p>Some of the rah-rah words aside, that&#8217;s a pretty fair overview. Here&#8217;s more detail.</p>
<p>To refresh your memory: <strong>Netezza TwinFin i-Class functionality basics</strong> include, as best I can tell (and there&#8217;s some more detail at the links above):</p>
<ul>
<li>You can run processes in a usual-suspect set of languages on      Netezza i-Class (even Fortran).</li>
<li>One notable example is R; indeed, there&#8217;s an R client for talking      to Netezza TwinFin.</li>
<li>Netezza provides its own Hadoop implementation, which differs from      standard Hadoop implementations most notably in that it manages data      relationally via the usual Netezza DBMS, not in anything like HDFS.</li>
<li>Anything written in any language except C/C++ (or of course SQL)      &#8212; and in particular anything involving Hadoop &#8212; runs out-of-process      versus the Netezza DBMS. C/C++ can run in-process, for maximum      performance.*</li>
<li>There&#8217;s an assortment of parallelized mathematical analytic packages      built into Netezza i-Class. The matrix algebra ones are called nzMatrix. Most      of the rest are part of a collection called nzAnalytics. Often these are      implemented as stored procedures, as they may make multiple passes through      the data.</li>
<li>Netezza has thoughtfully ported thousands of analytic procedures      for you to the Netezza platform (in essence, the basic R/CRAN and GNU      libraries). These are not promised to be parallel on their own, but you&#8217;re      welcome to invoke an instance on each node and parallelize that way.</li>
</ul>
<p>I forgot to check, but I&#8217;m guessing any extension of workload management to cover non-DBMS processes won&#8217;t be in the first release of Netezza i-Class.</p>
<p><em>*However, Netezza says that if you can batch requests to return even just 500-1000 records at a time, the out-of-process performance penalty &#8212; which is based on wait time for transferring data between processes &#8212; becomes insignificant.</em></p>
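<p>A back-of-the-envelope way to see why batching helps: if every inter-process transfer carries a fixed wait and every record carries a fixed amount of useful work, the share of time lost to waiting shrinks as the batch grows. The cost model and the numbers below are my own illustrative assumptions, not Netezza figures.</p>

```python
def overhead_fraction(batch_size, wait_per_transfer, work_per_record):
    """Share of total time spent waiting on inter-process transfers.

    Assumes one fixed wait per batch plus a constant cost per record.
    """
    useful = batch_size * work_per_record
    return wait_per_transfer / (wait_per_transfer + useful)


if __name__ == "__main__":
    # Hypothetical costs in microseconds: 100 per transfer, 10 per record.
    for n in (1, 10, 500, 1000):
        print(n, round(overhead_fraction(n, 100.0, 10.0), 4))
```

<p>Under those made-up costs, the waiting share drops from most of the run time at one record per transfer to a percent or two at batches of 500-1000.</p>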
<p>None of that is particularly new information. But after a visit to Netezza on Tuesday, I&#8217;ve finally gotten some kind of handle on how i-Class is architected. <strong>Highlights of the Netezza i-Class architecture story,</strong> as I understand them, include:</p>
<ul>
<li>It all starts with UDtFs &#8212; User-Defined (table) Functions, which      are subject to the usual limitations.</li>
<li>To <strong>overcome the standard      limitations of UDtFs,</strong> Netezza built:
<ul>
<li>A set of UDtFs that, taken together, execute command-line       programs.</li>
<li>For each language (Java, Python, R, etc., and I think also C/C++),       a library that talks to the command-line executor. This library can talk       to multiple instances of the executor, so it&#8217;s not limited to a single       data stream. Similarly, it can persist past the life of a query.</li>
</ul>
</li>
<li>Similarly, Netezza built a C/C++ library that talks to the      command-line executor and also talks MPI (Message Passing Interface).
<ul>
<li>This has not yet been exposed outside Netezza.</li>
<li>Rather, MPI is used by nzMatrix, so that nzMatrix can invert (for       example) really, really big matrices.</li>
</ul>
</li>
<li>There are two* main ways to invoke all this.
<ul>
<li><strong>SQL.</strong> Any analytic process can be invoked via a SQL       UDtF. Netezza tends to use the term <strong>UDAP       (User-Defined Analytic Process)</strong> interchangeably for the process       itself and for the SQL UDtF that encapsulates it.</li>
<li>Netezza&#8217;s (interfaces to an) <strong>R</strong> client. More on that below.</li>
</ul>
</li>
<li>Netezza&#8217;s version of <strong>Hadoop </strong>is an important special case. The mappers and reducers you write in      Hadoop are UDAPs.</li>
</ul>
<p>I didn&#8217;t delve far enough into Netezza&#8217;s UDAP syntax to understand how it compares to, say, <a href="../../../../../2009/10/15/mapreduce-webinar-slides/">Aster&#8217;s SQL/MR</a>.</p>
<p><em>*From a marketing standpoint, Netezza might prefer to count partner code separately as a third way, but I&#8217;m focusing on the technology here, which is used by partners and end-user organizations alike.</em></p>
<p>Other Netezza/Hadoop notes include:</p>
<ul>
<li>Netezza has the usual kind of <a href="../../../../../2010/10/10/partnering-with-cloudera/">Cloudera      partnership</a>.</li>
<li>Since Netezza&#8217;s owner IBM has a Hadoop implementation, it seems obvious there will      be some partnership action with that too. But at this point it&#8217;s not so      far along.</li>
</ul>
<p>The Netezza TwinFin i-Class R story goes something like this:</p>
<ul>
<li>Assume you&#8217;re using R on a client. (I&#8217;m not sure whether Netezza      has an R client to give or recommend to you.)</li>
<li>There are three Netezza packages that change how R works, by      letting it use stuff on the Netezza box.
<ul>
<li>nzR translates between logical R memory structures and Netezza       tables. In particular, nzR allows R to run, not just in-memory, but       against the data on the Netezza box.</li>
<li>nzMatrix lets you do R matrix algebra against the data on the       Netezza box.</li>
<li>nzAnalytics lets you invoke various algorithms that run on the       Netezza box, against Netezza data.</li>
</ul>
</li>
</ul>
<p>A recently announced Netezza partnership with <a href="../../../../../2011/04/08/revolution-analytics-update/">Revolution Analytics</a> is meant to lead to Revolution replacing Netezza&#8217;s ports of R libraries with its own preferred distribution, and then supporting same.</p>
<p>Finally, there&#8217;s Netezza Spatial.</p>
<ul>
<li>Netezza claims multiple orders of magnitude of performance      advantage for Netezza Spatial vs. geospatial alternatives, which is always      a nice thing to be able to say.</li>
<li>Generally, <a href="../../../../../2008/09/23/peter-batty-on-netezza-spatial/">Netezza      Spatial</a> is now regarded as being part of i-Class.</li>
<li>However, the product timing and adoption comments above don&#8217;t      apply to Netezza Spatial.</li>
<li>Netezza Spatial has a couple of dedicated salespeople, and seems      to be well-liked by retailers.</li>
<li>Netezza surely wishes everybody would forget about some of the <a href="../../../../../2010/10/03/notes-and-links-october-3-2010/">rewrites and controversy</a> associated with Netezza Spatial.</li>
</ul>
<p>Perhaps there are yet more pieces of the Netezza TwinFin i-Class story I&#8217;m overlooking, but I hope I now have most of the major aspects at least partway right.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/04/17/netezza-twinfin-i-class-overview/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Notes, links, and comments January 20, 2011</title>
		<link>http://www.dbms2.com/2011/01/20/notes-links-and-comments-january-20-2010/</link>
		<comments>http://www.dbms2.com/2011/01/20/notes-links-and-comments-january-20-2010/#comments</comments>
		<pubDate>Thu, 20 Jan 2011 11:35:24 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[About this blog]]></category>
		<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[GIS and geospatial]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[MongoDB and 10gen]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[Vertica Systems]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=3508</guid>
		<description><![CDATA[I haven&#8217;t done a pure notes/links/comments post for a while. Let&#8217;s fix that now. (A bunch of saved-up links, however, did find their way into my recent privacy threats overview.) First and foremost, the fourth annual New England Database Summit (nee &#8220;Day&#8221;) is next week, specifically Friday, January 28. As per my posts in previous [...]]]></description>
			<content:encoded><![CDATA[<p>I haven&#8217;t done a pure notes/links/comments post for a while. Let&#8217;s fix that now. <em>(A bunch of saved-up links, however, did find their way into my recent <a href="http://www.dbms2.com/2011/01/10/privacy-dangers-an-overview/">privacy threats overview</a>.)</em></p>
<p>First and foremost, the fourth annual <a href="http://db.csail.mit.edu/nedbday11/">New England Database Summit</a> (nee &#8220;Day&#8221;) is next week, specifically Friday, January 28. As per my posts in <a href="http://www.dbms2.com/2009/11/25/new-england-database-summit-january-28-2010/">previous</a> <a href="http://www.dbms2.com/2009/01/26/new-england-database-day-this-friday-january-30/">years</a>, I think well of the event, which has a friendly, gathering-of-the-clan flavor. Registration is free, but the organizers would prefer that you register online by the end of this week, if you would be so kind.</p>
<p><em>The two things potentially wrong with the New England Database Summit are parking and the rush hour drive home afterwards. I would listen with interest to any suggestions about dinner plans. </em></p>
<p>One thing I hope to figure out at the Summit or before is what the hell is going on on Vertica&#8217;s blog or, for that matter, at <strong>Vertica.</strong> The recent Mike Stonebraker post that spawned a lot of <a href="http://www.dbms2.com/2011/01/12/mike-stonebraker-on-real-column-stores/">discussion and commentary</a> has disappeared. Meanwhile, Vertica has had three consecutive heads of marketing leave the company since June, and I don&#8217;t know who to talk to there any more.  <span id="more-3508"></span></p>
<p>Speaking of blog problems, we&#8217;ve had performance/reliability glitches here again. Melissa Bradshaw determined that the problem was an apparently activated WP Super Cache not actually caching anything. We should be OK now, so please let me know if there are further difficulties. One interesting step &#8212; it turns out that there&#8217;s <a href="http://wordpress.org/extend/plugins/sqlmon/">a WordPress plug-in that does automatic EXPLAINs</a> (if you&#8217;re the blog administrator).</p>
<p>Another interesting <a href="http://voltdb.com/blog/clarifications-cap-theorem-and-data-related-errors">Mike Stonebraker post</a> can be found (at least for now) over on the VoltDB blog. He continued his assault on the <strong>CAP Theorem, </strong>arguing that availability is an exaggerated concern when there are bug- or other human-error-driven kinds of outages, and also arguing that the concept of &#8220;partition tolerance&#8221; is misguided. Commenters pushed back, pointing out that in geographically distributed scenarios, the CAP Theorem sense of partitioning is quite a legitimate concern.</p>
<p>When I posted <a href="../2010/12/30/examples-and-definition-of-machine-generated-data/">an   expansive definition of machine-generated data</a> a few weeks ago, Daniel Abadi shot   back advocating a narrower one (see the comment thread, which includes a   link to his thoughtful post). The disagreement boils down to   conflicting intuitions as to whether the machine-data/true-human-data   ratio will keep growing rapidly, in hybrid cases such as web logs or   social gaming.</p>
<p>Dave McClure recently offered a survey of <a href="http://blog.500startups.com/2011/01/15/top-10-tech-investing-trends-for-2011/">hot startup investing themes</a>. High on his list were location-based services, which is a reminder to us all that geo-spatial data is becoming much more important. Ray Wang is savvy enough to understand <a href="http://blog.softwareinsider.org/2011/01/17/mondays-musings-why-im-unplugging-from-location-based-services-until-the-privacy-issue-is-resolved/">the privacy dangers location-based services cause</a>, but influential though Ray is, his view will probably remain in the minority. Machine-generated data and video each also make appearances on Dave&#8217;s  list.</p>
<p>And wait! I have even more links for you!  Several are taken from Thomas Houston&#8217;s  choices for <a href="http://www.switched.com/2010/12/30/best-technology-writing-of-2010/">The  Best Tech Writing of 2010</a>. He chose well. I recommend sampling his  list further.</p>
<ul>
<li>In <a href="http://www.nytimes.com/2011/01/02/business/02speed.html">an  article about new electronic exchanges</a>, the <em>New York Times</em> shared some numbers &#8212; 56% of trading volume &#8220;high speed&#8221; in stocks, 1/3  or so when looking at domestic futures, 0.1 milliseconds to do a NASDAQ  trade, 13 milliseconds for a trade that involves Chicago/NYC  communication, 60 milliseconds for NYC/Frankfurt. Slashdot offers <a href="http://hardware.slashdot.org/story/11/01/03/2127257/NJ-Server-Farms-Remake-the-US-Financial-Markets">photos  and other context</a>.</li>
<li>James Taylor caught up with once-hot <a href="http://jtonedm.com/2010/12/14/update-kxen/">KXEN</a>, and  evidently got the impression KXEN was focusing a lot of its efforts on  the tedious, time-consuming data-preparation side of modeling.</li>
<li><a href="http://innocuous.org/articles/2011/01/03/toddler-science-and-big-data/">Richard  Tibbetts</a> is being pretty funny on his blog.</li>
<li>(Slashdot) <a href="http://linux.slashdot.org/story/10/12/27/2025258/Putin-Orders-Russian-Move-To-GNULinux">The  Russian government seems to be getting into open source software in a  big way</a>. Well, <strong>PostgreSQL</strong> is already big in Russia (close to 1  million installations, I was once told), so this might conceivably add  some energy to its development.</li>
<li>In <a href="http://www.theregister.co.uk/2011/01/07/drupal_7_released/">Drupal  7</a>, Drupal now has &#8220;a built-in test environment, version upgrade  manager, and a database  abstraction layer for use with MariaDB, SQL  Server, MongoDB, Oracle,  MySQL, PostgreSQL, and SQLite.&#8221; That may  explain how <strong>MongoDB</strong> can hope to further penetrate the Drupal market.</li>
<li>The Boston Phoenix argues that <a href="http://thephoenix.com/Boston/news/113481-infopocalypse-the-cost-of-too-much-data/?page=1#TOPCONTENT">government  lacks the manpower, budget, and expertise to keep up with its  responsibilities in preserving and exposing information</a>. Fixing that  problem sounds like a pretty worthy open source development effort to  me.</li>
</ul>
<p>Finally:</p>
<ul>
<li>Clay Shirky reminded us that <a href="http://www.wired.com/magazine/2010/12/ff_ai_essay_airevolution/">modern    machine learning is what replaced old-style AI</a>.</li>
<li>Nominally reviewing a book he obviously disdains, Garry Kasparov &#8212;   in my opinion the most admirable world chess champion ever &#8212; <a href="http://www.nybooks.com/articles/archives/2010/feb/11/the-chess-master-and-the-computer/">surveyed   computer chess</a> in a quick, nontechnical way. The whole thing is a  bit wordy even so, so I&#8217;ll quote one part:</li>
</ul>
<blockquote><p>In 2005, the online chess-playing site Playchess.com hosted  what it  called a “freestyle” chess tournament in which anyone could  compete in  teams with other players or computers. &#8230; The surprise came  at the conclusion of the event. The winner  was revealed to be not a  grandmaster with a state-of-the-art PC but a pair of  amateur American chess players  using three computers at the same time.  Their skill at manipulating and  “coaching” their computers to look very  deeply into positions  effectively counteracted the superior chess  understanding of their  grandmaster opponents and the greater  computational power of other  participants. Weak human + machine +  better process was superior to a  strong computer alone and, more  remarkably, superior to a strong human +  machine + inferior process.</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/01/20/notes-links-and-comments-january-20-2010/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Privacy dangers &#8212; an overview</title>
		<link>http://www.dbms2.com/2011/01/10/privacy-dangers-an-overview/</link>
		<comments>http://www.dbms2.com/2011/01/10/privacy-dangers-an-overview/#comments</comments>
		<pubDate>Mon, 10 Jan 2011 16:14:53 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[GIS and geospatial]]></category>
		<category><![CDATA[Health care]]></category>
		<category><![CDATA[Liberty and privacy]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=3511</guid>
		<description><![CDATA[This post is the first of a series. The second one delves into the technology behind the most serious electronic privacy threats. The privacy discussion has gotten more active, and more complicated as well. A year ago, I still struggled to get people to pay attention to privacy concerns at all, at least in the [...]]]></description>
			<content:encoded><![CDATA[<p>
<em>This post is the first of a series. The second one delves into <a href="http://www.dbms2.com/2011/01/11/the-technology-of-privacy-threats/">the technology behind the most serious electronic privacy threats</a>.</em> </p>
<p>The privacy discussion has gotten more active, and more complicated as well. A year ago, I still struggled to get people to pay attention to privacy concerns at all, at least in the United States, with <a href="../../../../../2010/01/31/data-based-snooping-threat-libert/">my first public breakthrough</a> coming at the end of January. But much has changed since then.</p>
<p>On the <strong>commercial</strong> side, Facebook modified its privacy policies, garnering great press attention and an intense user backlash, leading to a quick <a href="http://www.nytimes.com/2010/05/27/technology/27facebook.html">partial retreat</a>. The <em>Wall Street Journal</em> then launched a long series of articles &#8212; 13 so far &#8212; recounting multiple kinds of <a href="http://online.wsj.com/public/page/what-they-know-digital-privacy.html">privacy threats</a>. Other media joined in, from <em><a href="http://blogs.forbes.com/kashmirhill/">Forbes</a> </em>to <em><a href="http://news.cnet.com/privacy-inc/">CNet</a>.</em> Various forms of US government rule-making to <a href="http://www.ecommerce-guide.com/solutions/advertising/article.php/3906466/Ad-Groups-to-Rally-Against-Federal-Privacy-Rules.htm">inhibit advertising-related tracking</a> have been proposed as an apparent result.</p>
<p>In the US, the <strong>government</strong> had a lively year as well. The Transportation Security Administration (TSA) rolled out what have been dubbed &#8220;porn scanners,&#8221; and backed them up with &#8220;<a href="http://twitter.com/#%21/julie_craig/status/4336151733207040">enhanced patdowns</a>.&#8221; For somebody who is, for example, female, young, a sex abuse survivor, and/or a follower of certain religions, those can be highly unpleasant, if not traumatic. Meanwhile, the Wikileaks/Cablegate events have spawned a government reaction whose scope is only beginning to be seen. A couple of &#8220;highlights&#8221; so far are some very nasty <a href="http://www.salon.com/news/opinion/glenn_greenwald/2010/11/09/manning/index.html">laptop seizures</a>, and the recent demand for <a href="http://twitter.com/#%21/wikileaks/status/23939621570215936">information on over 600,000 Twitter accounts</a>. (<a href="http://paranoia.dubfire.net/2011/01/thoughts-on-doj-wikileakstwitter-court.html">Christopher Soghoian</a> provided a detailed, nuanced legal analysis of same.)</p>
<p>At this point, it&#8217;s fair to say there are at least<strong> six different kinds of legitimate privacy fear. </strong><span id="more-3511"></span>The first five I have in mind are:</p>
<ul>
<li><strong>Governmental force. </strong>Your web browsing history can be used against you in a court of law. Profiling &#8212; perhaps based on big data analytics &#8212; might make you a target for law enforcement or anti-terrorism investigation as well. And between financial transaction records, communication records, physical movement data, and more, <strong>the government&#8217;s ability to gather and process information about people is effectively unlimited.</strong></li>
<li><strong>Private sector discrimination. </strong>There are many ways private companies could use detailed profiles &#8212; or just incriminating photos &#8212; to your detriment. They could fire you, not hire you, <a href="http://yro.slashdot.org/firehose.pl?op=view&amp;type=story&amp;sid=10/11/23/0344243">deny you insurance or credit</a>, or simply not give you their best possible price.</li>
<li><strong>Identity theft. </strong>Phishing happens.</li>
<li><strong>Social pressure and stalking.</strong> What your friends, neighbors, and classmates know about you can become a serious problem, especially if they gang up and cyber-bully you. That your violent ex-boyfriend can track you could be a bigger problem yet.</li>
<li><strong>Embarrassment and creep-out.</strong> Some people REALLY don&#8217;t like being viewed naked in busy public places. Some people are bothered by weirdly (in)appropriate advertisements. Maybe there&#8217;s no obvious tangible harm other than their uncomfortable feelings &#8212; but their discomfort is serious even so.</li>
</ul>
<p>The sixth is:</p>
<ul>
<li>Throwing out the baby with the bathwater in a<strong> backlash. </strong>I think medical privacy rules already cost lives, in <a href="../../../../../2010/10/10/xldb4-xldb/">research</a> and <a href="../../../../../2010/09/13/reconciling-medical-privacy-and-elder-care/">treatment</a> alike. <em>(Forbes</em> offers multiple examples of <a href="http://www.forbes.com/forbes/2009/0608/034-privacy-research-hidden-cost-of-privacy_2.html">life-saving research being stopped by HIPAA</a>.) I think there&#8217;s an inevitable tradeoff between intrusive physical security measures, such as the TSA&#8217;s various forms of security theater, and spooky behind-the-scenes profiling and surveillance. (Unfortunately, we can hardly live in a society without some kinds of security measures &#8212; and even if we could, voters would never allow it.) Proposals that would hamstring the internet advertising industry currently seem unlikely to pass &#8212; but how much uproar would it take before that changed?</li>
</ul>
<p>You probably knew most of that already. Even so, here are a number of examples and links.</p>
<p><a href="http://search.slashdot.org/story/11/01/04/2346201/Unwise-mdash-Search-History-of-Murder-Methods">Slashdot</a> has more on the Jensen case, in which a man was <strong>convicted of murder</strong> in no small part because his computer revealed that he had searched for information about murder methods, including one that duplicated the actual method of his wife&#8217;s death. The money quote in an <a href="http://www.wicourts.gov/ca/opinion/DisplayDocument.html?content=html&amp;seqNo=58315">appeals court decision</a> is in Section 37.1, which characterizes &#8220;computer evidence&#8221; as &#8220;probably the most incriminating other evidence.&#8221;</p>
<p>Obviously, that particular trial had the correct outcome. But would you want a court to hear about the research you did about minimizing your taxes? What about when you considered changing jobs, something that might be of interest both in employment and child custody litigation? How about your viewing of sexy images other than those of your spouse? And by the way &#8212; just what kinds of online viewing or writing should get you on a terrorist watch list?</p>
<p>It&#8217;s getting ever more practical to <strong>track our actions and movements.</strong> The privacy implications are potentially grave &#8212; would you want THAT kind of information used in court, or for decisions about your insurance? Accordingly, the Electronic Frontier Foundation coined the word &#8220;<a href="https://www.eff.org/deeplinks/2010/12/what-traitorware">traitorware</a>&#8221; to describe and call attention to (mainly consumer) devices that keep or even transmit records of your movements and doings. And mind-bogglingly, <em>Forbes</em> says that Sprint &#8220;<a href="http://www.forbes.com/forbes/2010/1206/technology-chris-soghoian-federal-trade-commission-agent-provocateur.html">turned customers&#8217; GPS information over to law enforcement 8 million times in a year</a>.&#8221;</p>
<p>But it&#8217;s not just your own devices. The <em>New York Times</em> recently wrote of uses for <a href="http://www.nytimes.com/2011/01/02/science/02see.html">smart video technology</a>. Enforcing good hand-washing procedure on doctors sounds great. But how about enforcing busy-ness on cubicle workers? Watching movie previewers&#8217; emotional reactions to specific scenes and characters seems kosher. But what if <a href="http://www.myce.com/news/going-to-the-movies-prepare-to-be-watched-while-you-watch-36138/">all movie-goers</a> are watched? Or what if this is done at supermarkets or <a href="http://news.discovery.com/tech/future-advertising-has-eye-on-you.html">vending machines</a>, perhaps even repeatedly over time? Could be creepy, huh? And by the way, while it could be a way to get you great discounts and service, it could also be a way to categorize you as unworthy of same.</p>
<p>Meanwhile, the <em>Washington Post</em> reminds us that <a href="http://www.washingtonpost.com/wp-dyn/content/article/2011/01/01/AR2011010102690_2.html">battlefield video surveillance technology</a> is getting ever better. Obviously, such technology could be used for domestic law enforcement as well. More immediately, the <em><a href="http://projects.washingtonpost.com/top-secret-america/articles/monitoring-america/">Washington Post</a></em> tells us of things like automatic recognition of license plates being used by police departments that repurpose anti-terrorism grants to general law enforcement. (Analogies to the 1980s habit of tying every grant proposal to the Strategic Defense Initiative &#8220;Star Wars&#8221; project are probably not misplaced.)</p>
<p>But the benefits foregone if we don&#8217;t use this technology are also scary. The biggest current examples are probably the ones I cited above &#8212; medical research, anti-terrorism, and the whole online advertising industry. But even more examples are coming down the pike, of clever electronic aids with privacy implications. Most notably &#8212; as the population ages, and nursing homes continue to be miserable (and costly) places, revolutionizing <a href="../../../../../2010/04/20/big-brother-watching-our-parents/">elder</a> <a href="http://www.nytimes.com/2010/07/29/garden/29parents.html">care</a> becomes ever more desirable. Well, sensor technology will soon be able to watch over old folks, keeping them safe(r) in their independence &#8212; but only if it is highly detailed or intrusive.</p>
<p>Also:</p>
<ul>
<li><em>Slashdot/</em><em>New      York Times</em> had an article on a <a href="http://news.slashdot.org/story/11/01/07/192201/California-County-Bans-SmartMeter-Installations">Marin      County ordinance</a> forbidding smart electrical meters. Part of the      reason is a San Francisco area backlash against the privacy implications      of having your electricity usage recorded moment to moment.</li>
<li><em>Slashdot</em> also points us at an article about <a href="http://tech.slashdot.org/story/11/01/03/203227/Using-Technology-To-Enforce-Good-Behavior">technology      you use for self-control</a> in various ways. In principle, however, the      &#8220;self&#8221; part is an optional feature.</li>
<li><em>Computing Now</em> offered an overview of <a href="http://www.computer.org/cms/Computer.org/ComputingNow/homepage/2010/1110/W_CO_OnBodySensing.pdf">on-body sensing technology</a>. Mainly it talks about gaming and medical applications, but it also suggests that people are receptive to putting tracking technology on themselves in crowds in return for certain emergency-response benefits.</li>
</ul>
<p>Finally, privacy-threatening observations of our web use, postings, and other internet communications are much too numerous to list. But some high/lowlights include:</p>
<ul>
<li><a href="http://www.readwriteweb.com/archives/web_user_interest_data.php">AddThis now claims to track 1 billion unique individuals online</a>.</li>
<li>In its transparency reports, Google publishes a count <a href="http://www.google.com/transparencyreport/governmentrequests/">of how often various governments ask it for data</a>. For the first half of 2010, leaders included the US (4287), Brazil (2435), India (1420), the UK (1343), and France (1017). Also at over 100 requests each were Argentina, Australia, Chile, Germany, Italy, Singapore, Spain, and Taiwan.</li>
<li>Verizon had <a href="http://www.nytimes.com/2011/01/10/technology/10privacy.html?_r=1">over 90,000 such requests</a> in 2007.</li>
<li>An internet service provider who successfully fought off an FBI information request nonetheless was kept under a <a href="http://www.wired.com/threatlevel/2010/08/nsl-gag-order-lifted/">gag order</a> for six years about the case.</li>
<li>The Obama Administration wants ever more <a href="http://www.washingtonpost.com/wp-dyn/content/article/2010/07/28/AR2010072806141.html?wpisrc=nl_headline">new legal powers</a> to get electronic information without court order or notice to the investigation&#8217;s subjects.</li>
<li>The UK government, or parts thereof, wants to <a href="http://yro.slashdot.org/firehose.pl?op=view&amp;type=story&amp;sid=10/10/20/1958209">track everything you do online or telephonically</a>, except &#8212; supposedly &#8212; the actual contents of your calls and messages. Similarly <a href="http://yro.slashdot.org/story/10/10/29/026250/Australias-Privacy-Boss-Slams-Govt-Data-Retention-Scheme?from=headlines">Australia</a>. There&#8217;s an <a href="http://www.crn.com.au/News/215135,lessons-learned-from-europes-data-retention-laws.aspx">EU directive</a> along those lines as well, although the EU also has some strong <a href="http://yro.slashdot.org/story/10/11/05/0411231/EU-Commission-Says-People-Have-a-Right-To-Be-Forgotten-Online?from=headlines">data retention safeguards</a>.</li>
<li>Various governments &#8212; India, the <a href="http://www.theregister.co.uk/2010/07/27/blackberry_uae/">United Arab Emirates</a>, and many more &#8212; have recently insisted that communication offerings such as BlackBerry provide them with back doors to snoop, on pain of being banned countrywide.</li>
<li>Informal methods of data acquisition are also used. For example, police departments use fake identities (with photos of hot girls, of course) to <a href="http://lacrossetribune.com/news/local/article_0ff40f7a-d4d1-11de-afb3-001cc4c002e0.html">friend young folks on Facebook to look for photos of underage drinking</a>.</li>
<li>A CNN survey article showed <a href="http://www.cnn.com/2010/TECH/web/12/13/end.of.privacy.intro/index.html">how extreme public data revelation can get</a>, using the example of Louis Gray.</li>
<li>For its own marketing reasons, DuckDuckGo offers a simple overview of <a href="http://donttrack.us/">how your web searches could/can be used against you</a>.</li>
<li>And as an example of how <a href="http://yro.slashdot.org/story/10/08/17/1229217/HP-CEOs-Browsing-History-Used-Against-Him">even petty internet uses can cause problems</a>, one of many things that got former HP CEO Mark Hurd into trouble was looking at racy online pictures of business associate Jodie Fisher.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/01/10/privacy-dangers-an-overview/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Notes and links October 3 2010</title>
		<link>http://www.dbms2.com/2010/10/03/notes-and-links-october-3-2010/</link>
		<comments>http://www.dbms2.com/2010/10/03/notes-and-links-october-3-2010/#comments</comments>
		<pubDate>Mon, 04 Oct 2010 01:10:41 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[GIS and geospatial]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[HP and Neoview]]></category>
		<category><![CDATA[Humor]]></category>
		<category><![CDATA[Kickfire]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=3103</guid>
		<description><![CDATA[Some notes, follow-up, and links before I head out to California:  HP hired a software guy, Leo Apotheker, as CEO, and a software guy with a liking for high-end services, Ray Lane, as chairman. Now a Leo Apotheker conference call suggests HP will increase its emphasis on software, and maybe high-end services as well. No [...]]]></description>
			<content:encoded><![CDATA[<p>Some notes, follow-up, and links before I head out to California:  <span id="more-3103"></span></p>
<ul>
<li>HP hired a software guy, Leo Apotheker, as CEO, and a software guy with a liking for high-end services, <a href="http://www.dbms2.com/2010/09/30/ray-lane-at-hp/">Ray Lane</a>, as chairman. Now a <a href="http://news.cnet.com/8301-31021_3-20018241-260.html">Leo Apotheker conference call</a> suggests HP will increase its emphasis on software, and maybe high-end services as well. No surprise. The article suggests, however,  that HP at this point has no clear strategy along these lines. That&#8217;s no surprise either.
<ul>
<li>And then there&#8217;s <a href="http://techcrunch.com/2010/10/01/oh-thank-god-oracle-has-a-new-rivalry/">Sarah Lacy&#8217;s take</a>, of which the interesting part reads &#8220;Separately, Andreessen has said that he thinks enterprise software is  ripe for disruption and his firm is going to fund a new generation of  Oracle-killers.&#8221;</li>
<li>I added more on <a href="http://www.softwarememories.com/2010/10/03/ray-lane-and-the-integration-of-software-and-consulting-at-oracle/">Ray Lane&#8217;s tenure at Oracle</a> over on <em><a href="http://www.softwarememories.com">Software Memories</a>.</em></li>
</ul>
</li>
<li>Netezza had a falling out with its original supplier of geospatial technology, Intelligent Integration Systems (IISi), and a lawsuit ensued over alleged copying. Now IISi has upped the stakes, essentially alleging that <a href="http://news.cnet.com/8301-27080_3-20017809-245.html">Netezza&#8217;s new geospatial software doesn&#8217;t work</a>, and that hence the CIA (evidently a Netezza user) is killing the wrong people via drone strikes. Netezza has wisely selected from its short list of acceptable responses, including versions of:
<ul>
<li>&#8220;All our classified customers are happy, and if we told you anything more than that, that would kind of defeat the purpose of being classified, wouldn&#8217;t it?&#8221;</li>
<li>&#8220;Copy, schmopy. A polygon is a polygon, and has been since Euclid.&#8221;</li>
<li>&#8220;We don&#8217;t have no steenking bugs.&#8221;</li>
</ul>
</li>
<li><a href="http://www.theregister.co.uk/2010/09/30/ocz_hdsl/">OCZ</a>, whoever they are, are trying to offer solid-state drives with PCIe-like bandwidth, which makes sense in that most observers except <a href="http://www.dbms2.com/2009/10/25/teradata-hardware-strategy-and-tactics/">Teradata</a> think the SAS interface isn&#8217;t fast enough for solid-state.</li>
<li>Speaking of Teradata, I&#8217;d been wondering somewhat as to why they just <a href="http://www.dbms2.com/2010/08/12/teradata-future-product-strategy/">shut down Kickfire&#8217;s product line after acquiring its assets</a>. Well, somebody who tested a Kickfire box told me that &#8212; <a href="http://www.dbms2.com/2008/04/18/kickfire-kicks-off/">great TPC-H results notwithstanding</a> &#8212; it turned out not to be nearly as fast as one might think, on real-life data sets that didn&#8217;t fit entirely into RAM. Hard though such a thing may be to imagine, it turns out that Kickfire&#8217;s TPC-H results were yet less significant than I thought they were.</li>
<li>I haven&#8217;t been looking at <em><a href="http://highscalability.com/">High Scalability</a></em> nearly as  much as I should, and that&#8217;s an understatement. It&#8217;s an outstanding  blog.</li>
<li>A couple of Google execs offered some <a href="http://www.mediapost.com/publications/?fa=Articles.showArticle&amp;art_aid=136685&amp;nid=119185">predictions   about the future of online advertising</a>, which might be of  interest  to anybody selling analytic (or text analytic) technology to  the  online/digital media market.</li>
<li>The BBC shows us <a href="http://www.bbc.co.uk/blogs/researchanddevelopment/2010/09/what-makes-zeitgeist-tick.shtml">what a single 133-character tweet plus its metadata look like in JSON</a>. (All 1582 characters.)</li>
<li><em>Huffington Post&#8217;s</em> CEO made some comments about <a href="http://paidcontent.org/article/419-huffposts-hippeau-social-informants-are-the-new-influencers/">influencers</a> which are additive to what I&#8217;ve been saying about <a href="http://www.strategicmessaging.com/influencers-long-tail-watts-godin/2008/02/02/">influencers</a> over on <em><a href="http://www.strategicmessaging.com">Strategic Messaging</a>.</em> (If you don&#8217;t read that &#8212; well, it&#8217;s my blog about marketing.)</li>
<li>Speaking of my other blogs, I&#8217;m not bothering to put up a separate post like this over on <em><a href="http://www.texttechologies.com">Text Technologies</a>, </em>where the posts I have put up recently tend to be (at least by my standards) relatively link-heavy anyway, but I have a couple more to share even so:
<ul>
<li>Paul Carr&#8217;s <a href="http://techcrunch.com/2010/10/01/eh-oh-well/">7 rules for TechCrunch/AOL employees</a> are really funny.</li>
<li>Some major search engine marketing experts are sounding <a href="http://sphinn.com/story/159876">defeatist about web spam</a>.</li>
</ul>
</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/10/03/notes-and-links-october-3-2010/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Various quick notes</title>
		<link>http://www.dbms2.com/2010/05/23/various-quick-notes/</link>
		<comments>http://www.dbms2.com/2010/05/23/various-quick-notes/#comments</comments>
		<pubDate>Sun, 23 May 2010 08:38:51 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[GIS and geospatial]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[SAP AG]]></category>
		<category><![CDATA[SAS Institute]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2173</guid>
		<description><![CDATA[As you might imagine, there are a lot of blog posts I&#8217;d like to write I never seem to get around to, or things I&#8217;d like to comment on that I don&#8217;t want to bother ever writing a full post about. In some cases I just tweet a comment or link and leave it at [...]]]></description>
			<content:encoded><![CDATA[<p>As you might imagine, there are a lot of blog posts I&#8217;d like to write I never seem to get around to, or things I&#8217;d like to comment on that I don&#8217;t want to bother ever writing a full post about. In some cases I just <a href="http://twitter.com/CurtMonash">tweet</a> a comment or link and leave it at that.</p>
<p>And it&#8217;s not going to get any better. Next week = the oft-postponed elder care trip. Then I&#8217;m back for a short week. Then I&#8217;m off on my quarterly visit to the SF area. Soon thereafter I&#8217;ll have a lot to do in connection with <a href="http://www.netezza.com/userconference/speakers.html">Enzee Universe</a>. And at that point another month will have gone by.</p>
<p>Anyhow:<span id="more-2173"></span></p>
<ul>
<li>Back in January, Oracle finally briefed me on <a href="http://www.dbms2.com/2010/01/22/oracle-database-hardware-strategy/">Exadata 2</a>. I also requested and got permission to post what I regarded as pretty interesting slides, then never got around to doing so. Well, <a href="http://www.monash.com/uploads/Exadata-slides-January-2010.pdf">here they are</a>. (Pay no attention to the word &#8220;Confidential&#8221;.)</li>
<li>Two people I have a lot of respect for, <a href="http://intelligent-enterprise.informationweek.com/blog/archives/2010/05/sap_and_inmemor.html">Cindi Howson</a> and <a href="http://intelligent-enterprise.informationweek.com/blog/archives/2010/05/quick_takes_on.html">Doug Henschen</a>, seem bullish on SAP&#8217;s in-memory NewDB efforts. But for a variety of execution reasons, I&#8217;m skeptical that this will matter for anything except SAP&#8217;s analytics suite. I.e., I don&#8217;t think anybody much except SAP will write OLTP apps to it, and I don&#8217;t think that, without OLTP apps being written to it, it&#8217;s much more than Business Objects&#8217; answer to QlikView.</li>
<li>I just learned that <a href="http://www.thestreet.com/story/10640248/1/tech-rights-give-companies-upper-hand.html">Netezza&#8217;s previous geospatial technology didn&#8217;t get ported to TwinFin</a>. However, <a href="http://www.netezza.com/releases/2010/release021710.htm">Netezza obviously found a geospatial alternative</a>.</li>
</ul>
<p>I&#8217;m beginning to make a habit of asking vendors for a postable version of their slide decks. <a href="http://www.dbms2.com/2010/05/23/sybase-iq-15/">Sybase IQ</a> is another example.</p>
<ul>
<li>Google is doing something called <a href="http://googlecode.blogspot.com/2010/05/bigquery-and-prediction-api-get-more.html">BigQuery</a> that is &#8220;SQL-like&#8221; for big data analytics. I don&#8217;t know anything about it.</li>
<li>I also don&#8217;t know anything about <a href="http://www-01.ibm.com/software/ebusiness/jstart/bigsheets/">IBM BigSheets</a> yet. It sounds something like <a href="http://www.dbms2.com/2010/04/16/introduction-to-datameer/">Datameer</a>, but that could be way off the mark.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/05/23/various-quick-notes/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Notes on SciDB and scientific data management</title>
		<link>http://www.dbms2.com/2010/05/22/scidb-and-scientific-database-management/</link>
		<comments>http://www.dbms2.com/2010/05/22/scidb-and-scientific-database-management/#comments</comments>
		<pubDate>Sat, 22 May 2010 08:04:24 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[GIS and geospatial]]></category>
		<category><![CDATA[Microsoft and SQL*Server]]></category>
		<category><![CDATA[SciDB]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2178</guid>
		<description><![CDATA[I firmly believe that, as a community, we should look for ways to support scientific data management and related analytics. That&#8217;s why, for example, I went to XLDB3 in Lyon, France at my own expense. Eight months ago, I wrote about issues in scientific data management. Here&#8217;s some of what has transpired since then. The [...]]]></description>
			<content:encoded><![CDATA[<p>I firmly believe that, as a community, we should look for ways to support scientific data management and related analytics. That&#8217;s why, for example, I went to XLDB3 in Lyon, France at my own expense. Eight months ago, I wrote about <a href="http://www.dbms2.com/2009/10/03/issues-in-scientific-data-management/">issues in scientific data management</a>. Here&#8217;s some of what has transpired since then.</p>
<p>The main new activity I know of has been in the open source <a href="http://www.scidb.org/">SciDB</a> project.   <span id="more-2178"></span></p>
<ul>
<li>A company called Zetics has been started to commercialize SciDB. As of now, the entire staff seems to be CEO Marilyn Matz, techie Paul Brown, and part of Mike Stonebraker. Marilyn says Zetics has some venture capital, but even under NDA didn&#8217;t tell me who it was from. Zetics does not have its own web site.</li>
<li>Marilyn tells me there are 20-25 contributors to SciDB, led by Paul Brown and Mike Stonebraker. Brown is full-time. Persistent Systems has been donating the efforts of a few of its employees. Some <a href="http://www.lsst.org/lsst">LSST</a> folks have been doing SciDB work backed by grant money. Most or all of the rest seem to be purer volunteers. Some Russians have been particularly active.</li>
<li>Release 0.5 of SciDB is expected in June. Release 1.0 is expected in September. This is a rewrite; prior demo code has been scrapped. Perhaps not coincidentally, it&#8217;s also a small slip from prior project plans.</li>
<li>The array data model is an example of what&#8217;s being implemented first. (Duh &#8212; you can&#8217;t have a DBMS without a data model.) Support for uncertainty is an example of what&#8217;s been deferred until later.</li>
<li>As has been clear since XLDB3 last August, one major target market for SciDB is genomic research.</li>
<li>It&#8217;s obvious that the oil and gas industry, with all its geospatial data, should be interested in SciDB. But there&#8217;s not much activity in that regard; outreach is evidently needed. If you can think of somebody in that sector (or anywhere else) who should be alerted to SciDB, please ping them.</li>
<li>Interest from web analytics users in SciDB seems to have receded a bit from the days when eBay almost funded the project.</li>
</ul>
<p>In other scientific data management news,</p>
<ul>
<li>Microsoft put out a book called <a href="http://research.microsoft.com/en-us/collaboration/fourthparadigm/">The Fourth Paradigm</a> on scientific database management. The whole thing can be downloaded, very officially, as a giant PDF. I think it&#8217;s worth skimming. I don&#8217;t think it&#8217;s worth actually reading. (I did read it.)</li>
<li><a href="http://www-conf.slac.stanford.edu/xldb/">XLDB4</a> will be at Stanford October 5-7. Unlike prior XLDBs, it will have an open (i.e., no invitation required) part.</li>
</ul>
<p>Finally, you surely are aware of the whole &#8220;Climategate&#8221; mess, in which major climate researchers&#8217; email was hacked and many unkind conclusions were drawn. Well, one of the most technical parts of the disclosure was in a long series of Read Me files, in which an unfortunate programmer lamented about <a href="http://di2.nu/foia/HARRY_READ_ME-20.html">the difficulty of reconstructing published results from files at hand</a>. These turned out to illustrate a classic problem that SciDB or alternatives are meant to solve:</p>
<ul>
<li>Raw data was impossible to use without various adjustments to regularize it (the word &#8220;regridding&#8221; comes up a lot, for example). Massaging was needed before analytics could be done on it.</li>
<li>The raw data was thrown out or lost, and could not be reconstructed (why they couldn&#8217;t have asked the suppliers of the data to give it to them again was unclear in this case, since it wasn&#8217;t original experimental data).</li>
<li>It was thus impossible to massage the data in any new or improved way.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/05/22/scidb-and-scientific-database-management/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>This week at the Teradata Partners user conference</title>
		<link>http://www.dbms2.com/2009/10/19/teradata-partners-2009/</link>
		<comments>http://www.dbms2.com/2009/10/19/teradata-partners-2009/#comments</comments>
		<pubDate>Mon, 19 Oct 2009 13:07:31 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Data integration and middleware]]></category>
		<category><![CDATA[Data types]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[GIS and geospatial]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Storage]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1150</guid>
		<description><![CDATA[Teradata tells me that its press embargoes are ending at 9:00 this morning. Here are some highlights of what&#8217;s going on, although names, dates, and details will have to await conversations and press releases this week. Teradata is productizing “private cloud,” under names including “Teradata Enterprise Analytics Cloud,” “Teradata Agile Analytics Cloud,” and “Teradata Elastic [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Teradata tells me that its press embargoes are ending at 9:00 this morning. Here are some highlights of what&#8217;s going on, although names, dates, and details will have to await conversations and press releases this week.</p>
<ul>
<li><strong>Teradata is productizing “private cloud,”</strong> under names including “Teradata Enterprise Analytics Cloud,” “Teradata Agile Analytics Cloud,” and “Teradata Elastic Mart Builder.” I.e., Teradata hopes to leapfrog Greenplum in its “<a href="../2009/06/08/the-future-of-data-marts/">Enterprise Data Cloud</a>” strategy. This is only fair, in that Greenplum lifted the idea from Teradata and eBay in the first place. It also provides major support for what I think is an extremely sensible trend. Give or take issues of who announces and ships what a couple of months before or after a competitor, my early thinking is that the main differences between Greenplum and Teradata in this regard will be:
<ul>
<li>Virtual as opposed to just 	physical data marts, based on robust workload management software. 	(Advantage: Teradata)</li>
<li>Pricing, deployment options. 	(Advantage: Greenplum)</li>
<li>Features that don&#8217;t directly 	relate to enterprise/private cloud. (Advantage: Either, often 	Teradata.)</li>
</ul>
</li>
<li><strong>Teradata is generally 	strengthening its data movement technology</strong>, e.g. for making 	various appliances work in sync. I&#8217;m not too clear yet on the 	details of that. I think this is what Teradata&#8217;s phrase “ecosystem 	management” refers to.</li>
<li><strong>Teradata is (pre-)announcing, at least as a statement of direction, an appliance based on solid-state drives (SSDs). </strong>I&#8217;ve thought for a while that Teradata was a leader in thinking through <a href="../2008/10/23/teradata-solid-state-drives-ssd/">the issues around solid-state memory in data warehousing</a>, so it makes sense that they&#8217;re among the leaders in actually coming to market as well. I plan to say more after meeting with, e.g., Carson Schmidt.</li>
<li><strong>Teradata has achieved a 300%ish 	speed-up in geospatial processing</strong>. I gather this is largely a 	byproduct of the parallel analytics work Teradata did around 	strengthening its SAS integration. However, there don&#8217;t seem to be a 	lot of Teradata geospatial users yet.</li>
<li><span>Teradata 	Express, </span><strong>Teradata&#8217;s free Windows-based crippleware, is being 	ported to Amazon EC2 and VMware</strong> as well. Presumably to avoid 	cannibalizing Teradata product sales, there are quite a few 	limitations on Teradata Express, including system capacity, database 	size, and “no production use.”</li>
<li><strong>Teradata continues to extend its optimizations to handle queries issued by business intelligence tools. </strong><span>Previously, the focus of what Teradata discussed in this regard was <a href="../2009/08/02/teradata-13-focuses-on-advanced-analytic-performance/">query rewrite</a>. But soon automatic recommendation and creation of Aggregate Join Indexes (i.e., materialized views) will be included as well.</span></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/19/teradata-partners-2009/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Issues in scientific data management</title>
		<link>http://www.dbms2.com/2009/10/03/issues-in-scientific-data-management/</link>
		<comments>http://www.dbms2.com/2009/10/03/issues-in-scientific-data-management/#comments</comments>
		<pubDate>Sat, 03 Oct 2009 05:51:50 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Data integration and middleware]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[GIS and geospatial]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[SciDB]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=993</guid>
		<description><![CDATA[In the opinion of the leaders of the XLDB and SciDB efforts, key requirements for scientific data management include: A data model based on multidimensional arrays, not sets of tuples A storage model based on versions and not update in place Built-in support for provenance (lineage), workflows, and uncertainty Scalability to 100s of petabytes and [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">In the opinion of the leaders of <a href="../2009/09/12/xldb-scid/">the XLDB and SciDB efforts</a>, key <a href="http://scidb.org/about/history.php">requirements for scientific data management</a> include:</p>
<ul>
<li>A data model based on <strong>multidimensional arrays,</strong> not 	sets of tuples</li>
<li>A storage model based on <strong>versions</strong> and not update in 	place</li>
<li><span>Built-in 	support for </span><strong>provenance (lineage), workflows, and 	uncertainty</strong></li>
<li>Scalability to <strong>100s of 	petabytes and 1,000s of nodes</strong> with high degrees of <strong>tolerance 	to failures</strong></li>
<li>Support for <strong>&#8220;external&#8221; 	data objects</strong> so that data sets can be queried and manipulated 	without ever having to be loaded into the database</li>
<li><strong>Open source</strong> in order to foster a community of contributors and to ensure that data is never &#8220;locked up,&#8221; a critical requirement for scientists</li>
</ul>
<p style="margin-bottom: 0in;">However:<span id="more-993"></span></p>
<ul>
<li><strong>I think that&#8217;s a dream/wish 	list.</strong> A lot of good could be done without meeting each of those 	six requirements in full.</li>
<li>I think at least some of the 	XLDB/SciDB leaders realize this.</li>
<li>In my opinion, <strong>a highly useful 	subset of the dream/wish list is achievable in the 	reasonably-intermediate future,</strong> in either of two ways:
<ul>
<li><strong>Through a Hadoop-centric open 	source effort,</strong> especially since <a href="../2009/09/13/hadoopdb/">HadoopDB 	opens up the possibility of letting DBMS creators offload MPP 	scaling challenges to somebody else</a>.</li>
<li><strong>From commercial MPP 	software-only</strong><span> (as opposed to 	appliance) </span><strong>DBMS vendors.</strong> I think they can develop the 	needed technology. I also think it could be in their business 	interest to make licensing arrangements of the sort that the 	scientific and research communities would need.</li>
</ul>
</li>
<li>Talking about &#8220;scientific&#8221; big data is unhelpfully vague. Let&#8217;s just focus on <strong>multidimensional measurement- or model-centric data,</strong> from disciplines such as seismology (under the Earth&#8217;s surface), climatology (over the surface), and astronomy (outer space). That would also include disciplines whose three-spatial-dimensions-plus-time data comes from inside a laboratory or other man-made environment, such as high-energy physics, fluid dynamics, and so on.</li>
<li>One place in all that where there 	should be a commercial-company market is in <strong>oil/gas extraction.</strong><span> And by the way, the energy industry is increasing its uptake of data 	warehousing technology faster these days than any other sector I can 	think of, except perhaps for &#8230;</span></li>
<li>&#8230; web companies that do <strong>log 	file analysis.</strong> Facebook&#8217;s log data has arrays-within-arrays 	reminiscent of the scientists&#8217;. eBay has been a major backer of 	XLDB/SciDB. It&#8217;s far from fully known yet just how much overlap 	there is between log-file-analyzers&#8217; data management needs and those 	of big-data scientists. But there clearly are at least some 	commonalities.</li>
<li>I don&#8217;t get the impression that 	scientists focused on modeling &#8212; e.g. climate-predictors &#8212; have 	been big participants in XLDB. That&#8217;s a pity for at least two 	reasons. First, modeling is at the heart of some of the most 	important global issues scientists address (e.g., climate change). 	Second, it might be an area of particularly rich overlap with 	commercial data management needs.</li>
</ul>
<p style="margin-bottom: 0in;">Now let&#8217;s step back and consider approximately what is meant by the requirements listed above.</p>
<ul>
<li>The requirements for an <strong>array</strong> structure are evidently pretty deep. You can glean some of the 	reasons from the <a href="http://scidb.org/use/">scientific database 	use cases</a> posted on the SciDB website. In particular:
<ul>
<li>Coordinate data naturally fits 	into arrays.</li>
<li>Coordinate data also naturally 	fits into geospatial ranges and the like.</li>
<li>The &#8220;grid&#8221; for the array 	can be imprecise &#8212; or calculated via transformation &#8212; for a whole 	lot of different reasons.</li>
<li>Different measurements may be 	available for different points in the array. (I think this may be 	the essence of the array-valued-arrays requirement.)</li>
</ul>
</li>
<li>Some reasons scientists want <strong>versioning</strong> and support for <strong>data provenance</strong> are pretty obvious: you never want to lose the record of what the instrument readings said, or were ever believed to say. But it goes further. Data is &#8220;cooked&#8221; (i.e., transformed/reduced) and stored in huge volumes. So you&#8217;d like to be able to go back to the raw data later and re-cook it.</li>
<li>The <strong>workflow</strong> requirement seems to stem in many cases from data movement needs, which in turn stem in some cases from political issues. I haven&#8217;t yet understood why workflow would actually need to be baked into a scientific DBMS.</li>
<li>By the time the database 	management systems we&#8217;re talking about could conceivably be ready, 	the need will be at least in the 10s of petabytes. <strong>100s of 	petabytes</strong> is a reasonable design goal.</li>
<li>Not that I&#8217;ve run any numbers on 	the matter, but it seems plausible that <a href="../2009/09/13/fault-tolerant-queries/"><strong>query 	fault-tolerance</strong></a> will be needed, at least in some cases.</li>
<li>In many sciences (astronomy seems 	to be an exception), the default choice is to keep data in files 	rather than a DBMS. For example, CERN has a 10 terabyte or so Oracle 	database holding just the metadata for a vastly larger collection of 	data files. Even if the pendulum swings toward greater use of DBMS, 	the ongoing need for <strong>external file access</strong> is pretty obvious.</li>
<li>I suspect that the insistence on 	<strong>open source</strong> is part legitimate, part knee-jerk excessive.
<ul>
<li>&#8220;Free&#8221; is the best 	possible price, of course.</li>
<li>Beyond cash cost, scientists want data access to be free of licensing encumbrance. There are two main reasons. First, people might want to manage subsets or copies of data away from its central repository, for a variety of reasons, not all of which are easy to work around. So any closed-source licensing would have to be very comprehensive (e.g., global or at least continent-wide &#8220;site&#8221; licensing).</li>
<li>Second, they want assurance that 	data will always be accessible, even if licenses expire. That seems 	a little overwrought. Yes, moving data from one multi-petabyte 	repository to another could be a bit slow. But it&#8217;s not an 	eventuality to panic about.</li>
<li>As for actual community 	development &#8212; scientists sure have a variety of exotic data 	management needs. But I&#8217;m not sure how much talent or resource there 	is among scientists to do true DBMS development (as opposed to, say, 	refining some UDFs). Yes, one XLDB attendee was both an astronomer 	and a PostgreSQL Major Contributor, but he seemed like an exception. 	On the other hand, it&#8217;s not entirely implausible that, in the right 	framework, some people with database talent could be recruited to 	donate some time to the general advancement of science.</li>
</ul>
</li>
<li>I don&#8217;t know much about management 	of <strong>uncertain data,</strong> and will duck that subject for now.</li>
</ul>
<p><em><strong>Related links</strong></em></p>
<ul>
<li><a href="http://www.dbms2.com/2009/10/03/martin-kersten-on-issues-in-scientific-data-management/">Martin Kersten&#8217;s response</a></li>
<li><a href="http://www.dbms2.com/2009/10/04/jacek-becla-on-issues-in-scientific-data-management/">Jacek Becla&#8217;s response</a></li>
<li><a href="http://www.dbms2.com/2009/10/10/scientific-data-sharing/">Scientific data sharing</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/03/issues-in-scientific-data-management/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>

