<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; Application areas</title>
	<atom:link href="http://www.dbms2.com/category/applications/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Thu, 02 Sep 2010 09:06:44 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Big Data is Watching You!</title>
		<link>http://www.dbms2.com/2010/08/11/big-data-is-watching-you/</link>
		<comments>http://www.dbms2.com/2010/08/11/big-data-is-watching-you/#comments</comments>
		<pubDate>Wed, 11 Aug 2010 05:30:22 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2760</guid>
		<description><![CDATA[There&#8217;s a boom in large-scale analytics. The subjects of this analysis may be categorized as:

People
Financial trades
Electronic networks
Everything else

The most varied, interesting, and valuable of those four categories is the first one.

That may change some day, with the growing importance of machine-generated data, and of big-data science in particular. But I think it&#8217;s a fair assessment [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">There&#8217;s a boom in large-scale analytics. The subjects of this analysis may be categorized as:</p>
<ul>
<li>People</li>
<li>Financial trades</li>
<li>Electronic networks</li>
<li>Everything else</li>
</ul>
<p style="margin-bottom: 0in;">The most varied, interesting, and valuable of those four categories is the first one.</p>
<p><span id="more-2760"></span></p>
<p style="margin-bottom: 0in;"><em>That may change some day, with the growing importance of <a href="http://www.dbms2.com/2010/04/08/machine-generated-data-example/" >machine-generated data</a>, and of <a href="http://www.dbms2.com/2009/10/03/issues-in-scientific-data-management/" >big-data science</a> in particular. But I think it&#8217;s a fair assessment at present, and for at least the next few years.</em></p>
<p style="margin-bottom: 0in;">Some of the most interesting use cases are concentrated in the areas of identifying individuals, groups of people, or behaviors of (groups of) people. For example:</p>
<ul>
<li>comScore works hard to <strong>identify individual web surfers</strong> – i.e., to <strong>deanonymize</strong> them – even though they may have given incomplete or false personal information.</li>
<li>Other companies at least try to 	figure out <strong>which information in a user&#8217;s profile is unreliable,</strong> so as to classify them better. (Yes, there are 62-year-old 	video-game-obsessed Lady Gaga fans, but that&#8217;s generally not the way 	to bet.)</li>
<li>Multiple telecom vendors try to 	identify who their <strong>most influential customers</strong> are (to a first 	approximation, they&#8217;re the ones most often called by the most 	people, but it surely gets more sophisticated than that). This 	information is then used to reduce churn, either by working hard to 	retain those users, or – if they do churn – to move very fast to 	retain the business from their friends.</li>
<li>Other kinds of companies do 	similar kinds of analysis, to the extent that they have enough of a 	social graph to do so. (This application is a case where the term 	“<a href="http://www.dbms2.com/2010/06/08/profile-of-revealed-preferences/" >social graph</a>” is not a misnomer.)</li>
<li><strong>Turing detectives</strong> (I just 	coined that phrase) try to determine whether users are humans or 	bots.</li>
<li>Central to detecting <strong>insurance 	fraud</strong> is identifying suspiciously close connections between 	claimants, service providers, and so on.</li>
<li>Identifying groups of people is also important in flagging <strong>insider trading.</strong> Even more important are other kinds of analysis, along the lines of “is this normal innocent trading behavior?”</li>
<li>Intelligence agencies try to detect networks of <strong>terrorists</strong> and their sympathizers. They further try to identify unusual patterns of communication or meetings along those networks that might indicate terrorist acts are being planned. (Civilian law enforcement agencies can use similar techniques.)</li>
</ul>
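<p style="margin-bottom: 0in;">The telecom “first approximation” above (the most influential customers are the ones called by the most people) can be sketched in a few lines. This is only an illustration: the function name, the (caller, callee) record layout, and the sample data are all hypothetical, and real churn models surely use far more signal than a count of distinct callers.</p>

```python
from collections import defaultdict


def rank_by_distinct_callers(call_records):
    """Rank customers by how many distinct people called them."""
    callers_of = defaultdict(set)
    for caller, callee in call_records:
        if caller != callee:
            callers_of[callee].add(caller)
    # Sort by distinct-caller count (descending), then by name for stable output.
    return sorted(
        ((callee, len(callers)) for callee, callers in callers_of.items()),
        key=lambda pair: (-pair[1], pair[0]),
    )


# Hypothetical call detail records as (caller, callee) pairs.
calls = [
    ("ann", "dave"), ("bob", "dave"), ("cat", "dave"),
    ("ann", "bob"), ("ann", "bob"),  # repeat calls from one person count once
    ("dave", "cat"),
]
ranking = rank_by_distinct_callers(calls)
print(ranking)
```

<p style="margin-bottom: 0in;">Counting distinct callers rather than raw call volume keeps one chatty friend from making somebody look influential.</p>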
<p style="margin-bottom: 0in; font-weight: normal;">In most cases, the analysis and/or run-time execution of the relevant models is done with the help of analytic DBMS. Other technologies that come into play include non-DBMS MapReduce (Hadoop), graph engines, and CEP (Complex Event Processing). The vendor most heavily represented on that list is probably Aster Data, because:</p>
<ul>
<li>Aster Data is 	focused on hard-core analytics.</li>
<li>I talk a lot 	with Aster Data, and in particular had a long, detailed use-cases 	discussion with them last week.</li>
<li><span style="font-weight: normal;">The 	comScore example happens to come from a speaker at </span><a href="http://www.dbms2.com/2010/05/07/implications-onew-analytic-technology/" ><span style="font-weight: normal;">an 	Aster event</span></a><span style="font-weight: normal;"> I also 	participated in.</span></li>
</ul>
<p style="margin-bottom: 0in;">And by the way, all this only scratches the surface of what will be possible down the road. It&#8217;s based mainly on where you live, what you purchase, how you behave on websites, and who you communicate with. <a href="../2010/07/04/fair-data-use/">Other kinds of data, which could be used to be yet more intrusive</a>, generally aren&#8217;t involved.</p>
<p style="margin-bottom: 0in;">I actually have two points in drawing up this list. One is golly-gee-whiz about how a lot of analytically sophisticated applications are actually getting into production. The other is to highlight the privacy and liberty threats If This Goes On Unchecked (which is why I didn&#8217;t include some other less-people-focused examples). There&#8217;s also a related danger that, to the extent we don&#8217;t get some smart regulations to keep us safe(r), we&#8217;ll get a bunch of stupid regulations instead.</p>
<p style="margin-bottom: 0in;">The Analytic Era has only just begun.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/08/11/big-data-is-watching-you/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Nested data structures keep coming up, especially for log files</title>
		<link>http://www.dbms2.com/2010/07/31/nested-data-structures-keep-coming-up-especially-for-log-files/</link>
		<comments>http://www.dbms2.com/2010/07/31/nested-data-structures-keep-coming-up-especially-for-log-files/#comments</comments>
		<pubDate>Sat, 31 Jul 2010 10:42:06 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Theory and architecture]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2723</guid>
		<description><![CDATA[Nested data structures have come up several times now, almost always in the context of log files.

Google has published about a project called Dremel. Per Tasso Argyros, one of Dremel&#8217;s key concepts is nested data structures.
Those arrays that the XLDB/SciDB folks keep talking about are meant to be nested data structures. Scientific data is of [...]]]></description>
			<content:encoded><![CDATA[<p>Nested data structures have come up several times now, almost always in the context of log files.</p>
<ul>
<li>Google has published about a project called <a href="http://www.asterdata.com/blog/index.php/2010/07/19/google%E2%80%99s-dremel-%E2%80%93-or-can-mapreduce-itself-handle-fast-interactive-querying/" onclick="javascript:pageTracker._trackPageview('/www.asterdata.com');">Dremel</a>. Per Tasso Argyros, one of Dremel&#8217;s key concepts is nested data structures.</li>
<li>Those <a href="http://www.dbms2.com/2009/10/03/issues-in-scientific-data-management/" >arrays</a> that the XLDB/SciDB folks keep talking about are meant to be nested data structures. Scientific data is of course log-oriented. <a href="http://www.dbms2.com/2010/05/22/scidb-and-scientific-database-management/" >eBay was very interested in that project too</a>.</li>
<li>Facebook&#8217;s log files have a big nested data structure flavor.</li>
</ul>
<p>I don&#8217;t have a grasp yet on what exactly is happening here, but it&#8217;s something.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/07/31/nested-data-structures-keep-coming-up-especially-for-log-files/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Cassandra technical overview</title>
		<link>http://www.dbms2.com/2010/07/06/cassandra-technical-overview/</link>
		<comments>http://www.dbms2.com/2010/07/06/cassandra-technical-overview/#comments</comments>
		<pubDate>Tue, 06 Jul 2010 09:10:39 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Amazon and its cloud]]></category>
		<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Riptano]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2473</guid>
		<description><![CDATA[Back in March, I talked with Jonathan Ellis of Rackspace, who runs the Apache Cassandra project. I started drafting a blog post then, but never put it up. Then Jonathan cofounded Riptano, a company to commercialize Cassandra, and so I talked with him again in May. Well, I&#8217;m finally finding time to clear my Cassandra/Riptano [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Back in March, I talked with Jonathan Ellis of Rackspace, who runs the Apache Cassandra project. I started drafting a blog post then, but never put it up. Then Jonathan cofounded Riptano, a company to commercialize Cassandra, and so I talked with him again in May. Well, I&#8217;m finally finding time to clear my Cassandra/Riptano backlog. I&#8217;ll cover the more technical parts below, and the more business- or usage-oriented ones in <a href="http://www.dbms2.com/2010/07/06/riptano-and-cassandra-adoption/" >a companion Cassandra/Riptano post</a>.</p>
<p style="margin-bottom: 0in;">Jonathan&#8217;s core claims for Cassandra include:</p>
<ul>
<li>Cassandra is shared-nothing.</li>
<li>Cassandra has good approaches to 	replication and partitioning, right out of the box.</li>
<li>In particular, Cassandra is good 	for use cases that distribute a database around the world and want 	to access it at “local” latencies. (Indeed, Jonathan asserts 	that non-local replication is a significant non-big-data Cassandra 	use case.)</li>
<li>Cassandra&#8217;s scale-out is 	application-transparent, unlike sharded MySQL&#8217;s.</li>
<li> Cassandra is fast at both appends 	and range queries, which would be hard to accomplish in a pure 	key-value store.</li>
</ul>
<p style="margin-bottom: 0in;">In general, Jonathan positions Cassandra as being best-suited to handle a small number of operations at high volume, throughput, and speed. The rest of what you do, as far as he&#8217;s concerned, may well belong in a more traditional SQL DBMS.  <span id="more-2473"></span></p>
<p style="margin-bottom: 0in;">Further highlights of our talks included, as best I understood them:</p>
<ul>
<li>Cassandra is based in part on both Google&#8217;s <strong>BigTable</strong> paper of 2006 and Amazon&#8217;s <strong>Dynamo</strong> paper of 2007.
<ul>
<li>The core of what Cassandra takes 	from BigTable is based on <strong>log-structured merge trees,</strong> which 	actually entered the computer science literature in 1996.</li>
<li>Cassandra&#8217;s approach to horizontal scaling, replication, failover, etc. seems to be based on Dynamo.</li>
</ul>
</li>
<li>There seems to be <strong>a logical 	concept of “row”</strong> in Cassandra, or it&#8217;s at least meaningful 	to use the SQL/relational concept of a “row” when talking about 	Cassandra data. However, Cassandra is closer to being a <strong>column-based 	data store</strong> than a row-based one. (Not the same thing, but 	closer.)</li>
<li>Even so, it only takes a single 	seek to return a whole Cassandra “row”.</li>
<li>Cassandra 	writes data quite differently from the way a classical OLTP DBMS 	would.
<ul>
<li><strong>Cassandra writes just the data 	elements</strong><span style="font-weight: normal;"> – i.e., fields – </span><strong>that are actually being inserted or changed,</strong> not whole 	rows.</li>
<li>One benefit is that Cassandra data 	is very <strong>sparse.</strong> NULLs aren&#8217;t stored in any way, and hence in 	particular take up no space.</li>
<li>Another benefit – and one of the 	core concepts of Cassandra – is that <strong>you can implicitly assume 	different schemas for different rows of the same “table.”</strong> In 	particular, you can add data for columns that you didn&#8217;t envision 	when you first started storing “rows” of the same “table.”</li>
<li><strong>Writes are collected into 	sorted “memtables,” which from time to time are sent to disk.</strong> Once data gets to disk, it&#8217;s <strong>immutable,</strong> except for occasional 	merge/reorganization/garbage collection.
<ul>
<li>Jonathan claims, plausibly, that this makes write throughput very fast (because the I/O is fundamentally sequential in nature).</li>
<li>The default as to how long data typically stays in memory before it gets persisted to disk is “whichever comes first of {64 MB written, 300k updates, 1 hour}.”</li>
</ul>
</li>
<li>Cassandra has <strong>durability</strong> – 	guaranteed non-loss of data – assuming fsync is turned on. fsync 	seems to create a 15% or so overhead.</li>
<li>However, Cassandra has <strong>no 	concept of a “transaction.”</strong></li>
<li>As one would 	expect, data can be read even before it has been persisted to disk.</li>
</ul>
</li>
<li>According to 	Jonathan, Cassandra can do about 14,000 writes or 7,000 reads per 	second, on a quad-core server.
<ul>
<li>Those figures scale pretty 	linearly with the number of servers. (There&#8217;s some overhead for 	network latency.)</li>
<li>Those figures assume a five-column 	row.</li>
<li>Cassandra&#8217;s write-performance 	figures are only “mildly sensitive” to the width of the row. 	E.g., doubling row width only gives a 15-20% throughput hit, due to 	some fixed per-row overhead. That said, I imagine going 100X in row 	width would create a major slowdown, although perhaps while 	measuring width more in bytes than in column count.</li>
<li>Cassandra&#8217;s <span style="color: #000080;"><span lang="zxx"><span style="text-decoration: underline;"><a href="http://racklabs.com/%7Ebwilliam/cassandra/04vs05vs06.png" onclick="javascript:pageTracker._trackPageview('/racklabs.com');">performance</a></span></span></span> has been growing nicely in each point release. Jonathan thinks this 	general trend will continue.</li>
</ul>
</li>
<li>Jonathan thinks Cassandra is 	pretty good at keeping your data safe.
<ul>
<li>Each node has a commit log.</li>
<li>When a node goes down, its writes 	are buffered until it comes back up.</li>
</ul>
</li>
<li>You can run Hadoop MapReduce 	straight against Cassandra files.</li>
<li>A Cassandra node might hold 	anything from 10s of gigabytes to multiple terabytes of data. You 	might want to go with the low end if you want to have lots of cache 	hits.</li>
<li>Solid-state storage would speed up 	Cassandra reads, not writes, and is not widely used with Cassandra 	yet.</li>
<li>Jonathan says Cassandra is really 	good at handling time series data, by which I suspect he means log 	files. <a href="https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/" onclick="javascript:pageTracker._trackPageview('/www.cloudkick.com');">Cloudkick</a> is a user of this capability.</li>
</ul>
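<p style="margin-bottom: 0in;">The write path described above (field-level writes, sparse rows, sorted memtables that get flushed to immutable on-disk data, and reads that see unflushed data) can be sketched as a toy in-memory model. This is not Cassandra&#8217;s code or API. The class name, its methods, and the tiny flush threshold are all hypothetical, and the sketch leaves out the commit log, merging of flushed data, and real disk I/O.</p>

```python
class TinyLogStructuredStore:
    """Toy model of a memtable-plus-immutable-files write path; not Cassandra's code."""

    def __init__(self, flush_after_updates=3):
        # Hypothetical threshold; the defaults quoted in the post are
        # 64 MB written, 300k updates, or 1 hour, whichever comes first.
        self.flush_after_updates = flush_after_updates
        self.memtable = {}        # row key -> {column: value}, only columns actually written
        self.updates = 0
        self.flushed_tables = []  # oldest first; this sketch never modifies them after a flush

    def write(self, row_key, columns):
        # Store just the fields being inserted or changed, not a whole row.
        self.memtable.setdefault(row_key, {}).update(columns)
        self.updates += 1
        if self.updates >= self.flush_after_updates:
            self.flush()

    def flush(self):
        # "Persist" the memtable in sorted key order as a tuple of (key, columns) pairs.
        frozen = tuple((key, dict(self.memtable[key])) for key in sorted(self.memtable))
        self.flushed_tables.append(frozen)
        self.memtable = {}
        self.updates = 0

    def read(self, row_key):
        # Merge oldest to newest so later writes win; unflushed data is visible too.
        row = {}
        for table in self.flushed_tables:
            for key, columns in table:
                if key == row_key:
                    row.update(columns)
        row.update(self.memtable.get(row_key, {}))
        return row


store = TinyLogStructuredStore(flush_after_updates=2)
store.write("user1", {"name": "Ann"})
store.write("user2", {"name": "Bob", "city": "Oslo"})  # second update triggers a flush
store.write("user1", {"email": "ann@example.com"})     # a column nobody planned for up front
print(store.read("user1"))
```

<p style="margin-bottom: 0in;">Note that the two “rows” end up with different sets of columns, and that absent columns take up no space at all, which is the sparseness point made above.</p>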
<p style="margin-bottom: 0in;">I certainly didn&#8217;t grasp everything about Cassandra replication and partitioning strategies. That wasn&#8217;t the focus of our talks, and anyway I got the impression they are so flexible that there&#8217;s little that can firmly be said about them. But I did get the impressions:</p>
<ul>
<li>You set your consistency rules in 	the Cassandra API, not on a per-table basis. (I think this means 	that a lack of administrative tools is supposedly a feature, not a 	drawback.)</li>
<li>As a practical matter, Cassandra 	users commonly take one of two approaches to consistency:
<ul>
<li><a href="http://www.dbms2.com/2010/05/01/ryw-read-your-writes-consistency/" >RYW consistency</a>, most 	commonly with N = 3 and R = W = 2.</li>
<li>Geographically dispersed eventual 	consistency.</li>
</ul>
</li>
<li>Cassandra data is most commonly 	distributed via consistent hashing, but other options are 	“pluggable.”</li>
<li>If you add a node, the busiest node automagically decides to ship some data over, reducing its load. Of course, this only works if you get the new node on before the old node is so maxed out it doesn&#8217;t have time to do the shipping.</li>
</ul>
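<p style="margin-bottom: 0in;">Two of the ideas above lend themselves to a short sketch. The first is why N = 3 with R = W = 2 gives read-your-writes behavior: any read quorum must overlap any write quorum whenever R + W > N. The second is consistent hashing, where adding a node moves only the keys that the new node takes over. The names below are hypothetical, the use of MD5 is an assumption made for illustration, and the ring is a generic textbook version rather than Cassandra&#8217;s partitioner. In particular, it does not model the busiest node deciding to hand off data.</p>

```python
import bisect
import hashlib


def quorums_overlap(n, r, w):
    """True when every read quorum must intersect every write quorum (R + W > N)."""
    return r + w > n


class HashRing:
    """Minimal consistent-hash ring: a key belongs to the first node clockwise from its hash."""

    def __init__(self, nodes=()):
        self._points = []  # sorted list of (hash, node) pairs
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value):
        # MD5 is used only as a stable, well-spread hash; that choice is an assumption.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        bisect.insort(self._points, (self._hash(node), node))

    def owner(self, key):
        index = bisect.bisect_right(self._points, (self._hash(key), ""))
        # Wrap around to the first point when the key hashes past the last node.
        return self._points[index % len(self._points)][1]


print(quorums_overlap(3, 2, 2))  # the common N = 3, R = W = 2 setting
print(quorums_overlap(3, 1, 1))  # eventual consistency only

keys = ["row%d" % i for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {key: ring.owner(key) for key in keys}
ring.add("node-d")
after = {key: ring.owner(key) for key in keys}
moved = [key for key in keys if before[key] != after[key]]
print(len(moved), "of", len(keys), "keys moved, all of them to node-d")
```

<p style="margin-bottom: 0in;">With one point per node, as here, the load split can be quite uneven. That is one reason practical systems place nodes on the ring more carefully than this sketch does.</p>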
<p style="margin-bottom: 0in;">When we talked in March, the next release of Cassandra was going to be 0.7. Cassandra 0.7 was going to be a performance/scalability release, for example fixing the flaw that garbage collection read rows into memory one at a time. After that, Cassandra 0.8 was to be a feature release, with one planned feature being more automatic index management and/or materialized-view-like capability, so as to reduce the burden on Cassandra developers of schema management.</p>
<p style="margin-bottom: 0in;"><em><strong>Related links</strong></em></p>
<ul>
<li>My March <a href="../2010/03/12/some-nosql-links/">NoSQL links post</a> included the Google and Amazon papers</li>
<li>The <a href="https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/" onclick="javascript:pageTracker._trackPageview('/www.cloudkick.com');">March 	2, 2010 Cloudkick post</a> also linked above goes into a lot of 	detail, including what they think is great about Cassandra and what 	they think is still missing</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/07/06/cassandra-technical-overview/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Why you should go to XLDB4</title>
		<link>http://www.dbms2.com/2010/07/01/why-you-should-go-to-xldb4/</link>
		<comments>http://www.dbms2.com/2010/07/01/why-you-should-go-to-xldb4/#comments</comments>
		<pubDate>Thu, 01 Jul 2010 04:23:24 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2456</guid>
		<description><![CDATA[Scientific data commonly:

Comes in large volumes
Is machine-generated
Is augmented by synthetic and/or 	derived data
Has a spatial and/or temporal 	structure

In those respects, it is akin to some of the hottest areas for big data analytics, including:

Investment trade data – big, 	partly machine generated, augmented (often), temporal
Web/network log data – big, 	machine-generated, post-processed into derived form, temporal
Marketing analytic [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Scientific data commonly:</p>
<ul>
<li>Comes in large volumes</li>
<li>Is machine-generated</li>
<li>Is augmented by synthetic and/or 	derived data</li>
<li>Has a spatial and/or temporal 	structure</li>
</ul>
<p style="margin-bottom: 0in;">In those respects, it is akin to some of the hottest areas for big data analytics, including:</p>
<ul>
<li>Investment trade data – big, 	partly machine generated, augmented (often), temporal</li>
<li>Web/network log data – big, 	machine-generated, post-processed into derived form, temporal</li>
<li>Marketing analytic data – big, 	post-processed into derived form</li>
<li>Genomic data</li>
</ul>
<p style="margin-bottom: 0in;">So when Jacek Becla started the <a href="http://www.dbms2.com/2009/09/12/xldb-scid/" >XLDB</a> conferences on the premise that <strong>scientific and big data analytic challenges have a lot in common,</strong> he had a point. There are several tough database problems that the science-focused folks have taken the lead in thinking about, but which are soon going to matter to the commercial world as well. And that&#8217;s one of two big reasons why you should consider participating in <a href="http://www-conf.slac.stanford.edu/xldb10/" onclick="javascript:pageTracker._trackPageview('/www-conf.slac.stanford.edu');">XLDB4, October 6-7, at the SLAC facility in Menlo Park, CA</a>, as an attendee, sponsor, or both.</p>
<p style="margin-bottom: 0in;">The other big reason is that it is <strong>important for the world that XLDB succeed.</strong> <span id="more-2456"></span>Computer technology to analyze global warming is lacking; better database technology is one of the ways it could improve. Database technology also has important potential contributions to make in medical research and other worthy endeavors, and in a lot of purer science too (Jacek himself is an astronomy guy).</p>
<p style="margin-bottom: 0in;">Other reasons to get involved with XLDB4 include:</p>
<ul>
<li><strong>It doesn&#8217;t cost much.</strong> The 	whole thing is done in academic conference dollar amounts, with low 	attendance fees, the venue use probably donated by SLAC, and (I 	would guess) low sponsorship fees as well.</li>
<li><strong>It&#8217;s fun.</strong> I think I attended more sessions in two days at XLDB3 than at all the other conferences I went to last year combined.</li>
</ul>
<p style="margin-bottom: 0in;">All in all, I intend to spend three whole days attending conference sessions that week, which is something I almost never do. But it&#8217;s for my two favorite causes – scientific data management and liberty/privacy. Stay tuned for news on the latter front soon.</p>
<p style="margin-bottom: 0in;"><em><strong>Related links</strong></em></p>
<ul>
<li>The <a href="http://www-conf.slac.stanford.edu/xldb10/" onclick="javascript:pageTracker._trackPageview('/www-conf.slac.stanford.edu');">XLDB4 conference web site</a></li>
<li>My original post on <a href="http://www.dbms2.com/2009/10/03/issues-in-scientific-data-management/" >issues in scientific data management</a>, most of which will be discussed at XLDB4</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/07/01/why-you-should-go-to-xldb4/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Cloudera Enterprise and Hadoop evolution</title>
		<link>http://www.dbms2.com/2010/06/30/cloudera-enterprise-hadoop-evolution/</link>
		<comments>http://www.dbms2.com/2010/06/30/cloudera-enterprise-hadoop-evolution/#comments</comments>
		<pubDate>Wed, 30 Jun 2010 17:22:27 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Data integration and middleware]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Market share]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Pricing]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2440</guid>
		<description><![CDATA[I talked with Cloudera a couple of weeks ago in connection with the impending release of Cloudera Enterprise. I&#8217;d say:  

If you are or want to be a serious 	MapReduce user – and you&#8217;re past the “play around over the 	weekend” stage &#8212; you probably should have either:

A serious non-DBMS MapReduce 	distribution.
MapReduce integrated into your [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">I talked with Cloudera a couple of weeks ago in connection with the impending release of Cloudera Enterprise. I&#8217;d say:  <span id="more-2440"></span></p>
<ul>
<li>If you are or want to be a serious 	MapReduce user – and you&#8217;re past the “play around over the 	weekend” stage &#8212; you probably should have either:
<ul>
<li>A serious non-DBMS MapReduce 	distribution.</li>
<li>MapReduce integrated into your 	analytic DBMS.</li>
<li>Both.</li>
</ul>
</li>
<li>The obvious choice for non-DBMS 	MapReduce is Hadoop.</li>
<li>The obvious choice for a Hadoop 	distribution is <strong>Cloudera Enterprise.</strong></li>
<li>Cloudera Enterprise has three main 	aspects, in an inseparable bundle:
<ul>
<li>Distributions for a double-digit 	number of open source projects. It&#8217;s nice having all that in one 	package – unless, of course, you like playing with Tinkertoys.</li>
<li>Proprietary Cloudera code.</li>
<li>Cloudera support.</li>
</ul>
</li>
<li>Cloudera says its proprietary code 	is and in the future is planned to be concentrated – at least in 	large part &#8212; on integrating open source technology with closed 	source products. This has the virtue of being targeted directly at 	that segment of the market which has proven it&#8217;s actually willing to 	pay money for software.</li>
<li>Cloudera Enterprise areas of 	focus, now and in the presumed future, include:
<ul>
<li><strong>Core Hadoop engine,</strong> which 	Cloudera says is quite predictably and appropriately evolving more 	slowly than the tools around it.</li>
</ul>
<ul>
<li><strong>Development, management and 	administrative tools,</strong> including:
<ul>
<li><strong>Pig</strong> and <strong>Hive</strong>. Cloudera says &gt;70% 	of Facebook Hadoop jobs are initiated through Hive, and the same is 	true of Yahoo and Pig.</li>
<li>Connectivity to commercial tools.</li>
<li>The product formerly known as 	“Cloudera Desktop.”</li>
</ul>
</li>
<li><strong>Workflow</strong>, which in this context 	refers to letting you create a Hadoop application as a sequence of 	small steps, rather than forcing you to kluge it into being one 	unwieldy thing. At the moment, this is much less widely adopted than 	Pig and Hive, but Cloudera has high hopes for it, because of its 	obvious benefits in modularity and manageability.</li>
<li><strong>Quasi-DBMS technology.</strong> Besides Hive and Pig, this includes <strong>HBase.</strong> Cloudera says there has 	been considerable demand for HBase, and it is pleased that project 	is now mature enough to ship. Cloudera stresses that it intends 	HBase not for OLTP, but as an adjunct to analytic processing. E.g., 	Cloudera suggests HBase would be a fine vehicle for replicating 	dimension tables across each node of a cluster.</li>
<li><strong>Data connectivity, </strong><span style="font-weight: normal;">e.g. 	to MySQL or to sensor log files.</span></li>
</ul>
</li>
<li>Cloudera Enterprise pricing is 	well below DBMS prices – not by a full order of magnitude, if I&#8217;m 	right about everybody&#8217;s quantity discount policies, but even so by a 	lot. Details are NDA.</li>
</ul>
<p style="margin-bottom: 0in;">Cloudera sometimes sends confusing signals about its beliefs and strategies. For example, one can get different stories depending on whether one talks to:</p>
<ul>
<li>Somebody at Cloudera who comes 	primarily from the user and open source communities.</li>
<li>Somebody at Cloudera who has 	actually worked at a software company before.</li>
</ul>
<p style="margin-bottom: 0in;">But I predict that Cloudera will now stick for a while with more or less the strategy outlined above.</p>
<p style="margin-bottom: 0in;">Naturally, we also talked about Hadoop adoption. Highlights of that part – no doubt somewhat biased towards Cloudera&#8217;s own customer base &#8212; included:</p>
<ul>
<li>Notwithstanding <a href="http://www.dbms2.com/2009/04/14/ebay-thinks-mpp-dbms-clobber-mapreduce/" >eBay&#8217;s prior 	skepticism about MapReduce</a>, it is quoted saying nice things in a Cloudera press release, 	and has apparently become quite a large Hadoop user, starting out 	with a search-quality use case.</li>
<li>Typical Hadoop deployment sizes 	are 10 nodes or so when experimenting, 80-500+ in production.</li>
<li>10 terabytes/node – I&#8217;m pretty 	sure Cloudera meant of user data &#8212; is not inconceivable, so a 	cost-conscious 500-node user could have 5 petabytes of data managed 	by Hadoop.</li>
<li>Cloudera has half a dozen 	customers at the 75+ node production level.</li>
<li>Web and financial services are the 	two vertical markets moving most aggressively into Hadoop 	production. The government is also in significant Hadoop production, 	but the details of that are classified.</li>
<li>Web uses for Hadoop include:
<ul>
<li>Clickstream – sessionization, 	etc. – that&#8217;s a super-mainstream use.</li>
<li>Search – analyzing search 	attempts in conjunction with structured data.</li>
<li>Machine learning (for ad serving, 	etc.).</li>
</ul>
</li>
<li>Financial services uses for Hadoop 	include:
<ul>
<li>Internal trading rule 	enforcement/fraud detection.</li>
<li>Complex ETL.</li>
<li>Portfolio risk assessment 	(typically overnight).</li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;">None of this is inconsistent with previous surveys of <a href="http://www.dbms2.com/2009/10/10/enterprises-using-hadoo/" >Hadoop use cases</a>.</p>
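<p style="margin-bottom: 0in;">To make the clickstream item concrete, here is a minimal sketch of sessionization in plain Python rather than actual Hadoop code. The event layout, the 30-minute inactivity timeout, and all names are assumptions for illustration. The sort stands in for the shuffle/sort phase and the per-user loop for a reducer.</p>

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical click events: (user_id, unix_timestamp_seconds, url).
CLICKS = [
    ("u1", 1000, "/home"),
    ("u1", 1060, "/search"),
    ("u2", 1100, "/home"),
    ("u1", 5000, "/home"),      # more than 30 minutes after u1's previous click
    ("u2", 1200, "/product/7"),
]

SESSION_GAP_SECONDS = 30 * 60   # assumed inactivity timeout, not a standard


def sessionize(clicks, gap=SESSION_GAP_SECONDS):
    """Return a list of (user_id, [clicks...]) sessions.

    Sorting by (user, time) plays the role of the shuffle/sort phase;
    the per-user loop plays the role of a reducer.
    """
    sessions = []
    ordered = sorted(clicks, key=itemgetter(0, 1))
    for user, user_clicks in groupby(ordered, key=itemgetter(0)):
        current = []
        for click in user_clicks:
            # Start a new session when the gap since the last click exceeds the timeout.
            if current and click[1] - current[-1][1] > gap:
                sessions.append((user, current))
                current = []
            current.append(click)
        sessions.append((user, current))
    return sessions


SESSIONS = sessionize(CLICKS)
for user, session in SESSIONS:
    print(user, [url for _, _, url in session])
```

<p style="margin-bottom: 0in;">With the sample data, user u1 ends up with two sessions and user u2 with one. In a real cluster the grouping by user is what gets distributed across nodes.</p>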
<p style="margin-bottom: 0in; font-style: normal;">Various users talked at the Hadoop Summit this week. I wasn&#8217;t there, and won&#8217;t write about their stories for now. That said, <a href="http://www.slideshare.net/kevinweil/hadoop-at-twitter-hadoop-summit-2010" onclick="javascript:pageTracker._trackPageview('/www.slideshare.net');">Twitter&#8217;s slide deck</a> from same has some interesting stuff, including:</p>
<ul>
<li><span style="font-style: normal;">7 	TB/day ETLed from MySQL.</span></li>
<li><span style="font-style: normal;">Petabytes-being-stored 	accordingly coming soon.</span></li>
<li><span style="font-style: normal;">Open 	sourcing their ETL tool Crane.</span></li>
<li><span style="font-style: normal;">3-4X 	LZO compression at little CPU cost.</span></li>
<li><span style="font-style: normal;">HBase is more usable for them than HDFS, which isn&#8217;t mutable enough.</span></li>
<li><span style="font-style: normal;">Pig 	= 5% of code and coding effort vs. vanilla Hadoop at 30% or less 	performance hit.</span></li>
</ul>
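<p style="margin-bottom: 0in;">A back-of-the-envelope sketch of why petabytes follow quickly from those numbers. It assumes decimal units, that the 7 TB/day figure is uncompressed, and that replication can be ignored. None of that is stated in the deck bullets above, so treat the output as rough.</p>

```python
# Figures taken from the bullets above.
INGEST_TB_PER_DAY = 7
LZO_RATIOS = (3, 4)          # "3-4X" compression

# Assumptions for illustration only: decimal units (1 PB = 1000 TB),
# the 7 TB/day figure is uncompressed, and replication is ignored.
TB_PER_PB = 1000


def days_to_fill(petabytes, tb_per_day, compression_ratio=1):
    """Days of ingest needed before stored data reaches the given size."""
    stored_tb_per_day = tb_per_day / compression_ratio
    return petabytes * TB_PER_PB / stored_tb_per_day


print(round(days_to_fill(1, INGEST_TB_PER_DAY)))          # uncompressed
for ratio in LZO_RATIOS:
    print(ratio, round(days_to_fill(1, INGEST_TB_PER_DAY, ratio)))
```

<p style="margin-bottom: 0in;">Under those assumptions a petabyte of raw data arrives in under five months, and even with 3-4X compression a petabyte of stored data is reached in well under two years.</p>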
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/06/30/cloudera-enterprise-hadoop-evolution/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>The most important part of the “social graph” is neither social nor a graph</title>
		<link>http://www.dbms2.com/2010/06/08/profile-of-revealed-preferences/</link>
		<comments>http://www.dbms2.com/2010/06/08/profile-of-revealed-preferences/#comments</comments>
		<pubDate>Tue, 08 Jun 2010 05:18:36 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Games and virtual worlds]]></category>
		<category><![CDATA[Liberty and privacy]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2235</guid>
		<description><![CDATA[“Social graph” is a highly misleading term, and so is “social network analysis.” By this I mean:
There&#8217;s something akin to “social graphs” and “social network analysis” that is more or less worthy of all the current hype – but graphs and network analysis are only a minor part of the whole story.
In particular, the most [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">“Social graph” is a highly misleading term, and so is “social network analysis.” By this I mean:</p>
<p><strong>There&#8217;s something akin to “social graphs” and “social network analysis” that is more or less worthy of all the current hype – but graphs and network analysis are only a minor part of the whole story.</strong></p>
<p style="margin-bottom: 0in;">In particular, <strong>the most important parts of the Facebook “social graph” are neither social nor a graph. </strong><span style="font-weight: normal;">Rather, what&#8217;s really important is an aggregate</span><strong> Profile of Revealed Preferences</strong><span style="font-weight: normal;">, of which person-to-person connections or other things best modeled by a graph play only a small part.</span></p>
<p style="margin-bottom: 0in;"><span id="more-2235"></span>Let me hasten to note that – even when viewed narrowly &#8212; the ideas of “social graph” and “<a href="../2009/08/21/social-network-analysis-aka-relationship-analytics/">social network analysis</a>” do have significance. Nontrivial use cases to date for big data social network analysis include:</p>
<ul>
<li>Intelligence agencies identify and 	analyze terrorist networks. Corporations and civilian law 	enforcement do the same for fraud networks.</li>
<li>Telephone companies use calling 	data to figure out which of their customers are most likely to 	influence which other customers in the decision to keep or change 	service providers. (Frankly, I find that rather creepy.)</li>
<li>Social networks figure out which 	other members you&#8217;re likely to know, and encourage you to connect 	with them.</li>
</ul>
<p style="margin-bottom: 0in;">Epidemiologists aspire to add to that list, based on their success to date using much more micro forms of social network analysis. But after that, I&#8217;m running out of examples. Sure, graph analytics is good for a bunch of other things (e.g., biology at the genetic or molecular level), but those have little or nothing to do with “social graphs” or social network analysis as they are commonly understood.</p>
<p style="margin-bottom: 0in;"><em>Note: Of course, it is also the case that everything can be modeled by entity-attribute-value triples, and those can always be modeled by graphs. But so what?</em></p>
<p style="margin-bottom: 0in;">Let&#8217;s consider what, in a marketer&#8217;s ideal world, would go into your Profile of Revealed Preferences. Raw data might include:</p>
<ul>
<li><strong>Personally identifying information. </strong>Duh. This is what makes everything else possible.</li>
<li><strong>Purchase transaction data.</strong> Lots of it. Like, everything on your credit card statements.</li>
<li><strong>Demographic and lifestyle 	information.</strong> Address, date of birth, educational history, race, 	household composition, and so on.</li>
<li><strong>Affiliations.</strong> Politics, 	religion, group membership of any kind. (OK, that&#8217;s partly social.)</li>
<li><strong>Explicitly stated opinions, 	preferences and desires,</strong><span style="font-weight: normal;"> including:</span>
<ul>
<li>Any surveys you have filled out.</li>
<li><strong>Any recommendations you have 	made</strong> (e.g., through the Facebook Like feature).</li>
<li>The text of anything you&#8217;ve 	written and posted – and, very ideally, of your private emails as 	well.</li>
<li>Any <strong>wish lists</strong> you&#8217;ve 	filled in.</li>
</ul>
</li>
<li><strong>Attention information.</strong> What 	you clicked on, what you looked at, and all that stuff website 	owners track.</li>
<li><strong>Your movements, </strong><span style="font-weight: normal;">to 	the extent they are tracked. (E.g., via Foursquare and the like.)</span></li>
<li><strong>Your gaming activities</strong><span style="font-weight: normal;"> and the like. (This is social mainly to the extent it overlaps with 	other categories I&#8217;ve already mentioned.)</span></li>
<li><strong>Your medical information.</strong><span style="font-weight: normal;"> </span></li>
<li><strong>Who you communicate with, and 	what you communicate with them about.</strong><span style="font-weight: normal;"> (Hey! There&#8217;s something else social!)</span></li>
<li><span style="font-weight: normal;">Similar </span><strong>information about the people you communicate with.</strong></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-weight: normal;">My core </span><strong>privacy</strong><span style="font-weight: normal;"> thoughts about that data include:</span></p>
<ul>
<li><strong>Individuals deserve the right 	to control all that information.</strong><span style="font-weight: normal;"> At a minimum, they deserve total control over how the data (raw or 	derived) is passed from the service – e.g., website – where it 	naturally resides (e.g., where it is originated) to any other place.</span></li>
<li>Given a chance, <strong>individuals would make fine-grained choices about what parts of their Profile of Revealed Preferences are available to which organizations.</strong> Reasons include:
<ul>
<li>Individuals have rather complex trust relationships with different kinds of merchants and marketers.</li>
<li>Consumers get different benefits from sharing information with different kinds of merchants and marketers. (Sometimes personalization is a large benefit. Sometimes it&#8217;s just creepy. And some companies actively bribe you to give them information they can use to sell to you.)</li>
</ul>
</li>
</ul>
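<p style="margin-bottom: 0in;">As a rough illustration of what “fine-grained choices” could look like in data terms, here is a minimal sketch in which each organization sees only the profile categories an individual has explicitly granted. The category names, organization names, and deny-by-default rule are all hypothetical.</p>

```python
# Hypothetical profile categories, loosely following the raw-data list above.
PROFILE = {
    "purchases": ["book", "running shoes"],
    "location": ["Boston"],
    "medical": ["allergy: pollen"],
    "wish_list": ["tent"],
}

# Hypothetical per-organization grants chosen by the individual.
# Anything not listed is withheld by default.
GRANTS = {
    "bookstore.example": {"purchases", "wish_list"},
    "adnetwork.example": set(),
}


def visible_profile(profile, grants, organization):
    """Return only the profile categories the individual has granted."""
    allowed = grants.get(organization, set())
    return {category: values for category, values in profile.items()
            if category in allowed}


print(visible_profile(PROFILE, GRANTS, "bookstore.example"))
print(visible_profile(PROFILE, GRANTS, "adnetwork.example"))
print(visible_profile(PROFILE, GRANTS, "unknown.example"))
```

<p style="margin-bottom: 0in;">The hard part is not this lookup. It is getting every holder of the data to honor the same grants.</p>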
<p style="margin-bottom: 0in;">When one frames things this way, two rather difficult technological questions naturally arise.</p>
<ol>
<li>Suppose, implausibly, that a 	single entity were allowed to control and use (for marketing) all of 	your Profile of Revealed Preferences information. How would they 	store and analyze it?</li>
<li>How does the answer to #1 change 	because control over the information will, in fact, be fragmented?</li>
</ol>
<p style="margin-bottom: 0in;">It&#8217;s tough enough to answer these questions for data about one person. Trying to include all but the simplest information about other people is and will for years remain quite infeasible. So, for the most part, <strong>this is not “social” information.</strong></p>
<p style="margin-bottom: 0in;">It&#8217;s also <strong>not naturally a “graph.”</strong> Similarly, it is <strong>not a good candidate for network analysis.</strong> To see why, let me outline <strong>why I used the name “Profile of Revealed Preferences”:</strong></p>
<ul>
<li>The reason marketers want this 	data is, mainly, because they want to know what appeals to you, and 	how strongly you feel about it.</li>
<li>The analytic process often entails 	taking explicit choices you have made, and inferring other 	preferences from them.</li>
<li>The output of the analytic process 	is often one or more “scores” that then get plugged into various 	selection algorithms to determine what you should be shown or 	offered. At least implicitly, these algorithms are predicting what 	you will or won&#8217;t respond well to.</li>
</ul>
<p style="margin-bottom: 0in;">Not much graph-like there.</p>
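<p style="margin-bottom: 0in;">The three steps above can be sketched in a few lines. The weights, events, and offers below are invented for illustration, and the selection rule is deliberately trivial. What matters is the shape of the data: a flat set of per-person scores, with no edges or traversals anywhere.</p>

```python
from collections import defaultdict

# Hypothetical weights: how strongly each kind of explicit choice
# is taken to signal a preference. Real systems would fit these from data.
EVENT_WEIGHTS = {"purchase": 3.0, "like": 2.0, "click": 0.5}

# Hypothetical revealed choices for one person: (event_type, category).
EVENTS = [
    ("purchase", "camping"),
    ("like", "camping"),
    ("click", "cooking"),
    ("click", "cooking"),
    ("like", "jazz"),
]

# Hypothetical offers, each tagged with one category.
OFFERS = {"tent sale": "camping", "knife set": "cooking", "concert": "jazz"}


def preference_scores(events, weights=EVENT_WEIGHTS):
    """Infer a per-category score from explicit choices."""
    scores = defaultdict(float)
    for event_type, category in events:
        scores[category] += weights.get(event_type, 0.0)
    return dict(scores)


def select_offers(offers, scores, top_n=2):
    """Plug the scores into a trivial selection rule: highest score first."""
    ranked = sorted(offers, key=lambda name: scores.get(offers[name], 0.0),
                    reverse=True)
    return ranked[:top_n]


SCORES = preference_scores(EVENTS)
print(SCORES)
print(select_offers(OFFERS, SCORES))
```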
<p style="margin-bottom: 0in;">This post has gotten pretty long, so I&#8217;ll stop here without spelling anything else out. But questions I still hope to address down the road include:</p>
<ul>
<li>How should Profile of Revealed Preferences data be stored?</li>
<li>Suppose we want to pass around 	derived results and not the raw data. How could we ever get to 	standards that would make such interchange realistic?</li>
<li>If we only have raw data to pass 	around, what are the implications for privacy, liberty, and the 	structure of the online industries?</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/06/08/profile-of-revealed-preferences/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>VoltDB finally launches</title>
		<link>http://www.dbms2.com/2010/05/25/voltdb-finally-launches/</link>
		<comments>http://www.dbms2.com/2010/05/25/voltdb-finally-launches/#comments</comments>
		<pubDate>Tue, 25 May 2010 07:15:04 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Games and virtual worlds]]></category>
		<category><![CDATA[In-memory DBMS]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Michael Stonebraker]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Theory and architecture]]></category>
		<category><![CDATA[VoltDB and H-Store]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2201</guid>
		<description><![CDATA[VoltDB is finally launching today. As is common for companies in sectors I write about, VoltDB &#8212; or just &#8220;Volt&#8221; &#8212; has discovered the virtues of embargoes that end 12:01 am. Let&#8217;s go straight to the technical highlights:

VoltDB is based on the H-Store technology, which I wrote about in February, 2009. Most of what I [...]]]></description>
			<content:encoded><![CDATA[<p>VoltDB is finally launching today. As is common for companies in sectors I write about, VoltDB &#8212; or just &#8220;Volt&#8221; &#8212; has discovered the virtues of embargoes that end 12:01 am. Let&#8217;s go straight to the technical highlights:</p>
<ul>
<li>VoltDB is based on the <a href="http://www.dbms2.com/2008/02/19/h-store-architecture/" >H-Store</a> technology, which I wrote about in February, 2008. Most of what I said about H-Store then applies to VoltDB today.</li>
<li>VoltDB is a no-apologies ACID relational DBMS, which runs entirely in RAM.</li>
<li>VoltDB has rather limited SQL. (One example: VoltDB can&#8217;t do SUMs in SQL.) However, VoltDB guy Tim Callaghan (Mark Callaghan&#8217;s lesser-known but nonetheless smart brother) asserts that if you code up the missing functionality, it&#8217;s almost as fast as if it were present in the DBMS to begin with, because there&#8217;s no added I/O from the handoff between the DBMS and the procedural code. (The data&#8217;s in RAM one way or the other.)</li>
<li>VoltDB&#8217;s Big Conceptual Performance Story is that it does away with most locks, latches, logs, etc., and also most context switching.</li>
<li>In particular, you&#8217;re supposed to partition your data and architect your application so that most transactions execute on a single core. When you can do that, you get VoltDB&#8217;s performance benefits. To the extent you can&#8217;t, you&#8217;re in two-phase-commit performance land. (More precisely, you&#8217;re doing 2PC for multi-core writes, which is surely a major reason that multi-core reads are a lot faster in VoltDB than multi-core writes.)</li>
<li>VoltDB has a little less than one DBMS thread per core. When the data partitioning works as it should, you execute a complete transaction in that single thread. Poof. No context switching.</li>
<li>A transaction in VoltDB is a Java stored procedure. (The early idea of Ruby on Rails in lieu of the Java/SQL combo didn&#8217;t hold up performance-wise.)</li>
<li>Solid-state memory is not a viable alternative to RAM for VoltDB. Too slow.</li>
<li>Instead, VoltDB lets you snapshot data to disk at tunable intervals. &#8220;Continuous&#8221; is one of the options, wherein a new snapshot starts being made as soon as the last one completes.</li>
<li>In addition, VoltDB will also spool a kind of transaction log to the target of your choice. (Obvious choice: An analytic DBMS such as Vertica, but there&#8217;s no such connectivity partnership actually in place at this time.)</li>
</ul>
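<p>To illustrate the single-partition point, here is a minimal sketch of how a transaction might be classified by the partitions its keys land on. This is not VoltDB code or VoltDB&#8217;s actual hashing scheme. The partition count, the CRC32-based hash, and the function names are assumptions chosen only to make the idea runnable.</p>

```python
import zlib

NUM_PARTITIONS = 4   # hypothetical; think of one partition per core


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a partitioning-key value to a partition (illustrative hash only)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def classify(keys_touched):
    """Classify a transaction by the partitions its keys land on."""
    partitions = {partition_for(key) for key in keys_touched}
    if len(partitions) == 1:
        return "single-partition", partitions
    return "multi-partition", partitions


# A transaction that only touches rows for one customer stays on one partition.
print(classify(["customer-42", "customer-42"]))

# A transaction that touches many different keys will usually span partitions.
print(classify(["customer-%d" % n for n in range(20)]))
```

<p>In this toy model, the first kind of transaction is the one that can run start to finish in a single thread. The second kind is where the two-phase-commit cost would show up.</p>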
<p><span id="more-2201"></span>I should also note that when Tim Callaghan described architectural options to get around 2PC performance issues, they sounded a lot like eventual consistency. Maybe tunable <a href="http://www.dbms2.com/2010/05/01/ryw-read-your-writes-consistency/" >RYW consistency</a> isn&#8217;t in the cards, but at least there&#8217;s a NoSQL-like possibility with VoltDB.</p>
<p>VoltDB&#8217;s open source strategy is:</p>
<ul>
<li>VoltDB will be open sourced.</li>
<li>Community VoltDB will be GPLed. Professional Edition VoltDB has a non-GPL license.</li>
<li>The VoltDB Professional Edition won&#8217;t start out with features beyond the Community Edition ones, but will gain such later on. I didn&#8217;t get the sense the plans for those features were completely baked yet, but ideas mentioned included:
<ul>
<li>Management/monitoring tools.</li>
<li>Integration with expensive closed-source enterprise software products, such as ones in the management/monitoring area.</li>
<li>Yet more &#8220;extreme&#8221;/edge-case performance.</li>
</ul>
</li>
<li>Before VoltDB decided for sure that it wasn&#8217;t selling licenses, it sold a license to Getco, which also seems to be an investor in the company.</li>
</ul>
<p>VoltDB had a beta test with about 150 participants. None is in production yet, although at least a few are clearly headed there. Most VoltDB beta testers are in some kind of online business, with a particular concentration in everybody&#8217;s new favorite market, online gaming. Most of the rest are in investment/trading &#8212; a major target market for at least three different Mike Stonebraker companies &#8212; and a few are in telecom. VoltDB assures me that some of the beta users are companies one actually has heard of before, but VoltDB is not in a position to name any of those.</p>
<p>VoltDB is not ideally suited for a classic order management system, since you&#8217;d want to partition both on CustomerID and SKU, the latter because you&#8217;d constantly be updating inventory stock levels. However, this argument doesn&#8217;t apply in the case of virtual goods. Virtual goods that are sold for real money &#8212; and hence need ACID levels of transaction integrity &#8212; are thus a clear target market for VoltDB. (The example that came up was in, you guessed it, online gaming.) The other interesting use case that Tim highlighted was low-latency analytics/ELT. For reasons I didn&#8217;t totally grasp, Tim likes to call this &#8220;Stateful ELT.&#8221; (Given that the data goes into the VoltDB database before much else happens to it, I&#8217;m pretty sure I heard &#8220;ELT&#8221; correctly. But I guess I might have been mishearing &#8220;ETL&#8221;.)</p>
<p>VoltDB company highlights include:</p>
<ul>
<li>VoltDB has about a dozen employees, all but two of whom are technical. (However, I&#8217;m not sure they&#8217;re counting Andy Ellicott against the two. But then, last I heard he wasn&#8217;t full time at VoltDB.)</li>
<li>VoltDB&#8217;s venture funding status is, if I may paraphrase, &#8220;Mumble mumble.&#8221;</li>
<li>Although long separate from Vertica, VoltDB is still located in Vertica&#8217;s offices.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/05/25/voltdb-finally-launches/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>More on Sybase IQ, including Version 15.2</title>
		<link>http://www.dbms2.com/2010/05/23/sybase-iq-15/</link>
		<comments>http://www.dbms2.com/2010/05/23/sybase-iq-15/#comments</comments>
		<pubDate>Sun, 23 May 2010 08:34:28 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Application areas]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data mart outsourcing]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Market share]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Sybase]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2186</guid>
		<description><![CDATA[Back in March, Sybase was kind enough to give me permission to post a slide deck about Sybase IQ. Well, I&#8217;m finally getting around to doing so. Highlights include but are not limited to:

Slide 2 has some market success figures and so on. (&#62;3100 copies at &#62;1800 users, &#62;200 sales last year)
Slides 6-11 give more [...]]]></description>
			<content:encoded><![CDATA[<p>Back in March, Sybase was kind enough to give me permission to post <a href="http://www.monash.com/uploads/Sybase-IQ-slides-March-2010.pdf" onclick="javascript:pageTracker._trackPageview('/www.monash.com');">a slide deck about Sybase IQ</a>. Well, I&#8217;m finally getting around to doing so. Highlights include but are not limited to:</p>
<ul>
<li>Slide 2 has some market success figures and so on. (&gt;3100 copies at &gt;1800 users, &gt;200 sales last year)</li>
<li>Slides 6-11 give more detail on Sybase&#8217;s indexing and data access methods than I put into my recent <a href="http://www.dbms2.com/2010/05/17/technical-basics-of-sybase-iq/" >technical basics of Sybase IQ</a> post.</li>
<li>Slide 16 reminds us that Sybase IQ&#8217;s in-database data mining is quite competitive with what <a href="http://www.dbms2.com/2010/05/15/further-clarifying-in-database-mpp-sas/" >SAS has actually delivered with its DBMS partners</a>, even if it doesn&#8217;t have the nice architectural approach of <a href="http://www.dbms2.com/2010/02/22/netezza-twinfin/" >Aster or Netezza</a>. (I.e., Sybase IQ&#8217;s more-than-SQL advanced analytics story relies on C++ UDFs &#8212; User Defined Functions &#8212; running in-process with the DBMS.) In particular, there&#8217;s a data mining/predictive analytics library &#8212; modeling and scoring both &#8212; licensed from a small third party.</li>
<li>A number of the other later slides also have quite a bit of technical crunch. (More on some of those points below too.)</li>
</ul>
<p>Sybase IQ may have a bit of a funky architecture (e.g., no MPP), but the age of the product and the substantial revenue it generates have allowed Sybase to put in a bunch of product features that newer vendors haven&#8217;t gotten around to yet.</p>
<p>More recently, Sybase volunteered permission for me to preannounce <strong>Sybase IQ Version 15.2</strong> by a few days (it&#8217;s scheduled to come out this week). <span id="more-2186"></span>Sybase IQ 15.2 seems to be focused in large part on the government/intelligence market, with three major features being:</p>
<ul>
<li>A kind of <strong>data federation,</strong> querying external databases, that makes sense mainly in the context of rigorous security rules. (I find that confusing, since Sybase IQ&#8217;s indexes tend to hold all the information in the database, but I didn&#8217;t push the point.)</li>
<li>An upgrade to Sybase IQ&#8217;s built-in <strong>text indexing.</strong> I doubt anybody would confuse this with best-of-breed text search, but evidently the intelligence community is satisfied with less. Even before 15.2, though, Sybase IQ could do both LIKE and WHERE CONTAINS searching.</li>
<li>Improved LOB (Large OBject) management.</li>
</ul>
<p>One part of my Sybase IQ conversations I haven&#8217;t blogged yet in much detail is <strong>scale-out, concurrency, </strong>and<strong> &#8220;multiplexing.&#8221;</strong></p>
<ul>
<li>Sybase feels that Sybase IQ&#8217;s competitive sweet spot, especially in terms of performance, is reached when there are 20 or more concurrent queries.</li>
<li>In general, Sybase asserts that a shared-everything architecture is great for concurrency &#8212; just run different queries on different boxes, all against the same data.</li>
<li>The ability to use a bunch of boxes to run Sybase IQ is called &#8220;multiplexing.&#8221; This is a chargeable option, without which one is limited to a single SMP box.</li>
<li>Just under 20% of the top 250 Sybase IQ customers have multi-node scale-out configurations (vs. single-node SMP scale-up). Around 8% of all Sybase IQ customers do.</li>
<li>Sybase IQ nodes can be heterogeneous (e.g., in compute power).</li>
<li>Sybase IQ nodes can be dedicated to be read-only, or can be read-write. Indeed, Sybase IQ nodes can change roles dynamically, for example becoming write-only during nightly batch load. (I didn&#8217;t clarify whether all this applies just to nodes-as-boxes, or if some parts apply to specific processors or cores within the same box.)</li>
<li>Sybase noted that data mart outsourcers can offer differentiated SLAs (Service Level Agreements) depending upon which nodes they give which customers access to.</li>
<li>Most Sybase IQ installations start at 8 cores or more. The Sybase IQ Small Business Edition, limited to 4 cores, is not a big seller.</li>
<li>Sybase IQ has a straightforward round-robin load-balancing story via third-party technology.</li>
</ul>
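<p>For readers who want the round-robin idea spelled out, here is a minimal generic sketch. It is not Sybase&#8217;s or any third party&#8217;s actual load balancer. The node names, roles, and the rule that any node may serve a query while only read-write nodes take loads are assumptions for illustration.</p>

```python
from itertools import cycle

# Hypothetical multiplex: node name mapped to its current role.
NODES = {"node-a": "read-write", "node-b": "read-only", "node-c": "read-only"}


class RoundRobinBalancer:
    """Hand out nodes in rotation, separately for reads and writes."""

    def __init__(self, nodes):
        writers = [name for name, role in nodes.items() if role == "read-write"]
        if not writers:
            raise ValueError("at least one read-write node is required")
        self._readers = cycle(list(nodes))   # any node can serve a query
        self._writers = cycle(writers)       # only read-write nodes can load

    def route(self, is_write):
        return next(self._writers if is_write else self._readers)


balancer = RoundRobinBalancer(NODES)
print([balancer.route(is_write=False) for _ in range(4)])
print([balancer.route(is_write=True) for _ in range(2)])
```

<p>Because every node sees the same shared data, rotation alone is enough to spread concurrent queries. Changing a node&#8217;s role would amount to rebuilding the two rotations.</p>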
<p>Finally, along the way in the discussions I picked up various tidbits about the Sybase IQ user base. Unfortunately, Sybase is pretty vague in discussing database sizes &#8212; are they user data? Are they compressed? What do the numbers mean? With that huge caveat:</p>
<ul>
<li>By some metric or other, a couple of classified customers are approaching petabyte scale.</li>
<li>The largest commercial Sybase IQ customer &#8212; a credit card company &#8212; has a couple hundred terabytes or so.</li>
<li>The largest financial services Sybase IQ databases are 50-70 terabytes. This sounds low, frankly, so maybe those are compressed figures, with user data being 200+ terabytes. But I&#8217;m just speculating there.</li>
<li>Sybase IQ has a little less than 100 customers in the &#8220;data aggregator&#8221; market, which is a lot like what I call &#8220;data mart outsourcer.&#8221;</li>
<li><a href="http://www.dbms2.com/2009/08/25/sybase-iq-technical-highlights/" >Sybase IQ&#8217;s ILM technology</a> is a chargeable option, with Sybase being &#8220;cautious&#8221; about sales. Compliance is a big market driver for it.</li>
<li>Sybase IQ&#8217;s #1 vertical market is financial services. Other biggies are government, telecom, marketing services, and to some extent retail.</li>
<li>As of February, there were 40-45 production users of Sybase IQ 15.0 and 15.1.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/05/23/sybase-iq-15/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Notes on SciDB and scientific data management</title>
		<link>http://www.dbms2.com/2010/05/22/scidb-and-scientific-database-management/</link>
		<comments>http://www.dbms2.com/2010/05/22/scidb-and-scientific-database-management/#comments</comments>
		<pubDate>Sat, 22 May 2010 08:04:24 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[GIS and geospatial]]></category>
		<category><![CDATA[Microsoft and SQL*Server]]></category>
		<category><![CDATA[SciDB]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2178</guid>
		<description><![CDATA[I firmly believe that, as a community, we should look for ways to support scientific data management and related analytics. That&#8217;s why, for example, I went to XLDB3 in Lyon, France at my own expense. Eight months ago, I wrote about issues in scientific data management. Here&#8217;s some of what has transpired since then.
The main [...]]]></description>
			<content:encoded><![CDATA[<p>I firmly believe that, as a community, we should look for ways to support scientific data management and related analytics. That&#8217;s why, for example, I went to XLDB3 in Lyon, France at my own expense. Eight months ago, I wrote about <a href="http://www.dbms2.com/2009/10/03/issues-in-scientific-data-management/" >issues in scientific data management</a>. Here&#8217;s some of what has transpired since then.</p>
<p>The main new activity I know of has been in the open source <a href="http://www.scidb.org/" onclick="javascript:pageTracker._trackPageview('/www.scidb.org');">SciDB</a> project.   <span id="more-2178"></span></p>
<ul>
<li>A company called Zetics has been started to commercialize SciDB. As of now, the entire staff seems to be CEO Marilyn Matz, techie Paul Brown, and part of Mike Stonebraker. Marilyn says Zetics has some venture capital, but even under NDA didn&#8217;t tell me who it was from. Zetics does not have its own web site.</li>
<li>Marilyn tells me there are 20-25 contributors to SciDB, led by Paul Brown and Mike Stonebraker. Brown is full-time. Persistent Systems has been donating the efforts of a few of its employees. Some <a href="http://www.lsst.org/lsst" onclick="javascript:pageTracker._trackPageview('/www.lsst.org');">LSST</a> folks have been doing SciDB work backed by grant money. Most or all of the rest seem to be purer volunteers. Some Russians have been particularly active.</li>
<li>Release 0.5 of SciDB is expected in June. Release 1.0 is expected in September. This is a rewrite; prior demo code has been scrapped. Perhaps not coincidentally, it&#8217;s also a small slip from prior project plans.</li>
<li>The array data model is an example of what&#8217;s being implemented first. (Duh &#8212; you can&#8217;t have a DBMS without a data model.) Support for uncertainty is an example of what&#8217;s been deferred until later.</li>
<li>As has been clear since XLDB3 last August, one major target market for SciDB is genomic research.</li>
<li>It&#8217;s obvious that the oil and gas industry, with all its geospatial data, should be interested in SciDB. But there&#8217;s not much activity in that regard; outreach is evidently needed. If you can think of somebody in that sector (or anywhere else) who should be alerted to SciDB, please ping them.</li>
<li>Interest from web analytics users in SciDB seems to have receded a bit from the days when eBay almost funded the project.</li>
</ul>
<p>In other scientific data management news,</p>
<ul>
<li>Microsoft put out a book called <a href="http://research.microsoft.com/en-us/collaboration/fourthparadigm/" onclick="javascript:pageTracker._trackPageview('/research.microsoft.com');">The Fourth Paradigm</a> on scientific database management. The whole thing can be downloaded, very officially, as a giant PDF. I think it&#8217;s worth skimming. I don&#8217;t think it&#8217;s worth actually reading. (I did read it.)</li>
<li><a href="http://www-conf.slac.stanford.edu/xldb/" onclick="javascript:pageTracker._trackPageview('/www-conf.slac.stanford.edu');">XLDB4</a> will be at Stanford October 5-7. Unlike prior XLDBs, it will have an open (i.e., no invitation required) part.</li>
</ul>
<p>Finally, you surely are aware of the whole &#8220;Climategate&#8221; mess, in which major climate researchers&#8217; email was hacked and many unkind conclusions were drawn. Well, one of the most technical parts of the disclosure was in a long series of Read Me files, in which an unfortunate programmer lamented about <a href="http://di2.nu/foia/HARRY_READ_ME-20.html" onclick="javascript:pageTracker._trackPageview('/di2.nu');">the difficulty of reconstructing published results from files at hand</a>. These turned out to illustrate a classic problem that SciDB or alternatives are meant to solve:</p>
<ul>
<li>Raw data was impossible to use without various adjustments to regularize it (the word &#8220;regridding&#8221; comes up a lot, for example). Massaging was needed before analytics could be done on it.</li>
<li>The raw data was thrown out or lost, and could not be reconstructed (why they couldn&#8217;t have asked the suppliers of the data to give it to them again was unclear in this case, since it wasn&#8217;t original experimental data).</li>
<li>It was thus impossible to massage the data in any new or improved way.</li>
</ul>
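<p>The underlying discipline can be sketched in a few lines: keep the raw readings untouched and treat every massaged form as a derived result that can be recomputed. This is not SciDB code. The readings are invented, and the cell-averaging function is only a crude stand-in for real regridding methods.</p>

```python
from collections import defaultdict

# Hypothetical raw station readings: (latitude, longitude, temperature_c).
# The raw records are kept as an immutable tuple and never overwritten.
RAW_READINGS = (
    (51.2, 0.4, 11.0),
    (52.4, 0.1, 13.0),
    (48.3, 2.6, 15.0),
)


def regrid(readings, cell_degrees):
    """Average readings into square grid cells of the given size.

    This is a stand-in for "regridding"; real methods are more elaborate.
    """
    cells = defaultdict(list)
    for lat, lon, value in readings:
        cell = (int(lat // cell_degrees), int(lon // cell_degrees))
        cells[cell].append(value)
    return {cell: sum(values) / len(values) for cell, values in cells.items()}


# Because the raw data survives, the same input can be massaged again
# with a different grid whenever the method changes.
COARSE = regrid(RAW_READINGS, cell_degrees=5.0)
FINE = regrid(RAW_READINGS, cell_degrees=1.0)
print(COARSE)
print(FINE)
```

<p>Had only the coarse grid been kept, the fine one could never be recovered from it, which is the third bullet above in miniature.</p>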
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/05/22/scidb-and-scientific-database-management/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>The Clustrix story</title>
		<link>http://www.dbms2.com/2010/05/12/the-clustrix-story/</link>
		<comments>http://www.dbms2.com/2010/05/12/the-clustrix-story/#comments</comments>
		<pubDate>Wed, 12 May 2010 08:53:48 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Application areas]]></category>
		<category><![CDATA[Clustrix]]></category>
		<category><![CDATA[Emulation, transparency, portability]]></category>
		<category><![CDATA[Games and virtual worlds]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Solid-state memory]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2096</guid>
		<description><![CDATA[After my recent post, the Clustrix guys raised their hands and briefed me. Takeaways included:    

Nothing in my original short post about Clustrix was actually incorrect.
Clustrix plans to reveal actual production “name-brand” customers soon.
The name of Clustrix&#8217;s software, or at least the guts thereof, is Sierra.
Clustrix&#8217;s products have actually been in general availability since last [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">After my recent post, the Clustrix guys raised their hands and briefed me. Takeaways included:    <span id="more-2096"></span></p>
<ul>
<li>Nothing in <a href="../2010/05/04/clustrix-may-be-doing-something-interesting/">my 	original short post about Clustrix</a> was actually incorrect.</li>
<li>Clustrix plans to reveal actual 	production “name-brand” customers soon.</li>
<li>The name of Clustrix&#8217;s software, 	or at least the guts thereof, is Sierra.</li>
<li>Clustrix&#8217;s products have actually 	been in general availability since last quarter, with some versions 	at customer sites for 2 years. Development started 3 ½ years ago.</li>
<li>Clustrix says its technology is 	for OLTP systems, which it calls “non-batch/non-analytic,” with 	mixed read/write workloads. All Clustrix&#8217;s example target markets 	are “internet verticals,” such as photo sharing, gaming, social 	media, e-commerce, etc.</li>
<li>Clustrix&#8217;s heart is in SQL, as is 	most of its customer base. Clustrix Sierra&#8217;s key-value-store option 	has little or no performance advantage over Clustrix Sierra&#8217;s SQL 	option, nor any other advantage over SQL that came up in discussion.</li>
<li>Clustrix Sierra is 	“wire-compatible” with MySQL, but doesn&#8217;t use MySQL code; 	Clustrix wrote all the code itself.</li>
<li>Clustrix asserts that Clustrix 	Sierra supports the “vast majority” of MySQL features. Examples 	of MySQL features Clustrix doesn&#8217;t support at this time are 	full-text search and geospatial indexing.</li>
<li>Indeed, Clustrix claims Clustrix 	Sierra can be used to replace MySQL with few or zero changes to 	existing applications.</li>
<li>I specifically asked about 	referential integrity, which has a poor performance reputation in 	MySQL. Besides saying they supported it, Clustrix said that some 	customers actually use referential integrity in some of their less 	active tables.</li>
<li>Clustrix Sierra is fully 	ACID-compliant, with no eventual consistency or <a href="http://www.dbms2.com/2010/05/01/ryw-read-your-writes-consistency/" >RYW consistency</a> story. The default number of copies of each datum is two, and 	they&#8217;re kept consistent via two-phase commit.</li>
<li>Clustrix Sierra is fully parallel, 	with no “head” node. I forgot to ask how it was determined which 	queries would be addressed to and/or controlled by which nodes, but 	I presume there&#8217;s some sort of a load-balancing scheme.</li>
<li>Clustrix says that because 	Clustrix Sierra uses MVCC (Multi-Version Concurrency Control), and 	thus reads and writes don&#8217;t block each other, global locks aren&#8217;t a 	major issue. (They&#8217;re rare or short or something – I have trouble 	seeing why they would be non-existent.)</li>
<li>Clustrix says there&#8217;s a second 	class of locks and latches that are purely local and short-lived, 	for B-tree indexes and the like. (I didn&#8217;t drill down into those 	either.) I guess this means Clustrix Sierra is B-tree-centric, which 	makes sense for an OLTP-oriented system.</li>
<li>Clustrix Sierra distributes data among nodes via consistent hashing (default), range partitioning, or “full distribution” (i.e., copying a – presumably small – table to each node). The choice of distribution plans is manual now; more automation is a future feature.</li>
<li>Clustrix Sierra&#8217;s CBO (Cost-Based 	Optimizer) is, as one would hope, distribution-aware.</li>
<li>Clustrix Sierra compiles query 	fragments and ships them off to the relevant nodes. A fragment might 	contain both instructions for SQL to be executed locally and for 	where data is to be sent next.</li>
<li>Clustrix says that Clustrix Sierra 	does data migration and redistribution (e.g., when you add a node) 	transparently online, and further says that in practice this doesn&#8217;t 	cause a performance hit.</li>
<li>As for Clustrix hardware:
<ul>
<li>Clustrix makes <a href="http://www.monashreport.com/2007/01/29/computing-appliances-trends/" onclick="javascript:pageTracker._trackPageview('/www.monashreport.com');">Type 	I appliances</a>.</li>
<li>A Clustrix node contains 2 quad-core chips, 32 gigs of RAM, and seven 160 GB solid-state drives.</li>
<li>Specifically, Clustrix is using 	Intel SSDs, with a SAS interface.</li>
<li>Clustrix says solid-state memory isn&#8217;t really essential to the product design; it&#8217;s just cheap in terms of $/IOPS (I/O Operations Per Second).</li>
</ul>
</li>
<li>A minimum Clustrix configuration 	is 3 nodes, for redundancy. After that you can add nodes one at a 	time. Clustrix says it built a 20-node system in-house, leading me 	to suspect that customers don&#8217;t have anything bigger than 20 nodes 	either.</li>
<li>That 20-node Clustrix system was 	tested to show near-linear scalability. (In discussing this, 	Clustrix tends to forget to use the word “near”.)</li>
<li>Clustrix has partnered with 	somebody to provide global 4-hour-response support. As of now 	Clustrix seems to be active mainly in North America and Europe.</li>
<li>Clustrix is formed from the 	combination of two startups, which I&#8217;ve heard elsewhere were called 	Clustrix and Sprout. Exactly when the combination happened sounds a 	little different depending on who&#8217;s telling the story (one version 	has the predecessors still being separate well into 2008, but 	Clustrix implies the combination happened pretty much on Day 1).</li>
</ul>
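<p>For readers who haven&#8217;t run into consistent hashing, here is a minimal Python sketch of the general technique, including why adding a node only relocates a fraction of the data. This is not Clustrix&#8217;s code, and I have no details of how Clustrix Sierra actually implements it; the <code>HashRing</code> class, the node names, the use of MD5, and the virtual-node count are all assumptions made up for illustration.</p>

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a point on the ring (a 128-bit integer)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring. Names and defaults are hypothetical."""

    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes   # ring points per node (illustrative value)
        self._points = []      # sorted ring positions
        self._owner = {}       # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Give each node several points on the ring so load spreads out.
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = node

    def node_for(self, key):
        # Owner is the first ring point clockwise from the key's hash,
        # wrapping around to the start of the ring at the end.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


keys = [f"row-{i}" for i in range(10000)]

# Start with the minimum three-node configuration.
ring = HashRing(["node-1", "node-2", "node-3"])
before = {k: ring.node_for(k) for k in keys}

# Add one node and see which keys changed owner.
ring.add_node("node-4")
after = {k: ring.node_for(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved")
```

<p>In this sketch, the only keys that change owner are the ones that land on the newly added node; everything else stays where it was. That property is what makes adding nodes one at a time plausible without reshuffling the whole database.</p>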
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/05/12/the-clustrix-story/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>
