<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; Parallelization</title>
	<atom:link href="http://www.dbms2.com/category/parallelization/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Thu, 02 Sep 2010 09:06:44 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>More on NoSQL and HVSP (or OLRP)</title>
		<link>http://www.dbms2.com/2010/08/26/nosql-hvsp-olrp/</link>
		<comments>http://www.dbms2.com/2010/08/26/nosql-hvsp-olrp/#comments</comments>
		<pubDate>Thu, 26 Aug 2010 09:10:31 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Akiban]]></category>
		<category><![CDATA[Basho and Riak]]></category>
		<category><![CDATA[Cache]]></category>
		<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Clustrix]]></category>
		<category><![CDATA[CouchDB]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Object]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Riptano]]></category>
		<category><![CDATA[Schooner]]></category>
		<category><![CDATA[Theory and architecture]]></category>
		<category><![CDATA[Tokutek]]></category>
		<category><![CDATA[memcached]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2907</guid>
		<description><![CDATA[Since posting last Wednesday morning that I&#8217;m looking into NoSQL and HVSP, I&#8217;ve had a lot of conversations, including with (among others):

Dwight Merriman of 10gen (MongoDB)
Damien Katz of Couchio (CouchDB)
Matt Pfeil of Riptano (Cassandra)
Todd Lipcon of Cloudera (HBase committer)
Tony Falco of Basho (Riak)
John Busch of Schooner
Ori Herrnstadt of Akiban

By no means do I have time [...]]]></description>
			<content:encoded><![CDATA[<p>Since posting last Wednesday morning that <a href="http://www.dbms2.com/2010/08/18/nosql-hvsp-adoption/" >I&#8217;m looking into NoSQL and HVSP</a>, I&#8217;ve had a lot of conversations, including with (among others):</p>
<ul>
<li>Dwight Merriman of 10gen (MongoDB)</li>
<li>Damien Katz of Couchio (CouchDB)</li>
<li>Matt Pfeil of <a href="http://www.dbms2.com/2010/07/06/riptano-and-cassandra-adoption/" >Riptano</a> (Cassandra)</li>
<li>Todd Lipcon of Cloudera (HBase committer)</li>
<li>Tony Falco of Basho (Riak)</li>
<li>John Busch of Schooner</li>
<li><strong><span style="font-weight: normal;">Ori Herrnstadt</span></strong> of <a href="http://www.dbms2.com/2010/04/03/akiban-highlights/" >Akiban</a></li>
</ul>
<p><span id="more-2907"></span>By no means do I have time to do these conversations justice, in terms of giving them the write-ups and/or immediate follow-up that they deserve. Indeed, I&#8217;ll leave for vacation Saturday morning with my 2000-word NoSQL article still unwritten. So I&#8217;ll dump as many observations as I can into one or a few posts now, and play catch-up later as circumstances allow.</p>
<p>In no particular order:</p>
<ul>
<li>A number of NoSQL offerings have had more uptake to date than most of the scale-out SQL offerings have.</li>
<li>&#8220;Document-oriented&#8221; NoSQL projects CouchDB and MongoDB have probably had the most users get into production, but perhaps for pretty small systems.</li>
<li>Cassandra and HBase &#8212; the column-group-architecture guys &#8212; have probably had the most bang-in-lots-of-writes <a href="http://www.dbms2.com/2010/03/13/the-naming-of-the-foo/" >HVSP</a> production uptake.*</li>
<li>I didn&#8217;t talk customer count with Schooner, but the decently-stocked <a href="http://www.schoonerinfotech.com/customers" onclick="javascript:pageTracker._trackPageview('/www.schoonerinfotech.com');">Schooner customer page</a> suggests Schooner may be something of an exception to these generalities.</li>
<li>A lot of these companies are in the low-to-mid-teens of employees.</li>
<li>The SQL-oriented companies, despite having fewer or no customers, often seem to have more money. (One reason I get the impression SQL guys have more money is, frankly, that more of them are talking about engaging <a href="http://www.monash.com/advantage.html" onclick="javascript:pageTracker._trackPageview('/www.monash.com');">my services</a>.)
<ul>
<li>Schooner cites $20 million in VC.</li>
<li><a href="http://www.dbms2.com/2010/05/12/the-clustrix-story/" >Clustrix</a> cites a figure close to that.</li>
<li>Basho cites $10 million, plus <a href="http://www.masshightech.com/stories/2010/08/02/daily35-Basho-rejects-VC-takes-late-friends-and-family-round.html" onclick="javascript:pageTracker._trackPageview('/www.masshightech.com');">a new round of $1.5 or $2 or $2.5 million</a>. The new round is at a lowered valuation.</li>
<li>That same site says <a href="http://www.dbms2.com/2009/04/16/introduction-to-tokutek/" >Tokutek</a> finally was able to <a href="http://www.masshightech.com/stories/2010/08/16/daily47-Database-software-firm-Tokutek-lands-28M.html" onclick="javascript:pageTracker._trackPageview('/www.masshightech.com');">raise some VC</a>. Congrats!</li>
</ul>
</li>
<li>It&#8217;s only a two-company trend, but I was pleased to hear that both 10gen/MongoDB and Akiban were seeing Drupal as a major use case or potential use case. No word on rescuing WordPress from its MySQL implementation, alas, but it seems that a Drupal site typically has 40-200+ tables, while a WordPress one has 10ish.</li>
<li>Another trend I think I&#8217;m seeing is serious object-oriented apps banging things straight into a simple back end. <a href="http://www.dbms2.com/2010/08/22/workday-stan-swete-database-architecture/" >Workday</a> is a huge example of that. Akiban hopes to do something similar with Hibernate.</li>
<li>Stability and maturity are still issues for many of these products. E.g., HBase isn&#8217;t even in Release 1.0 yet. Ditto Cassandra, and surely many of the others. Unsurprisingly, <a href="http://blog.mikiobraun.de/2010/08/cassandra-gc-tuning.html" onclick="javascript:pageTracker._trackPageview('/blog.mikiobraun.de');">making Cassandra stable is still a challenge</a>.</li>
</ul>
<p><em>*As is common for terms I suggest, the &#8220;HVSP&#8221; name is not getting any traction. What do you think of Marton Trencseni&#8217;s suggestion of <a href="http://www.dbms2.com/2010/03/13/the-naming-of-the-foo/#comment-182138" >OLRP, for OnLine Request Processing</a>?</em></p>
<p>One thing that makes following this area interesting is that so many projects are open source, so there is a lot of information in the wild. I hardly have time to read the mailing list for each project; but the people I talk with do, and often they may sorta kinda remember something somebody else posted one or several months back. As just one example, the mailing lists are said to confirm:</p>
<ul>
<li>Contrary to rumor, <a href="http://twitter.com/eventcloudpro/status/17872687577" onclick="javascript:pageTracker._trackPageview('/twitter.com');">Facebook hasn&#8217;t moved in-box search off of Cassandra</a>.</li>
<li>Apparently, however, it&#8217;s true that <a href="http://www.dbms2.com/2008/07/21/project-cassandra-facebook-open-sourced-quasi-dbms/" >Cassandra inventor Facebook</a> has stopped working on Cassandra, and Facebook&#8217;s core Cassandra developers have shifted over to HBase.</li>
</ul>
<p>Also, figuring out usage of open source software can be &#8230; interesting.</p>
<ul>
<li>People who use open source software don&#8217;t have to reveal themselves, as there&#8217;s no purchase transaction to kick things off.</li>
<li>On the other hand, if they&#8217;re serious enough in their use, they often do.
<ul>
<li>There are two main ways to get tech support for open source software &#8212; the community or a company that sells support &#8212; and both ways let the main support-selling company know that one is a user.</li>
<li>Some folks even add themselves to open lists of users, for example these rather long lists for <a href="http://wiki.apache.org/hadoop/Hbase/PoweredBy" onclick="javascript:pageTracker._trackPageview('/wiki.apache.org');">HBase</a> and <a href="http://wiki.apache.org/couchdb/CouchDB_in_the_wild" onclick="javascript:pageTracker._trackPageview('/wiki.apache.org');">CouchDB</a>.</li>
<li>Or they show up at conferences. For example, <a href="http://twitter.com/spyced/status/21490457839" onclick="javascript:pageTracker._trackPageview('/twitter.com');">two</a> <a href="http://twitter.com/spyced/status/21675203015" onclick="javascript:pageTracker._trackPageview('/twitter.com');">tweets</a> from Riptano founder Jonathan Ellis suggest at least 30 production Cassandra users were represented at a recent event. That&#8217;s more detail than his colleague Matt Pfeil wanted to give me when we talked. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </li>
</ul>
</li>
</ul>
<p>OK. This post has gotten pretty long, even without me saying anything resembling an overview of any of the seven companies I listed up top, or of their products&#8217; adoption. So I&#8217;ll just publish this now, and edit in links below to follow-on posts if and when they become available.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/08/26/nosql-hvsp-olrp/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The substance of Pentaho&#8217;s Hadoop strategy</title>
		<link>http://www.dbms2.com/2010/08/21/the-substance-of-pentahos-hadoop-strategy/</link>
		<comments>http://www.dbms2.com/2010/08/21/the-substance-of-pentahos-hadoop-strategy/#comments</comments>
		<pubDate>Sat, 21 Aug 2010 06:40:29 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Pentaho]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2848</guid>
		<description><![CDATA[Pentaho has been talking about a Hadoop-related strategy. Unfortunately, in support of its Hadoop efforts, Pentaho has been &#8212; quite insistently &#8212; saying things that don&#8217;t make a lot of sense to people who know anything about Hadoop.
That said, I think I found four sensible points in Pentaho&#8217;s Hadoop strategy, namely:

If you use an ETL [...]]]></description>
			<content:encoded><![CDATA[<p>Pentaho has been talking about a Hadoop-related strategy. Unfortunately, in support of its Hadoop efforts, Pentaho has been &#8212; quite insistently &#8212; saying things that don&#8217;t make a lot of sense to people who know anything about Hadoop.</p>
<p>That said, I think I found four sensible points in Pentaho&#8217;s Hadoop strategy, namely:</p>
<ol>
<li>If you use an ETL tool like Pentaho&#8217;s to move things in and out of HDFS, you may be able to orchestrate two more steps in the ETL process than if you used Hadoop&#8217;s native orchestration tools.</li>
<li>A lot of what you want to do in MapReduce is things that can be graphically specified in an ETL tool like Pentaho&#8217;s. (That would include tokenization or regex.)</li>
<li>If you have some really lightweight BI requirements (ad hoc, reporting, or whatever) against HDFS data, you might be content to do it straight against HDFS, rather than moving the data into a real DBMS. If so, BI tools like Pentaho&#8217;s might be useful.</li>
<li>Somebody might want to use a screwy version of MapReduce, where by &#8220;screwy&#8221; I mean anything that isn&#8217;t <a href="http://www.dbms2.com/2010/06/30/cloudera-enterprise-hadoop-evolution/" >Cloudera Enterprise</a>, <a href="http://www.dbms2.com/2009/12/02/mapreduce-for-complex-analytics-webina/" >Aster Data SQL/MapReduce</a>, or some other implementation/distribution with a lot of supporting tools. In that case, they might need all the tools they can get.</li>
</ol>
<p>The first of those points is, in the grand scheme of things, pretty trivial.</p>
<p>The third one makes sense. While Hadoop&#8217;s Hive client means you could roll your own integration with your own favorite BI tool in any case, having somebody else certify it for you could be nice. So if Pentaho ships something that works before other vendors do, good on them. (Target date seems to be October.)</p>
<p>The fourth one is kind of sad.</p>
<p>But if there&#8217;s any shovel-meet-pony aspect to all this &#8212; or indeed a reason for writing this blog post &#8212; it would be the second point. If one understands data management, but is in the &#8220;Oh no! Hadoop wants me to PROGRAM!&#8221; crowd, then being able to specify one&#8217;s MapReduce graphically might be a really nice alternative to having to actually code it.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/08/21/the-substance-of-pentahos-hadoop-strategy/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>I&#8217;m collecting data points on NoSQL and HVSP adoption</title>
		<link>http://www.dbms2.com/2010/08/18/nosql-hvsp-adoption/</link>
		<comments>http://www.dbms2.com/2010/08/18/nosql-hvsp-adoption/#comments</comments>
		<pubDate>Wed, 18 Aug 2010 13:09:08 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Akiban]]></category>
		<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Clustrix]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Groovy Corporation]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Northscale]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[ScaleDB]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[VoltDB and H-Store]]></category>
		<category><![CDATA[dbShards and CodeFutures]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2840</guid>
		<description><![CDATA[I was asked to do a magazine article on NoSQL, where by &#8220;NoSQL&#8221; is meant &#8220;whatever they talk about at NoSQL conferences.&#8221; By now the number of publications planning to run the article is up to 2, the deadline is next week and, crucially, it has been agreed that I may talk about HVSP in [...]]]></description>
			<content:encoded><![CDATA[<p>I was asked to do a magazine article on NoSQL, where by &#8220;NoSQL&#8221; is meant &#8220;whatever they talk about at NoSQL conferences.&#8221; By now the number of publications planning to run the article is up to 2, the deadline is next week and, crucially, it has been agreed that I may talk about <a href="http://www.dbms2.com/2010/03/13/the-naming-of-the-foo/" >HVSP</a> in general, NoSQL and SQL alike.</p>
<p>It also is understood that, realistically, I can&#8217;t be expected to know and mention the very latest news for all the many products in the categories. Even so, I think this would be a fine time to check just where NoSQL and HVSP adoption stand. Here is most of what I know, or links to same; it would be great if you guys would contribute additional data in the comment thread.</p>
<p>In the NoSQL area:  <span id="more-2840"></span></p>
<ul>
<li>Back in April, the VoltDB guys told me they thought Cassandra and HBase were the two NoSQL systems with the most momentum.</li>
<li>I know distressingly little about HBase adoption, but a source who may or may not wish to remain anonymous was kind enough to alert me that Twitter and StumbleUpon each have ~30 node deployments, for analytics and analytics/HVSP respectively.</li>
<li>I wrote in detail on <a href="http://www.dbms2.com/2010/07/06/riptano-and-cassandra-adoption/" >Cassandra adoption</a> last month. News since then includes:
<ul>
<li>Facebook is rumored to have dropped Cassandra completely.</li>
<li><a href="http://engineering.twitter.com/2010/07/cassandra-at-twitter-today.html" onclick="javascript:pageTracker._trackPageview('/engineering.twitter.com');">Twitter clarified that it may not be quite as lovestruck by Cassandra as before</a>, but they&#8217;re still very close friends.</li>
<li>It&#8217;s not obvious that the <a href="http://www.riptano.com/blog/cassandra-summit-recap" onclick="javascript:pageTracker._trackPageview('/www.riptano.com');">Cassandra Summit</a> unveiled a lot of new adoption stories.</li>
</ul>
</li>
<li>Northscale&#8217;s <a href="http://www.dbms2.com/2010/08/18/northscale-membase-roadmap/" >Membase</a> is still in its early days.  Zynga is bought in, however, as is something called NHN Korea. <em>(Edit: I subsequently saw NHN Korea on a prominent SEO expert&#8217;s list of the top half dozen or so search engines in the world. Who knew?)</em></li>
<li>Basho has listed a few <a href="http://www.basho.com/customers.html" onclick="javascript:pageTracker._trackPageview('/www.basho.com');">Riak customers</a>. If memory serves (I haven&#8217;t spoken with Basho for a while, and some of my notes are misplaced due to some computer sloppiness), Basho has a few dozen customers in total.</li>
<li>Mozilla has <a href="http://blog.mozilla.com/data/2010/08/16/benchmarking-riak-for-the-mozilla-test-pilot-project/" onclick="javascript:pageTracker._trackPageview('/blog.mozilla.com');">a 4 machine, 64 core Riak cluster</a> in production.</li>
<li><a href="http://highscalability.com/hypertable-new-bigtable-clone-runs-hdfs-or-kfs" onclick="javascript:pageTracker._trackPageview('/highscalability.com');">Hypertable</a> has a few users/project sponsors, Baidu being the biggest name among them.</li>
<li>I don&#8217;t really know how the MongoDB/10gen guys are doing. I think this is at least as much my fault as theirs. Anyhow, they seem to have <a href="http://www.10gen.com/news" onclick="javascript:pageTracker._trackPageview('/www.10gen.com');">links</a> to a couple of folks who have written about MongoDB usage.</li>
<li>NimbusDB is still in stealth mode. I&#8217;d be surprised if they had users  for a while yet, since in January they didn&#8217;t yet sound as if  development was very far underway. (Actually, I forget whether NimbusDB  is supposed to be SQL-based or not.)</li>
</ul>
<p>Among the SQL or SQL-friendly guys:</p>
<ul>
<li><a href="http://www.dbms2.com/2010/05/12/the-clustrix-story/" >Clustrix</a> says it has a few production users, some big-name, but is not disclosing them yet.</li>
<li><a href="http://www.dbms2.com/2010/07/28/dbshards/" >dbShards has around 6 customers</a>, including Facebook. (Facebook may outpace even Twitter and Zynga in using the most products mentioned in this post.)</li>
<li>As of May, <a href="http://www.dbms2.com/2010/05/25/voltdb-finally-launches/" >VoltDB</a> had one paying customer, plus 150 beta customers who weren&#8217;t in production yet.</li>
<li><a href="http://www.dbms2.com/2010/04/03/akiban-highlights/" >Akiban</a> says they&#8217;ll get me up to speed on Thursday. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </li>
<li><a href="http://www.dbms2.com/2008/04/13/scaledb-presents-the-revenge-of-the-pointer/" >ScaleDB</a> seems to be pedaling along in perennial beta. Whether ScaleDB has any actual beta users is less clear. On the plus side, checking that out uncovered a pretty funny <a href="http://scaledb.blogspot.com/2010/04/scaledb-introduces-clustered-database.html" onclick="javascript:pageTracker._trackPageview('/scaledb.blogspot.com');">April Fool blog post</a>.</li>
<li><a href="http://www.dbms2.com/2009/07/30/groovy-corp-puts-out-a-ridiculous-press-release/" >Groovy Corporation</a> seems to have disappeared, or morphed into something called <a href="http://www.groovycorp.com/home.html" onclick="javascript:pageTracker._trackPageview('/www.groovycorp.com');">uCirrus</a>, or something like that.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/08/18/nosql-hvsp-adoption/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
		<item>
		<title>Finally confirmed: Membase has a reasonable product roadmap</title>
		<link>http://www.dbms2.com/2010/08/18/northscale-membase-roadmap/</link>
		<comments>http://www.dbms2.com/2010/08/18/northscale-membase-roadmap/#comments</comments>
		<pubDate>Wed, 18 Aug 2010 09:37:55 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Northscale]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[memcached]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2830</guid>
		<description><![CDATA[On my recent trip to California, neither I nor my clients at Northscale covered ourselves in meeting-arranging glory. Still, from the rushed 30 minute meeting we did wind up having, I finally came away feeling good about Membase&#8217;s product direction.
To review, Membase is a reasonably elastic persistent data store, sporting the memcached API, making memcached/Membase [...]]]></description>
			<content:encoded><![CDATA[<p>On my recent trip to California, neither I nor my clients at Northscale covered ourselves in meeting-arranging glory. Still, from the rushed 30 minute meeting we did wind up having, I finally came away feeling good about Membase&#8217;s product direction.</p>
<p>To review, Membase is a reasonably elastic persistent data store, sporting the memcached API, making memcached/Membase an attractive alternative to memcached/sharded MySQL. As of now, Membase is a pure key-value store.</p>
<p>Northscale defends pure key-value stores by arguing, in effect:  <span id="more-2830"></span></p>
<ul>
<li>You can do a lot with entity-attribute-value triples.</li>
<li>If your key looks like an entity-attribute concatenation, then  your entity-attribute-value triple can be transformed into a key-value  pair.</li>
</ul>
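<p>As a minimal sketch of that argument (not Northscale&#8217;s code or the Membase API), the Python below flattens entity-attribute-value triples into key-value pairs by concatenating entity and attribute into one key. The <code>:</code> separator, the function names, and the sample data are all assumptions made for illustration, and a plain dict stands in for the key-value store.</p>
<pre><code class="language-python">def make_key(entity, attribute, sep=":"):
    """Concatenate entity and attribute into one key (the separator is an assumption)."""
    return f"{entity}{sep}{attribute}"


def triples_to_pairs(triples):
    """Turn (entity, attribute, value) triples into a flat key-value mapping."""
    return {make_key(entity, attribute): value for entity, attribute, value in triples}


# Hypothetical sample data; a dict stands in for the key-value store.
triples = [
    ("user42", "name", "Alice"),
    ("user42", "city", "Boston"),
    ("user43", "name", "Bob"),
]
store = triples_to_pairs(triples)
print(store["user42:city"])
</code></pre>
<p>Looking up one attribute of one entity is a single get. A question such as &#8220;what are all the attributes of user42?&#8221; has no direct answer in a pure get/set interface, which is one way to see why a somewhat richer model may be wanted.</p>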
<p>Northscale has a point. Still, I think that in most use cases you&#8217;ll want a data model and/or data access methods that are at least a little richer than pure entity-attribute-value.</p>
<p>Fortunately, that&#8217;s the direction Northscale is taking Membase. I don&#8217;t get the impression that the details have been worked out yet, but the general idea is:</p>
<ul>
<li>Northscale is putting a publish-subscribe interface into Membase it calls &#8220;tap,&#8221; useful for replication, node rebalancing, etc.</li>
<li>Tap will also serve to connect Membase data to a Membase feature Northscale calls “Node Code,&#8221; which will be code that runs in a separate process on each Membase node.</li>
<li>Node Code will include things like:
<ul>
<li>Language run-times</li>
<li>Standard libraries for things like index-building</li>
</ul>
</li>
</ul>
<p>Will Membase Node Code be a close substitute for relational DBMS functionality, or even the <a href="http://www.dbms2.com/2010/07/06/cassandra-technical-overview/" >Cassandra</a> architecture? I doubt it, especially at first. But at least it will keep Membase developers from getting locked in to a very simple and restrictive data management paradigm.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/08/18/northscale-membase-roadmap/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Big Data is Watching You!</title>
		<link>http://www.dbms2.com/2010/08/11/big-data-is-watching-you/</link>
		<comments>http://www.dbms2.com/2010/08/11/big-data-is-watching-you/#comments</comments>
		<pubDate>Wed, 11 Aug 2010 05:30:22 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2760</guid>
		<description><![CDATA[There&#8217;s a boom in large-scale analytics. The subjects of this analysis may be categorized as:

People
Financial trades
Electronic networks
Everything else

The most varied, interesting, and valuable of those four categories is the first one.

That may change some day, with the growing importance of machine-generated data, and of big-data science in particular. But I think it&#8217;s a fair assessment [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">There&#8217;s a boom in large-scale analytics. The subjects of this analysis may be categorized as:</p>
<ul>
<li>People</li>
<li>Financial trades</li>
<li>Electronic networks</li>
<li>Everything else</li>
</ul>
<p style="margin-bottom: 0in;">The most varied, interesting, and valuable of those four categories is the first one.</p>
<p><span id="more-2760"></span></p>
<p style="margin-bottom: 0in;"><em>That may change some day, with the growing importance of <a href="http://www.dbms2.com/2010/04/08/machine-generated-data-example/" >machine-generated data</a>, and of <a href="http://www.dbms2.com/2009/10/03/issues-in-scientific-data-management/" >big-data science</a> in particular. But I think it&#8217;s a fair assessment at the present, and for at least the next few years.</em></p>
<p style="margin-bottom: 0in;">Some of the most interesting use cases are concentrated in the areas of identifying individuals, groups of people, or behaviors of (groups of) people. For example:</p>
<ul>
<li>comScore works hard to <strong>identify individual web surfers</strong> (i.e., to <strong>deanonymize</strong> them), even though they may have given incomplete or false personal information.</li>
<li>Other companies at least try to figure out <strong>which information in a user&#8217;s profile is unreliable,</strong> so as to classify them better. (Yes, there are 62-year-old video-game-obsessed Lady Gaga fans, but that&#8217;s generally not the way to bet.)</li>
<li>Multiple telecom vendors try to identify who their <strong>most influential customers</strong> are (to a first approximation, they&#8217;re the ones most often called by the most people, but it surely gets more sophisticated than that). This information is then used to reduce churn, either by working hard to retain those users or, if they do churn, by moving very fast to retain the business from their friends.</li>
<li>Other kinds of companies do similar kinds of analysis, to the extent that they have enough of a social graph to do so. (This application is a case where the term “<a href="http://www.dbms2.com/2010/06/08/profile-of-revealed-preferences/" >social graph</a>” is not a misnomer.)</li>
<li><strong>Turing detectives</strong> (I just coined that phrase) try to determine whether users are humans or bots.</li>
<li>Central to detecting <strong>insurance fraud</strong> is identifying suspiciously close connections between claimants, service providers, and so on.</li>
<li>Identifying groups of people is also important in flagging <strong>insider trading.</strong> Even more important are other kinds of analysis, along the lines of “is this normal innocent trading behavior?”</li>
<li>Intelligence agencies try to detect networks of <strong>terrorists</strong> and their sympathizers. They further try to identify unusual patterns of communication or meetings along those networks that might indicate terrorist acts are being planned. (Civilian law enforcement agencies can use similar techniques.)</li>
</ul>
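<p>To make the telecom example in the list above concrete, here is a rough Python sketch of the first-approximation rule, reading &#8220;called by the most people&#8221; as the number of distinct callers per customer. The record format, the function name, and the sample calls are hypothetical, and real systems surely use more sophisticated scoring.</p>
<pre><code class="language-python">from collections import defaultdict


def rank_by_distinct_callers(call_records):
    """Rank customers by how many distinct people called them (ties broken by name)."""
    callers = defaultdict(set)
    for caller, callee in call_records:
        callers[callee].add(caller)
    return sorted(callers, key=lambda customer: (-len(callers[customer]), customer))


# Hypothetical call detail records as (caller, callee) pairs.
calls = [
    ("ann", "dan"),
    ("bob", "dan"),
    ("cat", "dan"),
    ("ann", "bob"),
    ("ann", "bob"),  # repeat calls from the same person count once
    ("dan", "bob"),
]
ranking = rank_by_distinct_callers(calls)
print(ranking)
</code></pre>
<p>Counting distinct callers rather than raw call volume is a design choice in this sketch; weighting by call frequency would be an equally plausible reading of the rule.</p>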
<p style="margin-bottom: 0in; font-weight: normal;">In most cases, the analysis and/or run-time execution of the relevant models is done with the help of analytic DBMS. Other technologies that come into play include non-DBMS MapReduce (Hadoop), graph engines, and CEP (Complex Event Processing). The vendor most heavily represented on that list is probably Aster Data, because:</p>
<ul>
<li>Aster Data is focused on hard-core analytics.</li>
<li>I talk a lot with Aster Data, and in particular had a long, detailed use-cases discussion with them last week.</li>
<li>The comScore example happens to come from a speaker at <a href="http://www.dbms2.com/2010/05/07/implications-onew-analytic-technology/" >an Aster event</a> I also participated in.</li>
</ul>
<p style="margin-bottom: 0in;">And by the way, all this only scratches the surface of what will be possible down the road. It&#8217;s based mainly on where you live, what you purchase, how you behave on websites, and who you communicate with. <a href="../2010/07/04/fair-data-use/">Other kinds of data, which could be used to be yet more intrusive</a>, generally aren&#8217;t involved.</p>
<p style="margin-bottom: 0in;">I actually have two points in drawing up this list. One is golly-gee-whiz about how a lot of analytically sophisticated applications are actually getting into production. The other is to highlight the privacy and liberty threats If This Goes On Unchecked (which is why I didn&#8217;t include some other less-people-focused examples). There&#8217;s also a related danger that, to the extent we don&#8217;t get some smart regulations to keep us safe(r), we&#8217;ll get a bunch of stupid regulations instead.</p>
<p style="margin-bottom: 0in;">The Analytic Era has only just begun.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/08/11/big-data-is-watching-you/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Teradata, Xkoto Gridscale (RIP), and active-active clustering</title>
		<link>http://www.dbms2.com/2010/07/31/teradata-xkoto-gridscale-rip-and-active-active-clustering/</link>
		<comments>http://www.dbms2.com/2010/07/31/teradata-xkoto-gridscale-rip-and-active-active-clustering/#comments</comments>
		<pubDate>Sat, 31 Jul 2010 08:23:57 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Continuent]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[Theory and architecture]]></category>
		<category><![CDATA[Xkoto]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2708</guid>
		<description><![CDATA[Having gotten a number of questions about Teradata&#8217;s acquisition of Xkoto, I leaned on Teradata for an update, and eventually connected with Scott Gnau. Takeaways included:

Teradata is discontinuing  Xkoto&#8217;s existing product Gridscale, which 	Scott characterized as being too OLTP-focused to be a good fit for 	Teradata. Teradata hopes and expects that existing Xkoto Gridscale [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Having gotten a number of questions about Teradata&#8217;s acquisition of Xkoto, I leaned on Teradata for an update, and eventually connected with Scott Gnau. Takeaways included:</p>
<ul>
<li>Teradata is discontinuing <a href="http://www.dbms2.com/2009/09/11/xkoto-gridscale-highlights/" >Xkoto&#8217;s existing product Gridscale</a>, which Scott characterized as being too OLTP-focused to be a good fit for Teradata. Teradata hopes and expects that existing Xkoto Gridscale customers won&#8217;t renew maintenance. (I&#8217;m not sure that they&#8217;ll even get the option to do so.)</li>
<li>The point of Teradata&#8217;s technology 	+ engineers acquisition of Xkoto is to enhance Teradata&#8217;s 	active-active or multi-active data warehousing capabilities, which 	it has had in some form for several years.</li>
<li>In particular, Teradata wants to 	tie together different products in the Teradata product line. (Note: 	Those typically all run pretty much the same Teradata database 	management software, except insofar as they might be on different 	releases.)</li>
<li>Scott rattled off all the 	plausible areas of enhancement, with multiple phrasings – 	performance, manageability, ease of use, tools, features, etc.</li>
<li>Teradata plans to have one or two 	releases based on Xkoto technology in 2011.</li>
</ul>
<p style="margin-bottom: 0in;">Frankly, I&#8217;m disappointed at the struggles of clustering efforts such as Xkoto Gridscale or <a href="http://www.dbms2.com/2009/09/03/continuent-on-clustering/" >Continuent&#8217;s pre-Tungsten products</a>, but if the DBMS vendors meet the same needs themselves, that&#8217;s OK too.</p>
<p style="margin-bottom: 0in;">The logic behind active-active database implementations actually seems pretty compelling:  <span id="more-2708"></span></p>
<ul>
<li>You may well be keeping a second 	copy of your database for high availability/hot standby.</li>
<li>You might even be keeping a third 	copy for off-site disaster recovery.</li>
<li>In some cases, you might have 	reasons beyond disaster recovery to distribute a database around the 	world.</li>
<li>So why not allow queries to be run 	against all the copies?</li>
<li>And by the way, splitting the 	workload up a bit by kinds (e.g., long-running vs. short query) 	might let you optimize the implementation of each copy of the 	database. (This last point becomes even more important with the rise 	of solid-state memory.)</li>
</ul>
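<p style="margin-bottom: 0in;">To make the last two bullets concrete, here is a minimal sketch of splitting a workload by kind across several copies of a database. The copy names, the run-time estimate, and the 60-second cutoff are all hypothetical; nothing here describes how Teradata or Xkoto actually route queries.</p>

```python
# Hypothetical sketch: route queries across active-active copies by workload kind.
# Copy names, the run-time estimate, and the threshold are illustrative assumptions.
from itertools import cycle

COPIES = {
    "short": ["primary", "hot_standby"],  # copies tuned for short queries
    "long": ["dr_site"],                  # copy tuned for long-running queries
}
LONG_QUERY_SECONDS = 60  # assumed cutoff between short and long-running work

# One round-robin iterator per workload kind.
_round_robin = {kind: cycle(names) for kind, names in COPIES.items()}


def classify(estimated_seconds):
    """Label a query by its estimated run time."""
    return "long" if estimated_seconds >= LONG_QUERY_SECONDS else "short"


def route(estimated_seconds):
    """Pick the next copy in rotation for this kind of query."""
    return next(_round_robin[classify(estimated_seconds)])
```

<p style="margin-bottom: 0in;">Short queries alternate between the two copies tuned for them, while anything estimated at a minute or more goes to the third copy. A real implementation would also have to deal with replication lag and failover, which this sketch ignores.</p>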
<p style="margin-bottom: 0in;">Analytic DBMS vendors pretty much all need to offer this. (Possible exception: If they have a data-mart-only positioning so extreme that customers will never care about any form of failover.) That said, I must confess to not having done a good job of tracking who does or doesn&#8217;t have which features in this area to date; informative comments to this post in that regard would be much appreciated!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/07/31/teradata-xkoto-gridscale-rip-and-active-active-clustering/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>dbShards &#8212; a lot like an MPP OLTP DBMS based on MySQL or PostgreSQL</title>
		<link>http://www.dbms2.com/2010/07/28/dbshards/</link>
		<comments>http://www.dbms2.com/2010/07/28/dbshards/#comments</comments>
		<pubDate>Wed, 28 Jul 2010 09:39:11 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Facebook]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[dbShards and CodeFutures]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2662</guid>
		<description><![CDATA[I talked yesterday w/ Cory Isaacson, who runs CodeFutures, makers of dbShards.  dbShards is a software layer that turns an ordinary DBMS (currently MySQL or PostgreSQL) into an MPP shared-nothing ACID-compliant OLTP DBMS. Technical highlights included:  

Despite heavy emphasis on the 	word “sharding,” dbShards&#8217;s scale-out is transparent to the 	application programmer. E.g., in dbShards [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">I talked yesterday w/ Cory Isaacson, who runs CodeFutures, makers of dbShards.  dbShards is a software layer that turns an ordinary DBMS (currently MySQL or PostgreSQL) into an MPP shared-nothing ACID-compliant OLTP DBMS. Technical highlights included:  <span id="more-2662"></span></p>
<ul>
<li>Despite heavy emphasis on the word “sharding,” dbShards&#8217; scale-out is transparent to the application programmer. E.g., in dbShards + MySQL, the APIs are more or less the same ones you&#8217;d expect for MySQL (JDBC, etc.).</li>
<li>If the DBMS underneath is 	ACID-compliant (e.g., MySQL + InnoDB), then the dbShards version is 	ACID-compliant too.</li>
<li>Beyond those basics, I forgot to 	check the fine details of dbShards&#8217; MySQL (or PostgreSQL) syntax 	support. <a href="http://highscalability.com/blog/2010/6/23/product-dbshards-share-nothing-shard-everything.html" onclick="javascript:pageTracker._trackPageview('/highscalability.com');">Todd 	Hoff, however, did not forget</a>.</li>
<li>dbShards keeps copies of each 	shard on two different servers, via asynchronous log-shipping. This 	allows for failover in both planned and unplanned outages.</li>
<li>dbShards wants you to distribute 	big tables among shards via a “shard key,” which is a lot like 	the distribution key in MPP analytic DBMS. You&#8217;re encouraged to 	replicate small, low-update-volume tables across each shard.</li>
<li>Cory says that dbShards has good 	join performance when – you guessed it! – everything being joined 	is co-located shard-by-shard, because the tables were distributed on 	the same shard key and/or replicated across each shard. Cory can&#8217;t 	imagine why you&#8217;d want to do an inner join under any other 	circumstances.</li>
<li>The basic dbShards query execution model is: A query comes in; it&#8217;s parsed; a shard key is automagically detected (one hopes); the “global configuration file” is checked to see which shard to ship the work off to. I forgot to ask whether lookup was done via a hash table (the obvious guess) or something else. The programmer can put hints in the code comments to direct the sharding, but Cory asserts those aren&#8217;t needed very often.</li>
<li>Cory says that insert performance 	with dbShards + MySQL + InnoDB is 1500-3000 inserts per shard per 	second, scaling almost linearly with the number of shards. I forgot 	to ask how many shards this had been tested for.</li>
<li>If you want blazing dbShards performance, Cory&#8217;s base-case figure is 25 gigabytes of data per node, so that the most commonly used indexes can camp out in memory. (I forgot to ask what kind of hardware he was assuming per node.) This is if you&#8217;re going to be doing joins or aggregations. If it&#8217;s just single-row inserts and updates, or if your performance requirements are lower, you can go with 10X that figure.</li>
<li>Cory tells stories wherein going 	from an unsharded database to 4 or so shards took database 	re-indexing time down 50X or more.  Apparently, such tasks can be 	exponential or even super-exponential with database size over 	InnoDB. (That said, I&#8217;d be surprised if all large InnoDB users 	suffered from that problem to the same degree.)</li>
<li>dbShards&#8217; customer workloads are 	all &gt;= 50% reads. This is reflective of dbShards&#8217; design 	priorities.</li>
<li>As long as it can be in charge, 	dbShards is happy to interface to whatever kind of database backup 	software you want to use on a node by node basis. (dbShards wants to 	drive your backup software for you so that it can be sure the 	replicas are handled properly.)</li>
<li>It&#8217;s “fairly common” for 	dbShards to be paired with memcached. I forgot to ask whether 	memcached typically lived on its own pool of servers, or on the same 	pool that runs dbShards.</li>
<li>Future DBMS options under 	consideration for dbShards include Oracle and (unspecified) 	in-memory.</li>
</ul>
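<p style="margin-bottom: 0in;">The routing model in the bullets above can be sketched roughly as follows. This is not dbShards code: the hash-based lookup is only the “obvious guess” mentioned above, and the shard names, the replicated table name, and every function name are made up for illustration. The insert-rate helper assumes perfectly linear scaling, which is slightly optimistic relative to the “almost linearly” claim.</p>

```python
# Hypothetical sketch of shard-key routing; not dbShards' actual implementation.
import zlib

SHARDS = ["shard0", "shard1", "shard2", "shard3"]  # assumed four-shard cluster
REPLICATED_TABLES = {"country_codes"}              # small tables copied to every shard


def shard_for(shard_key):
    """Map a shard key to one shard with a stable hash (crc32 of the key text)."""
    return SHARDS[zlib.crc32(str(shard_key).encode("utf-8")) % len(SHARDS)]


def route_read(table, shard_key=None):
    """Return the shards a read must visit."""
    if table in REPLICATED_TABLES:
        return [SHARDS[0]]      # every shard holds a copy; pick one for simplicity
    if shard_key is None:
        return list(SHARDS)     # no shard key detected: fan out to all shards
    return [shard_for(shard_key)]


def is_colocated_join(key_a, key_b):
    """A join stays on one shard when both sides hash to the same shard."""
    return shard_for(key_a) == shard_for(key_b)


def insert_rate_range(shard_count):
    """Inserts per second from the quoted 1500-3000 per shard, assuming linear scaling."""
    return (1500 * shard_count, 3000 * shard_count)
```

<p style="margin-bottom: 0in;">Under these assumptions, two tables distributed on the same shard key always join locally, and four shards would top out somewhere between 6,000 and 12,000 inserts per second.</p>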
<p style="margin-bottom: 0in;">Business highlights for CodeFutures and dbShards include:</p>
<ul>
<li>dbShards&#8217; price is 	$5000/server/year, including support and OEMed MySQL, with stated 	quantity discounts up to 40%.</li>
<li>dbShards cloud pricing is 	different (on a usage basis).</li>
<li>dbShards has 6 or so customers, 	half each on-premises and in the cloud. One of them is Facebook. (Those &#8220;100s&#8221; of customers mentioned on the dbShards website are for a fairly unrelated product.)</li>
<li>CodeFutures has been at this 2 ½ 	years or so. There is no venture capital in the company.</li>
<li>Early dbShards deals have evidently involved a fair amount of professional services.</li>
<li>Counting contractors, CodeFutures has 10-12 people, a figure that has been as high as 15.</li>
<li>Target dbShards customers are as you&#8217;d expect. Cory says he&#8217;s actually been more successful getting early-adopter money out of Web companies than Wall Street firms.</li>
<li>There are a couple of dbShards 	PostgreSQL customers for greenfield applications. Most dbShards 	customers and prospects, however, are looking to scale out existing 	apps.</li>
<li>Despite its connection to open source DBMS, there&#8217;s nothing open source about dbShards itself.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/07/28/dbshards/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Some interesting links</title>
		<link>http://www.dbms2.com/2010/07/23/some-interesting-links/</link>
		<comments>http://www.dbms2.com/2010/07/23/some-interesting-links/#comments</comments>
		<pubDate>Fri, 23 Jul 2010 09:04:48 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[EnterpriseDB and Postgres Plus]]></category>
		<category><![CDATA[Fun stuff]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Humor]]></category>
		<category><![CDATA[In-memory DBMS]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Memory-centric data management]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[SAP AG]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2626</guid>
		<description><![CDATA[In no particular order:  

Neil Raden points out that business intelligence dashboards can be dangerously misleading. His reasoning (sound) is that whatever you measure is apt to be distorted by the fact people know they&#8217;re being measured. His solution (implied) is to hire a good-looking consultant like himself to do it right.
I&#8217;ve had my issues [...]]]></description>
			<content:encoded><![CDATA[<p>In no particular order:  <span id="more-2626"></span></p>
<ul>
<li>Neil Raden points out that <a href="http://www.b-eye-network.com/channels/5083/view/9618/" onclick="javascript:pageTracker._trackPageview('/www.b-eye-network.com');">business intelligence dashboards can be dangerously misleading</a>. His reasoning (sound) is that whatever you measure is apt to be distorted by the fact people know they&#8217;re being measured. His solution (implied) is to hire a <a href="http://twitter.com/NeilRaden/status/19110492482" onclick="javascript:pageTracker._trackPageview('/twitter.com');">good-looking</a> consultant like himself to do it right.</li>
<li>I&#8217;ve had my issues with Fred Holahan, who was VP of Marketing when I posted that <a href="http://www.dbms2.com/2009/04/20/first-thoughts-on-oracle-acquiring-sun/" >EnterpriseDB was not to be trusted</a>. (That said, Fred is long gone from EnterpriseDB and my opinion hasn&#8217;t changed.) But he&#8217;s put up a good series of posts on the basis of the open source &#8220;progressive engagement&#8221; marketing funnel, including this gem on <a href="http://opensourceadvisory.com/wordpress/?p=860" onclick="javascript:pageTracker._trackPageview('/opensourceadvisory.com');">why you shouldn&#8217;t count on monetizing your community/free users</a>.</li>
<li><a href="http://tech.fortune.cnn.com/2010/07/22/oracle-plans-to-double-acquisition-budget/" onclick="javascript:pageTracker._trackPageview('/tech.fortune.cnn.com');">Oracle plans to increase its acquisition budget</a>. The figure given is $70 billion over the next 5 years. <em>Edit: But see this funny <a href="http://www.theregister.co.uk/2010/07/23/oracle_acquisition_budget/" onclick="javascript:pageTracker._trackPageview('/www.theregister.co.uk');">Register</a> followup.</em></li>
<li>Clayton Christensen wrote a phenomenal article on <a href="http://hbr.org/2010/07/how-will-you-measure-your-life/ar/1" onclick="javascript:pageTracker._trackPageview('/hbr.org');">how to live a good life</a>, from a very business-y perspective. (Only in one anecdote was it too religiously-oriented for my tastes.) Takeaways include:
<ul>
<li>Your core goals probably revolve around something other than business success. (E.g., family.) Don&#8217;t lose sight of that.</li>
<li>To the extent you&#8217;re a manager or leader, you may have a huge impact on other people&#8217;s lives. Use that power in admirable ways.</li>
<li>Teach people how to fish for answers, rather than just giving them answers. They&#8217;ll probably come to better conclusions than you would have anyway. (This is a core principle in my own consulting.)</li>
<li>Take time to reflect. And by the way, the same techniques you use for strategic analysis in business can be applied to your life as well.</li>
</ul>
</li>
<li><a href="http://www.bothsidesofthetable.com/2010/07/19/life-is-10-how-you-make-it-and-90-how-you-take-it/" onclick="javascript:pageTracker._trackPageview('/www.bothsidesofthetable.com');">Mark Suster</a> has a pretty good post expanding on my first Christensen takeaway, highlighting a point too often missing from articles in that genre: It&#8217;s not just family; it&#8217;s also all the cool things around us.</li>
<li>I haven&#8217;t gone through the <a href="http://developer.yahoo.com/events/hadoopsummit2010/agenda.html" onclick="javascript:pageTracker._trackPageview('/developer.yahoo.com');">Hadoop Summit archives</a> yet, but it looks as if there&#8217;s a lot of insight there about current Hadoop application activity.</li>
<li>If you&#8217;re a cat lover and don&#8217;t hate simple/traditional music, check out <a href="http://www.marcgunn.com/poetry/labels/cat_songs.shtml" onclick="javascript:pageTracker._trackPageview('/www.marcgunn.com');">Marc Gunn&#8217;s cat filksongs</a>, especially the infectious &#8220;What Shall We Do With a Catnipped Kitty?&#8221; and &#8220;Lord of the Pounce&#8221;, both playable from the right sidebar of that page (#7 and #10 respectively). Gunn is also a chief perpetrator of the justly (in)famous <a href="http://www.thebards.net/" onclick="javascript:pageTracker._trackPageview('/www.thebards.net');">Do Virgins Taste Better?</a> cycle of filksongs.</li>
<li>Former SAP exec Dennis Moore offers a theory as to <a href="http://dbmoore.blogspot.com/2010/05/why-is-in-memory-database-important-to.html" onclick="javascript:pageTracker._trackPageview('/dbmoore.blogspot.com');">why SAP cares so much about in-memory DBMS</a>. It&#8217;s to integrate business processes, because SAP has no other software layer good at doing same. Interestingly, Dennis originated SAP&#8217;s previous attempt at meeting a similar need via its composite applications initiative. However, in Dennis&#8217; view this benefit would only be achieved by a major rewrite of SAP&#8217;s applications.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/07/23/some-interesting-links/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Riptano, and Cassandra adoption</title>
		<link>http://www.dbms2.com/2010/07/06/riptano-and-cassandra-adoption/</link>
		<comments>http://www.dbms2.com/2010/07/06/riptano-and-cassandra-adoption/#comments</comments>
		<pubDate>Tue, 06 Jul 2010 09:11:40 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Market share]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Pricing]]></category>
		<category><![CDATA[Riptano]]></category>
		<category><![CDATA[Specific users]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2480</guid>
		<description><![CDATA[Tonight&#8217;s Cassandra technology post got plenty long enough on its own, so I&#8217;m separating out business and adoption issues here. For starters, known Cassandra users include:

Facebook, which has said it has 	150 or so Cassandra nodes (but see below)
Twitter, which has said it has 45 	or so Cassandra nodes
Rackspace, which used to be 	Jonathan Ellis&#8217; [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Tonight&#8217;s <a href="http://www.dbms2.com/2010/07/06/cassandra-technical-overview/" >Cassandra technology post</a> got plenty long enough on its own, so I&#8217;m separating out business and adoption issues here. For starters, known Cassandra users include:</p>
<ul>
<li>Facebook, which has said it has 	150 or so Cassandra nodes (but see below)</li>
<li>Twitter, which has said it has 45 	or so Cassandra nodes</li>
<li>Rackspace, which used to be 	Jonathan Ellis&#8217; employer, and now is backing Cassandra company 	Riptano</li>
<li>Digg, which along with Twitter and 	Rackspace was one of the three major users helping advance the 	Cassandra project</li>
<li>OpenX, Simple Geo, Digital 	Reasoning, who Jonathan cited as production users in March</li>
<li>Cloudkick, as noted and linked in 	my other post</li>
<li>Two 	customers Riptano named at launch (but I&#8217;ve forgotten who they were*)</li>
</ul>
<p style="margin-bottom: 0in;">Fetlife, Meebo, and others seem to at least have a healthy interest in Cassandra, based on their level of involvement in a forthcoming <a href="http://cassandrasummit2010.eventbrite.com/" onclick="javascript:pageTracker._trackPageview('/cassandrasummit2010.eventbrite.com');">Cassandra Summit</a>. That said, the <a href="http://twitter.com/fetlife" onclick="javascript:pageTracker._trackPageview('/twitter.com');">@Fetlife</a> tweetstream features numerous yelps of pain, and I don&#8217;t mean the recreational kind.  <span id="more-2480"></span></p>
<p style="margin-bottom: 0in;"><em>*And I can&#8217;t easily find a launch press release, whether on the rather minimalist Riptano website or elsewhere.</em></p>
<p style="margin-bottom: 0in;">Beyond that, when Riptano launched in May, the Riptano guys (mainly Jonathan Ellis) said:</p>
<ul>
<li>They were sure there were dozens 	of Cassandra user organizations, maybe even &gt;100. But there 	weren&#8217;t 100s.</li>
<li>Maybe 20-40% of those Cassandra 	sites were in production. (But I don&#8217;t think I&#8217;d multiply that out 	to suggest there were, say, 35-50 production Cassandra users.)</li>
<li>4000 people were going daily to 	the Apache Cassandra site.</li>
<li>There were 250 Cassandra downloads 	daily.</li>
<li>Lots of startups were using 	Cassandra.</li>
<li>Lots of other companies were 	looking at switching over to Cassandra.</li>
<li>Many potential Cassandra users had 	been waiting for a Cassandra company to be available to support it.</li>
<li>The median number of Cassandra 	(production?) nodes is probably 8-10. 4 would be a low end figure.</li>
</ul>
<p style="margin-bottom: 0in;">That&#8217;s a lot of adoption for a not-even-Release-1 open source project. Even so, there&#8217;s a feeling going around that Cassandra has lost some momentum the past couple of months. Most notably, <a href="../2008/07/21/project-cassandra-facebook-open-sourced-quasi-dbms/">Facebook, which created Cassandra in the first place,</a> isn&#8217;t using it for new projects. True, I&#8217;m hearing even less evidence that any one of Membase, Voldemort, <a href="http://www.dbms2.com/2010/05/25/voltdb-finally-launches/" >VoltDB</a>, <a href="http://www.dbms2.com/2010/04/03/akiban-highlights/" >Akiban</a>, <a href="http://www.dbms2.com/2010/05/12/the-clustrix-story/" >Clustrix</a>, or Riak – for example – is setting the world on fire than I am for Cassandra. But the viable Cassandra alternatives are piling up. Cassandra isn&#8217;t the only or even primary game in town, and for that matter I haven&#8217;t heard any concise description of a niche in which Cassandra is the unquestioned leader.</p>
<p style="margin-bottom: 0in;"><em>Edit: <a href="http://twitter.com/EventCloudPro/status/17872687577" onclick="javascript:pageTracker._trackPageview('/twitter.com');">A/the Facebook project that continues to run on Cassandra</a> is Inbox search.</em></p>
<p style="margin-bottom: 0in;">As for Riptano itself:</p>
<ul>
<li>Riptano launched with two founders 	and immediately made an offer to a third guy. I don&#8217;t know how many 	folks they have now, two months later.</li>
<li>Rackspace put some funding into 	Riptano.</li>
<li>Riptano&#8217;s strategy sounds a lot 	like <a href="../2010/06/30/cloudera-enterprise-hadoop-evolution/">Cloudera&#8217;s</a>, 	by which I mean:
<ul>
<li>Riptano&#8217;s business is all 	services, whether training, consulting, or support.</li>
<li>Riptano&#8217;s intended main business 	is obviously support.</li>
<li>Notwithstanding the above, Riptano 	intends to eventually offer proprietary software, bundled with its 	support services.</li>
<li>The first area of focus for that 	proprietary software is intended to be management tools.</li>
<li>I wouldn&#8217;t be surprised if, like 	Cloudera, Riptano tweaks its software focus from “stuff that lets 	us support you better” to “integration with stuff you pay for.” 	Those strategies are actually pretty similar.</li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;">Riptano seems to be starting out with support pricing around $1,000-$4,000/server/year, before quantity discounts.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/07/06/riptano-and-cassandra-adoption/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Cassandra technical overview</title>
		<link>http://www.dbms2.com/2010/07/06/cassandra-technical-overview/</link>
		<comments>http://www.dbms2.com/2010/07/06/cassandra-technical-overview/#comments</comments>
		<pubDate>Tue, 06 Jul 2010 09:10:39 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Amazon and its cloud]]></category>
		<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Riptano]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2473</guid>
		<description><![CDATA[Back in March, I talked with Jonathan Ellis of Rackspace, who runs the Apache Cassandra project. I started drafting a blog post then, but never put it up. Then Jonathan cofounded Riptano, a company to commercialize Cassandra, and so I talked with him again in May. Well, I&#8217;m finally finding time to clear my Cassandra/Riptano [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Back in March, I talked with Jonathan Ellis of Rackspace, who runs the Apache Cassandra project. I started drafting a blog post then, but never put it up. Then Jonathan cofounded Riptano, a company to commercialize Cassandra, and so I talked with him again in May. Well, I&#8217;m finally finding time to clear my Cassandra/Riptano backlog. I&#8217;ll cover the more technical parts below, and the more business- or usage-oriented ones in <a href="http://www.dbms2.com/2010/07/06/riptano-and-cassandra-adoption/" >a companion Cassandra/Riptano post</a>.</p>
<p style="margin-bottom: 0in;">Jonathan&#8217;s core claims for Cassandra include:</p>
<ul>
<li>Cassandra is shared-nothing.</li>
<li>Cassandra has good approaches to 	replication and partitioning, right out of the box.</li>
<li>In particular, Cassandra is good 	for use cases that distribute a database around the world and want 	to access it at “local” latencies. (Indeed, Jonathan asserts 	that non-local replication is a significant non-big-data Cassandra 	use case.)</li>
<li>Cassandra&#8217;s scale-out is 	application-transparent, unlike sharded MySQL&#8217;s.</li>
<li> Cassandra is fast at both appends 	and range queries, which would be hard to accomplish in a pure 	key-value store.</li>
</ul>
<p style="margin-bottom: 0in;">In general, Jonathan positions Cassandra as being best-suited to handle a small number of operations at high volume, throughput, and speed. The rest of what you do, as far as he&#8217;s concerned, may well belong in a more traditional SQL DBMS.  <span id="more-2473"></span></p>
<p style="margin-bottom: 0in;">Further highlights of our talks included, as best I understood them:</p>
<ul>
<li>Cassandra is based in part on both Google&#8217;s <strong>BigTable</strong> paper of 2006 and Amazon&#8217;s <strong>Dynamo</strong> paper of 2007.
<ul>
<li>The core of what Cassandra takes 	from BigTable is based on <strong>log-structured merge trees,</strong> which 	actually entered the computer science literature in 1996.</li>
<li>Cassandra&#8217;s approach to horizontal scaling, replication, failover, etc. seems to be based on Dynamo.</li>
</ul>
</li>
<li>There seems to be <strong>a logical 	concept of “row”</strong> in Cassandra, or it&#8217;s at least meaningful 	to use the SQL/relational concept of a “row” when talking about 	Cassandra data. However, Cassandra is closer to being a <strong>column-based 	data store</strong> than a row-based one. (Not the same thing, but 	closer.)</li>
<li>Even so, it only takes a single 	seek to return a whole Cassandra “row”.</li>
<li>Cassandra 	writes data quite differently from the way a classical OLTP DBMS 	would.
<ul>
<li><strong>Cassandra writes just the data 	elements</strong><span style="font-weight: normal;"> – i.e., fields – </span><strong>that are actually being inserted or changed,</strong> not whole 	rows.</li>
<li>One benefit is that Cassandra data 	is very <strong>sparse.</strong> NULLs aren&#8217;t stored in any way, and hence in 	particular take up no space.</li>
<li>Another benefit – and one of the 	core concepts of Cassandra – is that <strong>you can implicitly assume 	different schemas for different rows of the same “table.”</strong> In 	particular, you can add data for columns that you didn&#8217;t envision 	when you first started storing “rows” of the same “table.”</li>
<li><strong>Writes are collected into 	sorted “memtables,” which from time to time are sent to disk.</strong> Once data gets to disk, it&#8217;s <strong>immutable,</strong> except for occasional 	merge/reorganization/garbage collection.
<ul>
<li>Jonathan claims, plausibly, that 	this makes write throughput very fast (because the I/O is 	fundamentally sequential in nature.)</li>
<li>The default as to how long data typically stays in memory before it gets persisted to disk is “whichever comes first of {64 MB written, 300k updates, 1 hour}”.</li>
</ul>
</li>
<li>Cassandra has <strong>durability</strong> – 	guaranteed non-loss of data – assuming fsync is turned on. fsync 	seems to create a 15% or so overhead.</li>
<li>However, Cassandra has <strong>no 	concept of a “transaction.”</strong></li>
<li>As one would 	expect, data can be read even before it has been persisted to disk.</li>
</ul>
</li>
<li>According to 	Jonathan, Cassandra can do about 14,000 writes or 7,000 reads per 	second, on a quad-core server.
<ul>
<li>Those figures scale pretty 	linearly with the number of servers. (There&#8217;s some overhead for 	network latency.)</li>
<li>Those figures assume a five-column 	row.</li>
<li>Cassandra&#8217;s write-performance 	figures are only “mildly sensitive” to the width of the row. 	E.g., doubling row width only gives a 15-20% throughput hit, due to 	some fixed per-row overhead. That said, I imagine going 100X in row 	width would create a major slowdown, although perhaps while 	measuring width more in bytes than in column count.</li>
<li>Cassandra&#8217;s <span style="color: #000080;"><span lang="zxx"><span style="text-decoration: underline;"><a href="http://racklabs.com/%7Ebwilliam/cassandra/04vs05vs06.png" onclick="javascript:pageTracker._trackPageview('/racklabs.com');">performance</a></span></span></span> has been growing nicely in each point release. Jonathan thinks this 	general trend will continue.</li>
</ul>
</li>
<li>Jonathan thinks Cassandra is 	pretty good at keeping your data safe.
<ul>
<li>Each node has a commit log.</li>
<li>When a node goes down, its writes 	are buffered until it comes back up.</li>
</ul>
</li>
<li>You can run Hadoop MapReduce 	straight against Cassandra files.</li>
<li>A Cassandra node might hold 	anything from 10s of gigabytes to multiple terabytes of data. You 	might want to go with the low end if you want to have lots of cache 	hits.</li>
<li>Solid-state storage would speed up 	Cassandra reads, not writes, and is not widely used with Cassandra 	yet.</li>
<li>Jonathan says Cassandra is really 	good at handling time series data, by which I suspect he means log 	files. <a href="https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/" onclick="javascript:pageTracker._trackPageview('/www.cloudkick.com');">Cloudkick</a> is a user of this capability.</li>
</ul>
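<p style="margin-bottom: 0in;">Here is a toy model of the write path described above: field-level writes land in an in-memory memtable, which is periodically frozen into an immutable sorted table, and reads merge the two. The class and method names are invented for illustration, the flush trigger is a simple update count rather than the real size/count/time rule, and none of this is Cassandra&#8217;s actual code or API.</p>

```python
# Toy model of the write path described above; not Cassandra's real code or API.
class ToyColumnStore:
    def __init__(self, flush_after_updates=3):
        # Assumed tiny threshold for the demo; the post cites
        # 64 MB written, 300k updates, or 1 hour, whichever comes first.
        self.flush_after_updates = flush_after_updates
        self.memtable = {}    # row key -> {column: value}, still in memory
        self.sstables = []    # immutable, sorted snapshots standing in for disk files
        self.pending_updates = 0

    def write(self, row_key, **columns):
        """Store only the fields supplied; absent columns take no space."""
        self.memtable.setdefault(row_key, {}).update(columns)
        self.pending_updates += 1
        if self.pending_updates >= self.flush_after_updates:
            self.flush()

    def flush(self):
        """Freeze the memtable as a sorted, read-only table."""
        if self.memtable:
            frozen = tuple(
                (key, tuple(sorted(cols.items())))
                for key, cols in sorted(self.memtable.items())
            )
            self.sstables.append(frozen)
        self.memtable = {}
        self.pending_updates = 0

    def read(self, row_key):
        """Merge the row from the oldest table to the newest, then the memtable."""
        row = {}
        for table in self.sstables:
            for key, cols in table:
                if key == row_key:
                    row.update(dict(cols))
        row.update(self.memtable.get(row_key, {}))
        return row
```

<p style="margin-bottom: 0in;">Note that a row can pick up a brand-new column after earlier columns have already been flushed, which is the “different schemas for different rows” point, and that data is readable before it ever reaches a frozen table. Commit logs, fsync, and merge/garbage collection are left out.</p>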
<p style="margin-bottom: 0in;">I certainly didn&#8217;t grasp everything about Cassandra replication and partitioning strategies. That wasn&#8217;t the focus of our talks, and anyway I got the impression they are so flexible that there&#8217;s little that can firmly be said about them. But I did get the impressions:</p>
<ul>
<li>You set your consistency rules in 	the Cassandra API, not on a per-table basis. (I think this means 	that a lack of administrative tools is supposedly a feature, not a 	drawback.)</li>
<li>As a practical matter, Cassandra 	users commonly take one of two approaches to consistency:
<ul>
<li><a href="http://www.dbms2.com/2010/05/01/ryw-read-your-writes-consistency/" >RYW consistency</a>, most 	commonly with N = 3 and R = W = 2.</li>
<li>Geographically dispersed eventual 	consistency.</li>
</ul>
</li>
<li>Cassandra data is most commonly 	distributed via consistent hashing, but other options are 	“pluggable.”</li>
<li>If you add a node, the busiest node automagically decides to ship some data over, reducing its load. Of course, this only works if you get the new node on before the old node is so maxed out it doesn&#8217;t have time to do the shipping.</li>
</ul>
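<p style="margin-bottom: 0in;">Two of the ideas above lend themselves to a short sketch. First, N = 3 with R = W = 2 gives read-your-writes behavior because R + W exceeds N, so every read quorum overlaps every write quorum. Second, consistent hashing means that adding a node only moves the keys that the new node takes over. The code below is illustrative only: the function and class names are made up, SHA-256 is an arbitrary choice of hash, each node gets a single position on the ring, and the “busiest node ships data” behavior is not modeled.</p>

```python
# Illustrative only: quorum arithmetic plus a toy consistent-hash ring.
import bisect
import hashlib
from itertools import combinations


def read_your_writes(n, r, w):
    """Read and write quorums must overlap when R + W exceeds N."""
    return r + w > n


def quorums_always_overlap(n, r, w):
    """Brute-force check of the same claim over all replica subsets."""
    replicas = range(n)
    return all(
        set(write_set).intersection(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


def _token(text):
    """Position of a node name or key on the ring (hash choice is an assumption)."""
    return int(hashlib.sha256(text.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes):
        self.tokens = sorted((_token(node), node) for node in nodes)

    def owner(self, key):
        """First node clockwise from the key's position on the ring."""
        position = bisect.bisect(self.tokens, (_token(key), ""))
        return self.tokens[position % len(self.tokens)][1]
```

<p style="margin-bottom: 0in;">With these definitions, the N = 3, R = W = 2 setup passes both checks, while N = 3 with R = W = 1 fails them, which is the eventual-consistency end of the spectrum. Comparing a three-node ring with a four-node ring, every key that changes owner ends up on the newly added node.</p>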
<p style="margin-bottom: 0in;">When we talked in March, the next release of Cassandra was going to be 0.7. Cassandra 0.7 was going to be a performance/scalability release, for example fixing the flaw that garbage collection read rows into memory one at a time. After that, Cassandra 0.8 was to be a feature release, with one planned feature being more automatic index management and/or materialized-view-like capability, so as to reduce the burden on Cassandra developers of schema management.</p>
<p style="margin-bottom: 0in;"><em><strong>Related links</strong></em></p>
<ul>
<li>My March <a href="../2010/03/12/some-nosql-links/">NoSQL links post</a> included the Google and Amazon papers</li>
<li>The <a href="https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/" onclick="javascript:pageTracker._trackPageview('/www.cloudkick.com');">March 	2, 2010 Cloudkick post</a> also linked above goes into a lot of 	detail, including what they think is great about Cassandra and what 	they think is still missing</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/07/06/cassandra-technical-overview/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>
