<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; DataStax</title>
	<atom:link href="http://www.dbms2.com/category/products-and-vendors/riptano/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Thu, 09 Feb 2012 09:21:51 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>Couchbase update</title>
		<link>http://www.dbms2.com/2012/02/01/couchbase-update/</link>
		<comments>http://www.dbms2.com/2012/02/01/couchbase-update/#comments</comments>
		<pubDate>Thu, 02 Feb 2012 04:00:24 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Basho and Riak]]></category>
		<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[CouchDB]]></category>
		<category><![CDATA[Couchbase]]></category>
		<category><![CDATA[DataStax]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[MongoDB and 10gen]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[Zynga]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5877</guid>
		<description><![CDATA[I checked in with James Phillips for a Couchbase update, and I understand better what&#8217;s going on. In particular: Give or take minor tweaks, what I wrote in my August, 2011 Couchbase updates still applies. Couchbase now and for the foreseeable future has one product line, called Couchbase. Couchbase 2.0, the first version of Couchbase [...]]]></description>
			<content:encoded><![CDATA[<p>I checked in with James Phillips for a Couchbase update, and I understand better what&#8217;s going on. In particular:</p>
<ul>
<li>Give or take minor tweaks, what I wrote in my <a href="../../../../../2011/08/13/couchbase-business-update/">August, 2011 Couchbase updates</a> still applies.</li>
<li>Couchbase now and for the foreseeable future has one product line, called Couchbase.</li>
<li>Couchbase 2.0, the first version of Couchbase (the product) to use CouchDB for persistence, has slipped &#8230;</li>
<li>&#8230; because more parts of CouchDB had to be rewritten for performance than Couchbase (the company) had hoped.</li>
<li>Think mid-year or so for the release of Couchbase 2.0, hopefully sooner.</li>
<li>In connection with the need to rewrite parts of CouchDB, Couchbase has:
<ul>
<li><a href="../../../../../2012/01/18/notes-from-the-couch-blogs/">Gotten out of the single-server CouchDB business</a>.</li>
<li>Donated its proprietary single-server CouchDB intellectual property to the Apache Foundation.</li>
</ul>
</li>
<li>The 150ish new customers in 2011 Couchbase brags about are real, subscription customers.</li>
<li>Couchbase has 60ish people, headed to &gt;100 over the next few months.</li>
</ul>
<p><span id="more-5877"></span><em>If you previously heard the brand names Couchbase Single or Couchbase Mobile, pay no further attention to them. Couchbase Single was CouchDB; Couchbase Mobile is part of Couchbase&#8217;s feature set.</em></p>
<p>The current product is Couchbase 1.8, which is a whole lot like what previously was called Membase. New features in Couchbase 1.8 (versus prior versions of Membase) were concentrated in client libraries/SDK (Software Development Kit). Not coincidentally, Couchbase has hired developer evangelists who are in charge of making Couchbase play nicely with various specific languages (e.g. C/C++).</p>
<p>Drilling down further into the CouchDB part of the story:</p>
<ul>
<li>Couchbase 2.0 will replace Couchbase 1.8/Membase&#8217;s SQLite back-end with CouchDB.</li>
<li>Parts of CouchDB that do things like read, write, or compact data have been rewritten from Erlang to C.</li>
<li>Couchbase still uses other Erlang parts of Apache CouchDB, and would be delighted if the community were to usefully enhance them.</li>
<li>Couchbase&#8217;s heavy contributions to development of open source CouchDB will, for the most part, continue.</li>
<li>CouchDB stuff donated to the Apache Foundation includes:
<ul>
<li>Documentation</li>
<li>Packaging</li>
<li>Performance enhancements</li>
</ul>
</li>
</ul>
<p>There&#8217;s at least one Couchbase user with &gt;1000 nodes (at a guess, <a href="../../../../../2011/09/05/zynga-linkedin-data-warehous/">Zynga</a>). More typical might be 20 nodes or fewer. This led me to wonder how much data one puts on a Couchbase node anyway. The answer turns out to vary widely, in that you want your working set to be in RAM, and whether that&#8217;s your entire database or just a slice of it depends on the nature of the application.</p>
<p>James echoed a trend I&#8217;ve heard elsewhere as well, in which products one thinks of as being internet-specific are also sold in a few cases to conventional enterprises for &#8212; you guessed it! &#8212; their internet operations. I also asked him about competition, and he asserted:</p>
<ul>
<li>MongoDB is the big competition. He believes Couchbase has an excellent win rate vs. 10gen for actual paying accounts.</li>
<li>DataStax/Cassandra wins over Couchbase only when multi-data-center capability is important. Naturally, multi-data-center capability is planned for Couchbase. (Indeed, that&#8217;s one of the benefits of swapping in CouchDB at the back end.)</li>
<li>Redis has &#8220;dropped off the radar&#8221;, presumably because there&#8217;s no particular persistence strategy for it.</li>
<li>Riak doesn&#8217;t show up much.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/02/01/couchbase-update/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Highlights of a busy news week</title>
		<link>http://www.dbms2.com/2011/09/26/highlights-of-a-busy-news-week/</link>
		<comments>http://www.dbms2.com/2011/09/26/highlights-of-a-busy-news-week/#comments</comments>
		<pubDate>Mon, 26 Sep 2011 05:50:35 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[DataStax]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Ingres]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[VectorWise]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5372</guid>
		<description><![CDATA[I put up 14 posts over the past week, so perhaps you haven&#8217;t had a chance yet to read them all. Highlights included: My most important post of the week was a general guide to IT vendor strategy. That one has already spawned discussion at many companies, from the tiny to the multi-billion-dollar. The best [...]]]></description>
			<content:encoded><![CDATA[<p>I put up 14 posts over the past week, so perhaps you haven&#8217;t had a chance yet to read them all. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  Highlights included:</p>
<ul>
<li>My most important post of the week was a general <a href="http://www.strategicmessaging.com/strategy-for-it-vendors-a-worksheet/2011/09/18/">guide to IT vendor strategy</a>. That one has already spawned discussion at many companies, from the tiny to the multi-billion-dollar.</li>
<li>The best comment thread of the week was probably on my post about <a href="http://www.dbms2.com/2011/09/19/oltp-disk-solid-state/">scale-out relational OLTP choices</a>, in which people discussed the merits of various particular alternatives.</li>
<li>I recommended that people strongly consider attending <a href="http://www.dbms2.com/2011/09/20/xldb-the-one-conference-i-like-to-go-to/">XLDB 5 in Menlo Park on October 18-19</a>.</li>
</ul>
<p>Most of the posts, however, were reactions to news events. In particular:</p>
<ul>
<li>Teradata announced that <a href="http://www.dbms2.com/2011/09/22/teradata-columnar-compression/">Teradata 14 will be hybrid-columnar</a>, more in Vertica&#8217;s way than in Greenplum&#8217;s or Aster Data&#8217;s. (Pay no attention to the <em>Wall Street Journal&#8217;s</em> apparent belief that <a href="http://www.dbms2.com/2011/09/22/hybrid-columnar-soundbites/">no other analytic DBMS is hybrid-columnar at all</a>.)</li>
<li>Aster announced the unsurprising news that there will be a Teradata Aster appliance. Also, <a href="http://www.dbms2.com/2011/09/22/aster-database-release-5-and-teradata-aster-appliance/">Aster talked about greater analytic flexibility in the forthcoming Aster 5.0</a>.</li>
<li>With Oracle OpenWorld coming up, Oracle decided to get some of its announcing out of the way early. In particular, it announced the <a href="http://www.dbms2.com/2011/09/21/oracle-database-appliance-soundbites/">Oracle Database Appliance</a>, which is small-business-friendly hardware for running the Oracle DBMS. However, the Oracle Database Appliance doesn&#8217;t seem to do much about the complexity of running the Oracle DBMS software.</li>
<li>In <a href="http://www.dbms2.com/2011/09/23/hadoop-appliances/">a catch-all Hadoop post</a>, I noted that:
<ul>
<li>Oracle has now clearly said it has a Hadoop appliance coming, no doubt next week at OpenWorld.</li>
<li>I still can&#8217;t see why Hadoop appliances would succeed, but a lot of smart folks seem to disagree with me.</li>
<li>Greenplum announced what looks like a nice but unimportant little product upgrade.</li>
<li>It&#8217;s a really good thing that previously reported plans to revamp Hadoop are underway.</li>
</ul>
</li>
<li>DataStax announced that <a href="http://www.dbms2.com/2011/09/22/datastax-pivots-back-to-its-original-strategy/">it really is a Cassandra company after all</a>. Pay no attention to previous marketing that seemed to put DataStax in the same Hadoop-alternative category as, say, MapR.</li>
<li><a href="../2011/09/25/ingres-actian/">Ingres has changed its name to Actian</a>. The announcement seems like a confession that Ingres and VectorWise are going nowhere.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/09/26/highlights-of-a-busy-news-week/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>DataStax pivots back to its original strategy</title>
		<link>http://www.dbms2.com/2011/09/22/datastax-pivots-back-to-its-original-strategy/</link>
		<comments>http://www.dbms2.com/2011/09/22/datastax-pivots-back-to-its-original-strategy/#comments</comments>
		<pubDate>Thu, 22 Sep 2011 23:23:12 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[DataStax]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Open source]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5331</guid>
		<description><![CDATA[The DataStax and Cassandra stories are somewhat confusing. Unfortunately, DataStax chose to clarify them in what has turned out to be a crazy news week. I&#8217;m going to use this post just to report on the status of the DataStax product line, without going into any analysis beyond that. Pro tip: If you choose to [...]]]></description>
			<content:encoded><![CDATA[<p>The DataStax and Cassandra stories are somewhat confusing. Unfortunately, DataStax chose to clarify them in what has turned out to be a crazy news week. I&#8217;m going to use this post just to report on the status of the DataStax product line, without going into any analysis beyond that.</p>
<p><span id="more-5331"></span><em>Pro tip: If you choose to announce at a conference where many other vendors will surely announce news also, you naturally run the risk of not garnering much attention.</em></p>
<p>For starters, it may help to realize or recall that:</p>
<ul>
<li><a href="http://www.dbms2.com/2008/07/21/project-cassandra-facebook-open-sourced-quasi-dbms/">Cassandra was originally developed and revealed at Facebook</a>, to much early NoSQL fanfare. Facebook later backed away from Cassandra use.</li>
<li>Rackspace guys in Texas became Cassandra&#8217;s biggest backers. They eventually founded a company called Riptano to commercialize Cassandra.</li>
<li>Texas company Riptano became the California company DataStax.</li>
<li>DataStax came out with a <a href="http://www.dbms2.com/2011/03/23/datastax-cassandrafs-hadoop-brisk/">Hadoop-on-Cassandra offering called Brisk</a>. For a while, it sounded as if Hadoop was as big a focus for DataStax as Cassandra is.</li>
<li>DataStax is now recommitted to being <strong>the Cassandra company,</strong> and has accordingly backed away from Hadoop and Brisk as a separate or coequal focus. However, it sees Hadoop capability as a nice, or even major, feature of its Cassandra-centric offering.</li>
<li>To finalize its open source obligations with respect to Brisk, DataStax is in essence:
<ul>
<li>Donating a Hive driver for Cassandra straight into the main Apache Cassandra project.</li>
<li>Releasing the rest of Brisk as a separate open source project.</li>
<li>Disclaiming interest in further advancing open source Brisk.</li>
</ul>
</li>
<li>There&#8217;s also something called Solandra &#8212; evidently SOLR-on-Cassandra &#8212; whose status is similar to Brisk&#8217;s.</li>
<li>There are three main ways that DataStax helps you to consume Cassandra.
<ul>
<li>DataStax is the principal sponsor of Apache Cassandra development, and presumably long will be.<strong> Apache Cassandra </strong>is both<strong> free-like-speech and free-like-beer.</strong></li>
<li>DataStax is also introducing a paid-subscription version of Cassandra called DataStax Enterprise, which features proprietary code, support, and so on. <strong>DataStax Enterprise </strong>is <strong>neither free-like-speech nor free-like-beer.</strong></li>
<li>There will also be something called DataStax Community Edition. <strong>DataStax Community Edition </strong>is<strong> free-like-beer, </strong>but<strong> not free-like-speech.</strong></li>
</ul>
</li>
</ul>
<p>Various posts on the <a href="http://www.datastax.com/blog">DataStax blog</a> give DataStax&#8217;s explanation of what it&#8217;s doing. Ben Werther, the ex-Greenplum guy who briefly worked at DataStax and was most associated with telling the Hadoop/Brisk story, has moved on to his own startup Platfora.</p>
<p>DataStax Enterprise has three main aspects:</p>
<ul>
<li><strong>DataStax Server,</strong> which is the actual database and analytics code. At this time, there is little closed-source code in DataStax Server, but DataStax reserves the right to widen that gap in the future.</li>
<li><strong>DataStax OpsCenter,</strong> which is management tools around DataStax Server. DataStax OpsCenter is entirely closed-source, even though DataStax gives a limited version away for free.</li>
<li><strong>Support.</strong></li>
</ul>
<p>To describe DataStax Community Edition, I&#8217;ll just quote the press release verbatim, which characterizes it as:</p>
<blockquote><p>&#8230; a free platform based on Apache Cassandra that bundles the open source database with smart installers, drivers and connectors for popular development languages, demo apps, documentation, and a free version of DataStax OpsCenter for Apache Cassandra.</p></blockquote>
<p>DataStax Community Edition is crippleware only in terms of feature set; there are no limitations on its database size, cluster size, or usage rights. A core mission of DataStax Community Edition is to create happy Cassandra users, who may then become customers for DataStax Enterprise.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/09/22/datastax-pivots-back-to-its-original-strategy/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Aster Data business trends</title>
		<link>http://www.dbms2.com/2011/09/08/aster-data-business-trends/</link>
		<comments>http://www.dbms2.com/2011/09/08/aster-data-business-trends/#comments</comments>
		<pubDate>Thu, 08 Sep 2011 05:33:56 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Application areas]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[DataStax]]></category>
		<category><![CDATA[Liberty and privacy]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5204</guid>
		<description><![CDATA[Last month, I reviewed with the Aster Data folks which markets they were targeting and selling into, subsequent to acquisition by their new orange overlords. The answers aren&#8217;t what they used to be. Aster no longer focuses much on what it used to call frontline (i.e., low-latency, operational) applications; those are of course a key [...]]]></description>
			<content:encoded><![CDATA[<p>Last month, I reviewed with the Aster Data folks which markets they were targeting and selling into, subsequent to <a href="../../../../../2011/03/04/teradata-aster-data-ncluster/">acquisition</a> by their new orange overlords. The answers aren&#8217;t what they used to be. Aster no longer focuses much on what it used to call <a href="../../../../../2008/10/22/aster-data-systems-ncluster/">frontline</a> (i.e., low-latency, operational) applications; those are of course a key strength for Teradata. Rather, Aster focuses on <a href="../../../../../2011/03/03/investigative-analytics/">investigative analytics</a> &#8212; they&#8217;ve long <a href="../../../../../2011/02/12/upcoming-webinar-on-investigative-analytics/">endorsed</a> my use of the term &#8212; and on the batch run/scoring kinds of applications that inform operational systems.</p>
<p><span id="more-5204"></span>Also, Aster no longer focuses much on the general internet industry where it got its earliest sales, its <a href="../../../../../2011/09/05/zynga-linkedin-data-warehous/">continued success at LinkedIn</a> and a recent win at <span style="text-decoration: line-through;">an (NDA) fairly-big-name internet new account</span> <em>Razorfish</em> notwithstanding. That said, the first target market Aster did share with me was &#8220;digital marketing optimization,&#8221; which includes &#8220;marketing optimization&#8221; (duh), search engine optimization (SEO), clickstream analysis, and the like. Also, Aster is going after &#8220;data scientists&#8221; in general, and that&#8217;s a term I&#8217;m still seeing used most frequently in the internet area.</p>
<p><em>I&#8217;m seeing ever more granularity as companies break down internet-related market segments. DataStax showed me a chart last week of 15 different market segments it had sold into, and at least 14 were in some way internet-related.</em></p>
<p>Rather, if Aster is to name three industries in which it has pleasingly strong sales traction, it would say manufacturing (which in Teradata lingo includes resource extraction), financial services (including insurance), and retail. A cynic might note that that breakdown, like many similar ones, adds up to fairly large swaths of the economy and the computer market, but never mind that part. (Other firms might have thrown in telecommunications and health care as well, to get even more coverage.)</p>
<p>Two of Aster&#8217;s other favorite application areas are social network analysis/influencer identification and &#8212; which is analytically very similar &#8212; fraud detection/prevention. Taken together, that&#8217;s a whole lot of graph analysis. And I note with interest that the influencer identification stuff does NOT seem to be concentrated in telecom, which is the traditional sector one would imagine it being used in; all those call records are a lovely source of graph edges. Rather, the influencers seem to be identified from sources such as social media and credit card data.</p>
<p><em>Once again, this kind of thing gives me privacy jitters.</em></p>
<p>The match between Aster&#8217;s favorite industries and application areas is pretty much as you might expect &#8212; fraud in financial services, influencer analysis in retailing (and probably consumer financial services too), and digital marketing in both. As for manufacturing, the opportunities there seem to be focused on machine-generated data. That would be at least in high-tech manufacturing (I bet especially in flow-oriented stuff such as semiconductor fab) and oil/gas. Smart grid opportunities don&#8217;t seem to have arisen yet for Aster the way they have for a couple other vendors.</p>
<p>As for general Aster business trends, I think they&#8217;re good, while Aster would perhaps want to portray them as very good. Aster named a couple of impressive joint Teradata/Aster wins under NDA, but only a couple. Ramping up sales headcount is proving challenging, and some sales leadership turnover probably hasn&#8217;t helped. I do believe Aster&#8217;s spin that this is a matter of somebody being promoted quickly to a bigger job, and am optimistic about the current team &#8212; still, such moves tend to have at least short-term cost.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/09/08/aster-data-business-trends/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Alternatives for Hadoop/MapReduce data storage and management</title>
		<link>http://www.dbms2.com/2011/05/14/hadoop-mapreduce-data-storage-management/</link>
		<comments>http://www.dbms2.com/2011/05/14/hadoop-mapreduce-data-storage-management/#comments</comments>
		<pubDate>Sat, 14 May 2011 05:00:52 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[DataStax]]></category>
		<category><![CDATA[EMC]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Hadapt]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[MongoDB and 10gen]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[Parallelization]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4438</guid>
		<description><![CDATA[There&#8217;s been a flurry of announcements recently in the Hadoop world. Much of it has been concentrated on Hadoop data storage and management. This is understandable, since HDFS (Hadoop Distributed File System) is quite a young (i.e. immature) system, with much strengthening and Bottleneck Whack-A-Mole remaining in its future. Known HDFS and Hadoop data storage [...]]]></description>
			<content:encoded><![CDATA[<p>There&#8217;s been a flurry of announcements recently in the Hadoop world. Much of it has been concentrated on Hadoop data storage and management. This is understandable, since HDFS (Hadoop Distributed File System) is quite a young (i.e. immature) system, with much strengthening and <a href="../../../../../2009/08/21/bottleneck-whack-a-mole/">Bottleneck Whack-A-Mole</a> remaining in its future.</p>
<p>Known HDFS and Hadoop data storage and management issues include but are not limited to:</p>
<ul>
<li>Hadoop is run by a master node, and specifically a namenode, that&#8217;s a single point of failure.</li>
<li>HDFS compression could be better.</li>
<li>HDFS likes to store three copies of everything, whereas many DBMS and file systems are satisfied with two.</li>
<li>Hive (the canonical way to do SQL joins and so on in Hadoop) is slow.</li>
</ul>
<p>Different entities have different ideas about how such deficiencies should be addressed.  <span id="more-4438"></span></p>
<p>For most practical purposes, <strong>Yahoo&#8217;s</strong> and <strong>IBM&#8217;s</strong> views about Hadoop have converged. Yahoo and IBM both believe that Hadoop data storage should be advanced solely through the <strong>Apache</strong> Hadoop open source process. In particular:</p>
<ul>
<li>IBM and Yahoo both talk of the great undesirability of Hadoop &#8220;forking&#8221; like Unix did.</li>
<li>Yahoo appeared on stage at IBM&#8217;s analyst event this week to reinforce the meeting-of-the-minds, even though there&#8217;s no IBM/Yahoo customer relationship involved.</li>
<li>IBM has disclaimed any intention of providing its own Hadoop distribution, but even so is committed to selling lots of <a href="http://www-01.ibm.com/software/data/bigdata/enterprise.html">IBM InfoSphere BigInsights</a>, which incorporates Apache Hadoop.*</li>
<li><a href="http://developer.yahoo.com/blogs/hadoop/posts/2011/01/announcement-yahoo-focusing-on-apache-hadoop-discontinuing-the-yahoo-distribution-of-hadoop/">Yahoo has stopped offering its own Hadoop distribution</a>, period.</li>
</ul>
<p><em>*IBM is emphatic about ruling out marketing terms whose connotation it doesn&#8217;t like. IBM&#8217;s Hadoop distribution isn&#8217;t a &#8220;distribution,&#8221; because that might make it sound too proprietary; IBM&#8217;s Oracle emulation offering <a href="../../../../../2009/04/24/ibms-oracle-emulation-strategy-reconsidered/#comment-118444">isn&#8217;t an &#8220;emulation&#8221; offering</a>, because that might make it sound too slow; and <a href="../../../../../2009/05/13/ibm-system-s-infosphere-streams-processing/">IBM&#8217;s CEP product InfoSphere Streams isn&#8217;t a &#8220;CEP&#8221; product</a>, because that might make it sound too non-functional.</em></p>
<p><strong>Cloudera</strong> can probably be regarded as part of the Yahoo/IBM camp, some stern looks from IBM in Cloudera&#8217;s direction notwithstanding. <a href="../../../../../2010/06/30/cloudera-enterprise-hadoop-evolution/">Cloudera Enterprise</a> &#8212; also an embrace-and-extend offering &#8212; remains the obvious choice for enterprise Hadoop users; meanwhile, nobody has convinced me of any bogosity in <a href="http://www.cloudera.com/hadoop/">the &#8220;no forking&#8221; claim Cloudera makes for its free/open source Hadoop distribution</a>. Indeed, when I visited Cloudera a couple of weeks ago, Mike Olson showed me a slide demonstrating that Cloudera might be supplanting Yahoo as the biggest ongoing contributor to Apache Hadoop.</p>
<p><strong>EMC&#8217;s Data Computing Division, </strong>n&#233;e <strong>Greenplum,</strong> made a lot of Hadoop noise this week. Unlike Yahoo, IBM, and Cloudera, EMC really is forking Hadoop. <a href="../../../../../2011/04/05/comments-on-emc-greenplum/">I&#8217;m not talking with the EMC/Greenplum folks</a> these days, but the whole thing was covered from various angles by <a href="http://www.computerworld.com/s/article/9216541/EMC_unveils_Hadoop_appliance_BI_software">Lucas Mearian</a>, <a href="http://www.informationweek.com/news/software/info_management/229403178">Doug Henschen</a>, <a href="http://gigaom.com/cloud/emc-hadoop/">Derrick Harris</a>, and <a href="http://davidmenninger.ventanaresearch.com/2011/05/12/emc-enters-elephant-race-with-hadoop/">Dave Menninger</a>.</p>
<p>Another option is to entirely replace HDFS with a DBMS, whether distributed or just instanced at each node. <strong>DataStax</strong> is doing that with <a href="../../../../../2011/03/23/datastax-cassandrafs-hadoop-brisk/">Cassandra-based Brisk</a>; <strong><a href="../../../../../2011/03/23/hadapt-commercialized-hadoopdb/">Hadapt</a></strong> plans to do that with PostgreSQL and VectorWise <em>(edit: As per the comment below, Hadapt only plans a partial replacement of HDFS);</em> and <a href="../../../../../2011/04/17/netezza-twinfin-i-class-overview/">Netezza&#8217;s analytic platform</a> has a Hadoop-over-<strong>Netezza</strong> option as well. Mike Olson objects to such implementations being called &#8220;Hadoop&#8221;; but trademark issues aside, those vendors plan to support a broad variety of Hadoop-compatible tools. <strong>Aster Data</strong> has long taken that approach one step further, by offering an enhanced version of MapReduce &#8212; aka <a href="../../../../../2009/12/02/mapreduce-for-complex-analytics-webina/">SQL/MapReduce</a> &#8212; over its nCluster DBMS. And <a href="../../../../../2011/04/04/the-mongodb-story/"><strong>10gen</strong> offers a more primitive form of MapReduce with MongoDB</a>, but probably wouldn&#8217;t position it as addressing a &#8220;MapReduce market&#8221; at all.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/05/14/hadoop-mapreduce-data-storage-management/feed/</wfw:commentRss>
		<slash:comments>21</slash:comments>
		</item>
		<item>
		<title>DataStax introduces a Cassandra-based Hadoop distribution called Brisk</title>
		<link>http://www.dbms2.com/2011/03/23/datastax-cassandrafs-hadoop-brisk/</link>
		<comments>http://www.dbms2.com/2011/03/23/datastax-cassandrafs-hadoop-brisk/#comments</comments>
		<pubDate>Wed, 23 Mar 2011 17:38:18 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[DataStax]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Open source]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4054</guid>
		<description><![CDATA[Cassandra company DataStax is introducing a Hadoop distribution called Brisk, for use cases that combine short-request and analytic processing. Brisk in essence replaces HDFS (Hadoop Distributed File System) with a Cassandra-based file system called CassandraFS. The whole thing is due to be released (Apache open source) within the next 45 days. The core claims for [...]]]></description>
			<content:encoded><![CDATA[<p><a href="../../../../../2011/02/01/datastax-opscenter-cassandra/">Cassandra company DataStax</a> is introducing a Hadoop distribution called Brisk, for use cases that combine <a href="../../../../../2011/03/02/short-request-processing/">short-request</a> and analytic processing. Brisk in essence replaces HDFS (Hadoop Distributed File System) with a Cassandra-based file system called CassandraFS. The whole thing is due to be released (Apache open source) within the next 45 days.</p>
<p>The core claims for Cassandra/Brisk/CassandraFS are:</p>
<ul>
<li><strong>CassandraFS has the same interface as HDFS.</strong> So, in particular, you should be able to use most Hadoop add-ons with Brisk.</li>
<li><strong>CassandraFS has comparable performance to HDFS on sequential scans.</strong> That&#8217;s without predicate pushdown to Cassandra, which is Coming Soon but won&#8217;t be in the first Brisk release.</li>
<li><strong>Brisk/CassandraFS is much easier to administer than HDFS.</strong> In particular, there are no NameNodes, JobTracker single points of failure, or any other form of <strong>head node.</strong> Brisk/CassandraFS is strictly peer-to-peer.</li>
<li>Cassandra is far superior to HBase for short-request use cases, specifically with <strong>5-6X the random-access performance.</strong></li>
</ul>
<p>There&#8217;s a pretty good white paper around all this, which also recites general Cassandra claims <em>&#8211; [edit] and here at last is the <a href="http://www.datastax.com/wp-content/uploads/2011/03/WP-Brisk.pdf">link</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/03/23/datastax-cassandrafs-hadoop-brisk/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Cassandra company DataStax (formerly Riptano) is on track</title>
		<link>http://www.dbms2.com/2011/02/01/datastax-opscenter-cassandra/</link>
		<comments>http://www.dbms2.com/2011/02/01/datastax-opscenter-cassandra/#comments</comments>
		<pubDate>Tue, 01 Feb 2011 09:26:36 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[DataStax]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Telecommunications]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=3706</guid>
		<description><![CDATA[Riptano, the Cassandra company, has changed its name to DataStax. DataStax has opened headquarters in Burlingame and hired some database-experienced folks – notably Ben Werther from Greenplum and Michael Weir from ParAccel, with Zenobia Godschalk (who worked with Aster Data) somewhere in the outside PR mix. Other than that, what&#8217;s new at DataStax is pretty [...]]]></description>
			<content:encoded><![CDATA[<p>Riptano, the Cassandra company, has changed its name to DataStax. DataStax has opened headquarters in Burlingame and hired some database-experienced folks – notably Ben Werther from Greenplum and Michael Weir from ParAccel, with Zenobia Godschalk (who worked with Aster Data) somewhere in the outside PR mix. Other than that, what&#8217;s new at DataStax is pretty much what could have been expected based on <a href="../../../../../2010/07/06/riptano-and-cassandra-adoption/">what DataStax folks said last spring</a>.</p>
<p>Most notably, DataStax is introducing a software offering, whose full name is DataStax OpsCenter for Apache Cassandra. DataStax OpsCenter for Apache Cassandra seems to be, in essence, a monitoring tool for Cassandra clusters, with a bit of capacity planning bundled in. (If there are any outright operations parts to DataStax OpsCenter, they got overlooked in our conversation.)* <span id="more-3706"></span>OpsCenter has been in beta at a few places, with another beta version rolled out recently.</p>
<p><em>*Yeah, DataStax OpsCenter Release 1 sounds pretty boring. But it&#8217;s apt to be useful even so. And cooler stuff should come down the pike later on.</em></p>
<p>There will of course be a free-download version of DataStax OpsCenter, entirely uncrippled; you&#8217;re just not allowed to use free-download DataStax OpsCenter with production applications. Production users of DataStax OpsCenter will need subscriptions. Much like Cloudera, DataStax is bundling product and support subscriptions, so that you can&#8217;t buy one without the other. The current Gold/Silver/Bronze trichotomy will be slimmed down to Mission-Critical/Premier, and you&#8217;ll be allowed to have different levels for different application clusters.</p>
<p>Finally, a few customer notes:</p>
<ul>
<li>DataStax has &gt;50 subscription support customers.</li>
<li>One DataStax customer has 400 Cassandra nodes.</li>
<li>DataStax&#8217;s major industry segments are web (of course), government/intelligence, and telecom.</li>
<li>Separately – I&#8217;m not sure why this is separate – DataStax thinks the next market it will penetrate is real-time analytics. That means online counts or other aggregations, although presumably not at a Skytide level of sophistication.</li>
<li><a href="http://techblog.netflix.com/2011/01/nosql-at-netflix.html">Netflix has nice things to say about HBase and Cassandra alike</a>.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/02/01/datastax-opscenter-cassandra/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>More on NoSQL and HVSP (or OLRP)</title>
		<link>http://www.dbms2.com/2010/08/26/nosql-hvsp-olrp/</link>
		<comments>http://www.dbms2.com/2010/08/26/nosql-hvsp-olrp/#comments</comments>
		<pubDate>Thu, 26 Aug 2010 09:10:31 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Akiban]]></category>
		<category><![CDATA[Basho and Riak]]></category>
		<category><![CDATA[Cache]]></category>
		<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Clustrix]]></category>
		<category><![CDATA[CouchDB]]></category>
		<category><![CDATA[DataStax]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Object]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Schooner Information Technology]]></category>
		<category><![CDATA[Theory and architecture]]></category>
		<category><![CDATA[Tokutek]]></category>
		<category><![CDATA[memcached]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2907</guid>
		<description><![CDATA[Since posting last Wednesday morning that I&#8217;m looking into NoSQL and HVSP, I&#8217;ve had a lot of conversations, including with (among others): Dwight Merriman of 10gen (MongoDB) Damien Katz of Couchio (CouchDB) Matt Pfeil of Riptano (Cassandra) Todd Lipcon of Cloudera (HBase committer) Tony Falco of Basho (Riak) John Busch of Schooner Ori Herrnstadt of [...]]]></description>
			<content:encoded><![CDATA[<p>Since posting last Wednesday morning that <a href="http://www.dbms2.com/2010/08/18/nosql-hvsp-adoption/">I&#8217;m looking into NoSQL and HVSP</a>, I&#8217;ve had a lot of conversations, including with (among others):</p>
<ul>
<li>Dwight Merriman of 10gen (MongoDB)</li>
<li>Damien Katz of Couchio (CouchDB)</li>
<li>Matt Pfeil of <a href="http://www.dbms2.com/2010/07/06/riptano-and-cassandra-adoption/">Riptano</a> (Cassandra)</li>
<li>Todd Lipcon of Cloudera (HBase committer)</li>
<li>Tony Falco of Basho (Riak)</li>
<li>John Busch of Schooner</li>
<li>Ori Herrnstadt of <a href="http://www.dbms2.com/2010/04/03/akiban-highlights/">Akiban</a></li>
</ul>
<p><span id="more-2907"></span>By no means do I have time to do these conversations justice, in terms of giving them the write-ups and/or immediate follow-up that they deserve. Indeed, I&#8217;ll leave for vacation Saturday morning with my 2000-word NoSQL article still unwritten. So I&#8217;ll dump as many observations as I can into one or a few posts now, and play catch-up later as circumstances allow.</p>
<p>In no particular order:</p>
<ul>
<li>A number of NoSQL offerings have had more uptake to date than most of the scale-out SQL offerings have.</li>
<li>&#8220;Document-oriented&#8221; NoSQL projects CouchDB and MongoDB have probably had the most users get into production, but perhaps for pretty small systems.</li>
<li>Cassandra and HBase &#8212; the column-group-architecture guys &#8212; have probably had the most bang-in-lots-of-writes <a href="http://www.dbms2.com/2010/03/13/the-naming-of-the-foo/">HVSP</a> production uptake.*</li>
<li>I didn&#8217;t talk customer count with Schooner, but the decently-stocked <a href="http://www.schoonerinfotech.com/customers">Schooner customer page</a> suggests Schooner may be something of an exception to these generalities.</li>
<li>A lot of these companies are in the low-to-mid-teens of employees.</li>
<li>The SQL-oriented companies, despite having fewer or no customers, often seem to have more money. (One reason I get the impression SQL guys have more money is, frankly, that more of them are talking about engaging <a href="http://www.monash.com/advantage.html">my services</a>.)
<ul>
<li>Schooner cites $20 million in VC.</li>
<li><a href="http://www.dbms2.com/2010/05/12/the-clustrix-story/">Clustrix</a> cites a figure close to that.</li>
<li>Basho cites $10 million, plus <a href="http://www.masshightech.com/stories/2010/08/02/daily35-Basho-rejects-VC-takes-late-friends-and-family-round.html">a new round of $1.5 or $2 or $2.5 million</a>. The new round is at a lowered valuation.</li>
<li>That same site says <a href="http://www.dbms2.com/2009/04/16/introduction-to-tokutek/">Tokutek</a> finally was able to <a href="http://www.masshightech.com/stories/2010/08/16/daily47-Database-software-firm-Tokutek-lands-28M.html">raise some VC</a>. Congrats!</li>
</ul>
</li>
<li>It&#8217;s only a two-company trend, but I was pleased to hear that both 10gen/MongoDB and Akiban were seeing Drupal as a major use case or potential use case. No word on rescuing WordPress from its MySQL implementation, alas, but it seems that a Drupal site typically has 40-200+ tables, while a WordPress one has 10ish.</li>
<li>Another trend I think I&#8217;m seeing is serious object-oriented apps banging things straight into a simple back end. <a href="http://www.dbms2.com/2010/08/22/workday-stan-swete-database-architecture/">Workday</a> is a huge example of that. Akiban hopes to do something similar with Hibernate.</li>
<li>Stability and maturity are still issues for many of these products. E.g., HBase isn&#8217;t even in Release 1.0 yet. Ditto Cassandra, and surely many of the others. Unsurprisingly, <a href="http://blog.mikiobraun.de/2010/08/cassandra-gc-tuning.html">making Cassandra stable is still a challenge</a>.</li>
</ul>
<p><em>*As is common for terms I suggest, the &#8220;HVSP&#8221; name is not getting any traction. What do you think of Marton Trencseni&#8217;s suggestion of <a href="http://www.dbms2.com/2010/03/13/the-naming-of-the-foo/#comment-182138">OLRP, for OnLine Request Processing</a>?</em></p>
<p>One thing that makes following this area interesting is that so many projects are open source, which puts a lot of information in the wild. I hardly have time to read the mailing list for each project; but the people I talk with do, and often they sorta kinda remember something somebody else posted one or several months back. As just one example, the mailing lists are said to confirm:</p>
<ul>
<li>Contrary to rumor, <a href="http://twitter.com/eventcloudpro/status/17872687577">Facebook hasn&#8217;t moved in-box search off of Cassandra</a>.</li>
<li>Apparently, however, it&#8217;s true that <a href="http://www.dbms2.com/2008/07/21/project-cassandra-facebook-open-sourced-quasi-dbms/">Cassandra inventor Facebook</a> has stopped working on Cassandra, and Facebook&#8217;s core Cassandra developers have shifted over to HBase.</li>
</ul>
<p>Also, figuring out usage of open source software can be &#8230; interesting.</p>
<ul>
<li>People who use open source software don&#8217;t have to reveal themselves, as there&#8217;s no purchase transaction to kick things off.</li>
<li>On the other hand, if they&#8217;re serious enough in their use, they often do.
<ul>
<li>There are two main ways to get tech support for open source software &#8212; the community or a company that sells support &#8212; and both ways let the main support-selling company know that one is a user.</li>
<li>Some folks even add themselves to open lists of users, for example these rather long lists for <a href="http://wiki.apache.org/hadoop/Hbase/PoweredBy">HBase</a> and <a href="http://wiki.apache.org/couchdb/CouchDB_in_the_wild">CouchDB</a>.</li>
<li>Or they show up at conferences. For example, <a href="http://twitter.com/spyced/status/21490457839">two</a> <a href="http://twitter.com/spyced/status/21675203015">tweets</a> from Riptano founder Jonathan Ellis suggest at least 30 production Cassandra users were represented at a recent event. That&#8217;s more detail than his colleague Matt Pfeil wanted to give me when we talked. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </li>
</ul>
</li>
</ul>
<p>OK. This post has gotten pretty long, even without me saying anything resembling an overview of any of the seven companies I listed up top, or of their products&#8217; adoption. So I&#8217;ll just publish this now.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/08/26/nosql-hvsp-olrp/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Riptano, and Cassandra adoption</title>
		<link>http://www.dbms2.com/2010/07/06/riptano-and-cassandra-adoption/</link>
		<comments>http://www.dbms2.com/2010/07/06/riptano-and-cassandra-adoption/#comments</comments>
		<pubDate>Tue, 06 Jul 2010 09:11:40 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[DataStax]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Pricing]]></category>
		<category><![CDATA[Specific users]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2480</guid>
		<description><![CDATA[Tonight&#8217;s Cassandra technology post got plenty long enough on its own, so I&#8217;m separating out business and adoption issues here. For starters, known Cassandra users include: Facebook, which has said it has 150 or so Cassandra nodes (but see below) Twitter, which has said it has 45 or so Cassandra nodes Rackspace, which used to [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Tonight&#8217;s <a href="http://www.dbms2.com/2010/07/06/cassandra-technical-overview/">Cassandra technology post</a> got plenty long enough on its own, so I&#8217;m separating out business and adoption issues here. For starters, known Cassandra users include:</p>
<ul>
<li>Facebook, which has said it has 150 or so Cassandra nodes (but see below)</li>
<li>Twitter, which has said it has 45 or so Cassandra nodes</li>
<li>Rackspace, which used to be Jonathan Ellis&#8217; employer, and now is backing Cassandra company Riptano</li>
<li>Digg, which along with Twitter and Rackspace was one of the three major users helping advance the Cassandra project</li>
<li>OpenX, Simple Geo, Digital Reasoning, who Jonathan cited as production users in March</li>
<li>Cloudkick, as noted and linked in my other post</li>
<li>Two customers Riptano named at launch (but I&#8217;ve forgotten who they were*)</li>
</ul>
<p style="margin-bottom: 0in;">Fetlife, Meebo, and others seem to at least have a healthy interest in Cassandra, based on their level of involvement in a forthcoming <a href="http://cassandrasummit2010.eventbrite.com/">Cassandra Summit</a>. That said, the <a href="http://twitter.com/fetlife">@Fetlife</a> tweetstream features numerous yelps of pain, and I don&#8217;t mean the recreational kind.  <span id="more-2480"></span></p>
<p style="margin-bottom: 0in;"><em>*And I can&#8217;t easily find a launch press release, whether on the rather minimalist Riptano website or elsewhere.</em></p>
<p style="margin-bottom: 0in;">Beyond that, when Riptano launched in May, the Riptano guys (mainly Jonathan Ellis) said:</p>
<ul>
<li>They were sure there were dozens of Cassandra user organizations, maybe even &gt;100. But there weren&#8217;t 100s.</li>
<li>Maybe 20-40% of those Cassandra sites were in production. (But I don&#8217;t think I&#8217;d multiply that out to suggest there were, say, 35-50 production Cassandra users.)</li>
<li>4000 people were going daily to the Apache Cassandra site.</li>
<li>There were 250 Cassandra downloads daily.</li>
<li>Lots of startups were using Cassandra.</li>
<li>Lots of other companies were looking at switching over to Cassandra.</li>
<li>Many potential Cassandra users had been waiting for a Cassandra company to be available to support it.</li>
<li>The median number of Cassandra (production?) nodes is probably 8-10. 4 would be a low end figure.</li>
</ul>
<p style="margin-bottom: 0in;">That&#8217;s a lot of adoption for a not-even-Release-1 open source project. Even so, there&#8217;s a feeling going around that Cassandra has lost some momentum the past couple of months. Most notably, <a href="../2008/07/21/project-cassandra-facebook-open-sourced-quasi-dbms/">Facebook, which created Cassandra in the first place,</a> isn&#8217;t using it for new projects. True, I&#8217;m hearing even less evidence that any one of Membase, Voldemort, <a href="http://www.dbms2.com/2010/05/25/voltdb-finally-launches/">VoltDB</a>, <a href="http://www.dbms2.com/2010/04/03/akiban-highlights/">Akiban</a>, <a href="http://www.dbms2.com/2010/05/12/the-clustrix-story/">Clustrix</a>, or Riak – for example – is setting the world on fire than I am for Cassandra. But the viable Cassandra alternatives are piling up. Cassandra isn&#8217;t the only or even primary game in town, and for that matter I haven&#8217;t heard any concise description of a niche in which Cassandra is the unquestioned leader.</p>
<p style="margin-bottom: 0in;"><em>Edit: <a href="http://twitter.com/EventCloudPro/status/17872687577">A/the Facebook project that continues to run on Cassandra</a> is Inbox search.</em></p>
<p style="margin-bottom: 0in;">As for Riptano itself:</p>
<ul>
<li>Riptano launched with two founders and immediately made an offer to a third guy. I don&#8217;t know how many folks they have now, two months later.</li>
<li>Rackspace put some funding into Riptano.</li>
<li>Riptano&#8217;s strategy sounds a lot like <a href="../2010/06/30/cloudera-enterprise-hadoop-evolution/">Cloudera&#8217;s</a>, by which I mean:
<ul>
<li>Riptano&#8217;s business is all services, whether training, consulting, or support.</li>
<li>Riptano&#8217;s intended main business is obviously support.</li>
<li>Notwithstanding the above, Riptano intends to eventually offer proprietary software, bundled with its support services.</li>
<li>The first area of focus for that proprietary software is intended to be management tools.</li>
<li>I wouldn&#8217;t be surprised if, like Cloudera, Riptano tweaks its software focus from “stuff that lets us support you better” to “integration with stuff you pay for.” Those strategies are actually pretty similar.</li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;">Riptano seems to be starting out with support pricing around $1,000-$4,000/server/year, before quantity discounts.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/07/06/riptano-and-cassandra-adoption/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Cassandra technical overview</title>
		<link>http://www.dbms2.com/2010/07/06/cassandra-technical-overview/</link>
		<comments>http://www.dbms2.com/2010/07/06/cassandra-technical-overview/#comments</comments>
		<pubDate>Tue, 06 Jul 2010 09:10:39 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Amazon and its cloud]]></category>
		<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[DataStax]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Parallelization]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2473</guid>
		<description><![CDATA[Back in March, I talked with Jonathan Ellis of Rackspace, who runs the Apache Cassandra project. I started drafting a blog post then, but never put it up. Then Jonathan cofounded Riptano, a company to commercialize Cassandra, and so I talked with him again in May. Well, I&#8217;m finally finding time to clear my Cassandra/Riptano [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Back in March, I talked with Jonathan Ellis of Rackspace, who runs the Apache Cassandra project. I started drafting a blog post then, but never put it up. Then Jonathan cofounded Riptano, a company to commercialize Cassandra, and so I talked with him again in May. Well, I&#8217;m finally finding time to clear my Cassandra/Riptano backlog. I&#8217;ll cover the more technical parts below, and the more business- or usage-oriented ones in <a href="http://www.dbms2.com/2010/07/06/riptano-and-cassandra-adoption/">a companion Cassandra/Riptano post</a>.</p>
<p style="margin-bottom: 0in;">Jonathan&#8217;s core claims for Cassandra include:</p>
<ul>
<li>Cassandra is shared-nothing.</li>
<li>Cassandra has good approaches to replication and partitioning, right out of the box.</li>
<li>In particular, Cassandra is good for use cases that distribute a database around the world and want to access it at “local” latencies. (Indeed, Jonathan asserts that non-local replication is a significant non-big-data Cassandra use case.)</li>
<li>Cassandra&#8217;s scale-out is application-transparent, unlike sharded MySQL&#8217;s.</li>
<li>Cassandra is fast at both appends and range queries, which would be hard to accomplish in a pure key-value store.</li>
</ul>
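<p>To make the last claim concrete, here is a minimal sketch of why keeping keys in sorted order makes range queries cheap, where a plain hash-based key-value store would have to look at every key. It is purely illustrative and is not how Cassandra is implemented; the <code>SortedStore</code> class and the date-string keys are my own assumptions.</p>

```python
import bisect


class SortedStore:
    """Toy store that keeps its keys sorted, so a range scan is one slice."""

    def __init__(self):
        self._keys = []    # keys, always kept in sorted order
        self._values = {}  # key -> value

    def put(self, key, value):
        # Insert a new key at its sorted position; overwrite existing keys in place.
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        return self._values.get(key)

    def range(self, start, end):
        """Return (key, value) pairs from start (inclusive) up to end (exclusive)."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


store = SortedStore()
for day in ["2010-07-03", "2010-07-01", "2010-07-06", "2010-07-02"]:
    store.put(day, "event on " + day)

# Two binary searches find the slice; no full scan of the key space is needed.
first_three_days = store.range("2010-07-01", "2010-07-04")
```

<p>With a pure hash table, the same query would mean testing every key against the bounds, which is the difficulty the claim alludes to.</p>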
<p style="margin-bottom: 0in;">In general, Jonathan positions Cassandra as being best-suited to handle a small number of operations at high volume, throughput, and speed. The rest of what you do, as far as he&#8217;s concerned, may well belong in a more traditional SQL DBMS.  <span id="more-2473"></span></p>
<p style="margin-bottom: 0in;">Further highlights of our talks included, as best I understood them:</p>
<ul>
<li>Cassandra is based in part on both Google&#8217;s <strong>BigTable</strong> paper of 2006 and Amazon&#8217;s <strong>Dynamo</strong> paper of 2007.
<ul>
<li>The core of what Cassandra takes from BigTable is based on <strong>log-structured merge trees,</strong> which actually entered the computer science literature in 1996.</li>
<li>Cassandra&#8217;s approach to horizontal scaling, replication, failover, etc. seems to be based on Dynamo.</li>
</ul>
</li>
<li>There seems to be <strong>a logical concept of “row”</strong> in Cassandra, or it&#8217;s at least meaningful to use the SQL/relational concept of a “row” when talking about Cassandra data. However, Cassandra is closer to being a <strong>column-based data store</strong> than a row-based one. (Not the same thing, but closer.)</li>
<li>Even so, it only takes a single seek to return a whole Cassandra “row”.</li>
<li>Cassandra writes data quite differently from the way a classical OLTP DBMS would.
<ul>
<li><strong>Cassandra writes just the data elements</strong> – i.e., fields – <strong>that are actually being inserted or changed,</strong> not whole rows.</li>
<li>One benefit is that Cassandra data is very <strong>sparse.</strong> NULLs aren&#8217;t stored in any way, and hence in particular take up no space.</li>
<li>Another benefit – and one of the core concepts of Cassandra – is that <strong>you can implicitly assume different schemas for different rows of the same “table.”</strong> In particular, you can add data for columns that you didn&#8217;t envision when you first started storing “rows” of the same “table.”</li>
<li><strong>Writes are collected into sorted “memtables,” which from time to time are sent to disk.</strong> Once data gets to disk, it&#8217;s <strong>immutable,</strong> except for occasional merge/reorganization/garbage collection.
<ul>
<li>Jonathan claims, plausibly, that this makes write throughput very fast (because the I/O is fundamentally sequential in nature.)</li>
<li>The default as to how long data typically stays in memory before it gets persisted to disk is “whichever comes first of {64 MB written, 300k updates, 1 hour}”.</li>
</ul>
</li>
<li>Cassandra has <strong>durability</strong> – guaranteed non-loss of data – assuming fsync is turned on. fsync seems to create a 15% or so overhead.</li>
<li>However, Cassandra has <strong>no concept of a “transaction.”</strong></li>
<li>As one would expect, data can be read even before it has been persisted to disk.</li>
</ul>
</li>
<li>According to Jonathan, Cassandra can do about 14,000 writes or 7,000 reads per second, on a quad-core server.
<ul>
<li>Those figures scale pretty linearly with the number of servers. (There&#8217;s some overhead for network latency.)</li>
<li>Those figures assume a five-column row.</li>
<li>Cassandra&#8217;s write-performance figures are only “mildly sensitive” to the width of the row. E.g., doubling row width only gives a 15-20% throughput hit, due to some fixed per-row overhead. That said, I imagine going 100X in row width would create a major slowdown, although perhaps while measuring width more in bytes than in column count.</li>
<li>Cassandra&#8217;s <a href="http://racklabs.com/%7Ebwilliam/cassandra/04vs05vs06.png">performance</a> has been growing nicely in each point release. Jonathan thinks this general trend will continue.</li>
</ul>
</li>
<li>Jonathan thinks Cassandra is pretty good at keeping your data safe.
<ul>
<li>Each node has a commit log.</li>
<li>When a node goes down, its writes are buffered until it comes back up.</li>
</ul>
</li>
<li>You can run Hadoop MapReduce straight against Cassandra files.</li>
<li>A Cassandra node might hold anything from 10s of gigabytes to multiple terabytes of data. You might want to go with the low end if you want to have lots of cache hits.</li>
<li>Solid-state storage would speed up Cassandra reads, not writes, and is not widely used with Cassandra yet.</li>
<li>Jonathan says Cassandra is really good at handling time series data, by which I suspect he means log files. <a href="https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/">Cloudkick</a> is a user of this capability.</li>
</ul>
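<p>The write path described above (column-level writes, a sorted in-memory buffer, periodic flushes to immutable storage) can be sketched roughly as follows. This is a toy model under my own assumptions, not Cassandra code: the <code>Memtable</code> class, its crude byte accounting, and counting each written column as one update are all hypothetical. Only the three default thresholds come from the conversation.</p>

```python
import time


class Memtable:
    """Toy write buffer: holds column-level writes, flushes them sorted and immutable."""

    def __init__(self, max_bytes=64 * 1024 * 1024, max_updates=300_000,
                 max_age_seconds=3600, clock=time.monotonic):
        # Defaults mirror "64 MB written, 300k updates, 1 hour".
        self.max_bytes = max_bytes
        self.max_updates = max_updates
        self.max_age_seconds = max_age_seconds
        self._clock = clock
        self.flushed = []  # stand-in for on-disk files; each entry is an immutable tuple
        self._reset()

    def _reset(self):
        self._rows = {}    # row key -> {column name: value}
        self._bytes = 0
        self._updates = 0
        self._started = self._clock()

    def write(self, row_key, columns):
        """Store only the columns supplied; absent columns take up no space."""
        self._rows.setdefault(row_key, {}).update(columns)
        # Assumption: size is approximated by the string length of names and values.
        self._bytes += sum(len(str(name)) + len(str(value))
                           for name, value in columns.items())
        self._updates += len(columns)
        if self._should_flush():
            self.flush()

    def _should_flush(self):
        # Whichever threshold is reached first triggers the flush.
        age = self._clock() - self._started
        return (self._bytes >= self.max_bytes
                or self._updates >= self.max_updates
                or age >= self.max_age_seconds)

    def flush(self):
        """Emit buffered rows in sorted key order as one immutable snapshot."""
        snapshot = tuple((key, tuple(sorted(cols.items())))
                         for key, cols in sorted(self._rows.items()))
        self.flushed.append(snapshot)
        self._reset()

    def read(self, row_key):
        """Return buffered columns for a row; this toy does not consult flushed data."""
        return dict(self._rows.get(row_key, {}))


table = Memtable(max_updates=3)
table.write("user:2", {"name": "Bo"})
buffered = table.read("user:2")  # readable before anything has been flushed
table.write("user:1", {"name": "Al", "city": "Austin"})  # third update triggers a flush
```

<p>Note that the two rows carry different sets of columns, and that the flushed snapshot comes out in key order even though the writes arrived out of order. A real implementation would also merge flushed files over time and serve reads from them, which this sketch leaves out.</p>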
<p style="margin-bottom: 0in;">I certainly didn&#8217;t grasp everything about Cassandra replication and partitioning strategies. That wasn&#8217;t the focus of our talks, and anyway I got the impression they are so flexible that there&#8217;s little that can firmly be said about them. But I did get the impressions:</p>
<ul>
<li>You set your consistency rules in the Cassandra API, not on a per-table basis. (I think this means that a lack of administrative tools is supposedly a feature, not a drawback.)</li>
<li>As a practical matter, Cassandra users commonly take one of two approaches to consistency:
<ul>
<li><a href="http://www.dbms2.com/2010/05/01/ryw-read-your-writes-consistency/">RYW consistency</a>, most commonly with N = 3 and R = W = 2.</li>
<li>Geographically dispersed eventual consistency.</li>
</ul>
</li>
<li>Cassandra data is most commonly distributed via consistent hashing, but other options are “pluggable.”</li>
<li>If you add a node, the busiest node automagically decides to ship some data over, reducing its load. Of course, this only works if you get the new node on before the old node is so maxed out it doesn&#8217;t have time to do the shipping.</li>
</ul>
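<p>As a rough illustration of two of those points, the sketch below places keys on a consistent-hash ring and checks the quorum arithmetic behind N = 3, R = W = 2. It is a simplified model under my own assumptions, not Cassandra&#8217;s partitioner: the <code>Ring</code> class, the node names, one token per node, and the use of MD5 are all hypothetical choices.</p>

```python
import bisect
import hashlib


def token(value):
    """Map a string to a position on the ring (MD5 is an arbitrary choice here)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy consistent-hash ring with one token per node."""

    def __init__(self, nodes):
        self._ring = sorted((token(node), node) for node in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, key, n):
        """Return up to n distinct nodes, walking clockwise from the key's token."""
        start = bisect.bisect_right(self._tokens, token(key))
        count = min(n, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


def reads_see_latest_write(n, r, w):
    """Quorum overlap: any R replicas read must include at least one of the W written."""
    return r + w > n


ring = Ring(["node-a", "node-b", "node-c", "node-d"])
owners = ring.replicas("user:42", 3)     # N = 3 replicas for this key
ryw = reads_see_latest_write(3, 2, 2)    # the common N = 3, R = W = 2 setup
weak = reads_see_latest_write(3, 1, 1)   # R = W = 1 gives no overlap guarantee
```

<p>With R = W = 2 out of N = 3, every read set shares at least one replica with every write set, which is what makes read-your-writes possible. With R = W = 1 the sets can miss each other, which is closer to the eventual-consistency approach.</p>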
<p style="margin-bottom: 0in;">When we talked in March, the next release of Cassandra was going to be 0.7. Cassandra 0.7 was going to be a performance/scalability release, for example fixing the flaw that garbage collection read rows into memory one at a time. After that, Cassandra 0.8 was to be a feature release, with one planned feature being more automatic index management and/or materialized-view-like capability, so as to reduce the burden on Cassandra developers of schema management.</p>
<p style="margin-bottom: 0in;"><em><strong>Related links</strong></em></p>
<ul>
<li>My March <a href="../2010/03/12/some-nosql-links/">NoSQL links post</a> included the Google and Amazon papers</li>
<li>The <a href="https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/">March 2, 2010 Cloudkick post</a> also linked above goes into a lot of detail, including what they think is great about Cassandra and what they think is still missing</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/07/06/cassandra-technical-overview/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>

