<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; Hadoop</title>
	<atom:link href="http://www.dbms2.com/category/products-and-vendors/hadoop-products-and-vendors/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Thu, 09 Feb 2012 09:21:51 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>Comments on SAS</title>
		<link>http://www.dbms2.com/2012/02/08/comments-on-sas/</link>
		<comments>http://www.dbms2.com/2012/02/08/comments-on-sas/#comments</comments>
		<pubDate>Wed, 08 Feb 2012 22:51:11 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[KXEN]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[SAS Institute]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5939</guid>
		<description><![CDATA[A reporter interviewed me via IM about how CIOs should view SAS Institute and its products. Naturally, I have edited my comments (lightly) into a blog post. They turned out to be clustered into three groups, as follows: SAS faces a number of challenges, not unlike those faced by other high-priced legacy technology vendors. It [...]]]></description>
			<content:encoded><![CDATA[<p>A reporter interviewed me via IM about how CIOs should view SAS Institute and its products. Naturally, I have edited my comments (lightly) into a blog post. They turned out to be clustered into three groups, as follows:</p>
<ul>
<li>SAS faces a number of challenges, not unlike those faced by other high-priced legacy technology vendors.
<ul>
<li>It is used by organizations that have large budgets to pay for the product and to pay people to be expert on the product&#8217;s intricacies.</li>
<li>SAS has not integrated with scale-out analytic DBMS technologies as well or as quickly as had been hoped, or as earlier marketing suggested was likely.</li>
<li>SAS has not been strong in helping its users do <a href="http://www.dbms2.com/2011/11/28/agile-predictive-analytics-the-easy-parts/">agile predictive analytics</a>.</li>
</ul>
</li>
<li>SAS&#8217; strengths are concentrated in product breadth:
<ul>
<li>Lots of statistical algorithms.</li>
<li>Various vertical products that make the modeling techniques more accessible in specific application domains.</li>
<li><a href="http://www.dbms2.com/2011/04/21/sas-hpa-does-make-sense-after-all/">Various approaches to engineering for scalability</a> &#8212; no one of those has been a table-thumping success to date, but SAS has the resources to keep trying.</li>
<li>Some level of integration with its own business intelligence and text analytics products.</li>
</ul>
</li>
<li>For any particular use case, the burden of proof is on SAS alternatives to show that they have enough pieces in the toolkit to meet the needs.
<ul>
<li>SPSS (now owned by IBM) also has legacy issues.</li>
<li>KXEN is focused on marketing use cases.</li>
<li>Mahout has been one of the less successful Hadoop-related open source projects.</li>
<li>R-based technology is still maturing.</li>
<li>The modeling capabilities (as opposed to just scoring) that are both bundled into RDBMS and well parallelized tend to be pretty limited. Apparent exceptions tend to just be R repackaged.</li>
</ul>
</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/02/08/comments-on-sas/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hadoop-related market categorization</title>
		<link>http://www.dbms2.com/2012/02/07/hadoop-related-market-categorization/</link>
		<comments>http://www.dbms2.com/2012/02/07/hadoop-related-market-categorization/#comments</comments>
		<pubDate>Tue, 07 Feb 2012 06:49:30 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Open source]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5914</guid>
		<description><![CDATA[I wasn&#8217;t the only one to be dubious about Forrester Research&#8217;s Hadoop taxonomy (or lack thereof). GigaOm&#8217;s Derrick Harris was as well, and offered a much superior approach of his own. In Derrick&#8217;s view, there&#8217;s Hadoop, Hadoop distributions, Hadoop management, and Hadoop applications. Taking those out of order, and recalling that no market categorization is [...]]]></description>
			<content:encoded><![CDATA[<p>I wasn&#8217;t the only one to be <a href="http://www.dbms2.com/2012/02/06/comments-on-the-2012-forrester-wave-enterprise-hadoop-solutions/">dubious about Forrester Research&#8217;s Hadoop taxonomy</a> (or lack thereof). GigaOm&#8217;s Derrick Harris was as well, and offered <a href="http://gigaom.com/cloud/what-it-really-means-when-someone-says-hadoop/">a much superior approach of his own</a>. In Derrick&#8217;s view, there&#8217;s Hadoop, Hadoop distributions, Hadoop management, and Hadoop applications. Taking those out of order, and recalling that <a href="http://www.strategicmessaging.com/no-market-categorization-is-ever-precise/2011/03/01/">no market categorization is ever precise</a>:</p>
<ul>
<li>&#8220;Hadoop applications&#8221; is a catch-all category. Since Derrick offered suitable caveats around the label, I&#8217;m fine with what he said.</li>
<li>Hadoop management software commonly comes in the form of suites. Derrick&#8217;s discussion was solid.</li>
<li>Derrick seems to want to define &#8220;Hadoop&#8221; as being whatever is in the relevant Apache projects. Cool. He does seem to wind up on both sides of the &#8220;MapR and DataStax put Hadoop MapReduce on top of something that isn&#8217;t HDFS &#8212; so is that Hadoop or isn&#8217;t it?&#8221; question, but that&#8217;s a tough ambiguity to avoid.</li>
<li>Derrick could have been a little clearer on the subject of Hadoop distributions.</li>
</ul>
<p>Let&#8217;s drill down into that last one. Derrick refers to Hadoop distributions as &#8220;products&#8221; that:</p>
<blockquote><p>package a set of Hadoop projects (MapReduce, Hive, Sqoop, Pig, etc.) in a way that in theory makes them integrate more naturally, and to run both smoothly and securely.</p></blockquote>
<p>While that&#8217;s a reasonable recitation of the idea&#8217;s benefits, I&#8217;d rather say that a &#8220;distribution&#8221; of open source software comprises:<span id="more-5914"></span></p>
<ul>
<li>Open source software, in selected versions.</li>
<li>(Possibly) additional code.</li>
<li>(Likely) documentation.</li>
<li>(Possibly) legal assurances such as intellectual property indemnification.</li>
</ul>
<p>In the case of Hadoop:</p>
<ul>
<li> The version selection is a relatively big deal. There are a lot of Hadoop sub-projects. There&#8217;s been some splitting and forking and recombination. Testing a specific set of point releases for integration and bugs is a non-trivial user benefit.</li>
<li>The additional code is generally focused on installation or whatever, because the rest is bundled into separately identified management software. Even so, because of the large number of moving parts, this is a good thing to have.</li>
<li>What&#8217;s more, in the case of Cloudera, using a particular distribution (theirs) is a prerequisite to getting the most widely adopted Hadoop management software (also theirs), which in turn is required if you want the industry&#8217;s most widely adopted Hadoop support (ditto). Similar things are apt to be true of rival distributions.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/02/07/hadoop-related-market-categorization/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Comments on the 2012 Forrester Wave: Enterprise Hadoop Solutions</title>
		<link>http://www.dbms2.com/2012/02/06/comments-on-the-2012-forrester-wave-enterprise-hadoop-solutions/</link>
		<comments>http://www.dbms2.com/2012/02/06/comments-on-the-2012-forrester-wave-enterprise-hadoop-solutions/#comments</comments>
		<pubDate>Mon, 06 Feb 2012 05:16:20 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EMC]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Hortonworks]]></category>
		<category><![CDATA[MapR]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Pentaho]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5886</guid>
		<description><![CDATA[Forrester has released its Q1 2012 Forrester Wave: Enterprise Hadoop Solutions. (Googling turns up a direct link, but in case that doesn&#8217;t prove stable, here also is a registration-required link from IBM&#8217;s Conor O&#8217;Mahony.) My comments include: The Forrester Wave&#8217;s relative vendor rankings are meaningless, in that the document compares apples, peaches, almonds, and peanuts. [...]]]></description>
			<content:encoded><![CDATA[<p>Forrester has released its Q1 2012 Forrester Wave: Enterprise Hadoop Solutions. (Googling turns up a <a href="http://www.forrester.com/rb/go?docid=60755&amp;oid=1-K07LCA&amp;action=5">direct link</a>, but in case that doesn&#8217;t prove stable, here also is <a href="http://database-diary.com/2012/02/02/get-a-free-copy-of-the-forrester-wave-for-enterprise-hadoop-solutions/">a registration-required link from IBM&#8217;s Conor O&#8217;Mahony</a>.) My comments include:</p>
<ul>
<li>The Forrester Wave&#8217;s <strong>relative vendor rankings are meaningless,</strong> in that the document compares apples, peaches, almonds, and peanuts. Apparently, it covers any vendor that includes a distribution of Apache Hadoop MapReduce in something it offers, and that offered at least two (not necessarily full production) references for same.</li>
<li>The Forrester Wave for &#8220;enterprise Hadoop&#8221; contradicts itself on the subject of Hortonworks.
<ul>
<li>The Forrester Wave for &#8220;enterprise Hadoop&#8221; is correct when it says <strong>&#8220;Hortonworks &#8230; has Hadoop training and professional services offerings that are still embryonic.&#8221;</strong></li>
<li>Peculiarly, the Forrester Wave for &#8220;enterprise Hadoop&#8221; also says &#8220;Hortonworks offers an impressive Hadoop professional services portfolio&#8221;. Hortonworks will likely win one or more nice partnership deals with vendors in adjacent fields, but even so its professional services capabilities are &#8230; well, a good word might be &#8220;embryonic&#8221;.</li>
</ul>
</li>
<li><a href="http://www.dbms2.com/2011/02/11/comments-on-the-2011-forrester-wave-for-enterprise-data-warehouse-platforms/">Forrester Waves always seem to have weird implicit definitions of &#8220;data warehousing&#8221;</a>. This one is no exception.</li>
<li>Forrester gave top marks in &#8220;Functionality&#8221; to 11 of 13 &#8220;enterprise Hadoop&#8221; vendors. This seems odd.</li>
<li>I don&#8217;t know why MapR, which doesn&#8217;t like HDFS (Hadoop Distributed File System), got top marks in &#8220;Subproject integration&#8221;.</li>
<li>Forrester gave top marks in &#8220;Storage&#8221; to Datameer. It also gave higher marks to MapR than to EMC Greenplum, even though EMC Greenplum&#8217;s technology is a superset of MapR&#8217;s. Very strange. <em>(Edit: Actually, as per a comment below, there is some uncertainty about the EMC/MapR relationship.)</em></li>
<li>Forrester gave higher marks in &#8220;Acceleration and optimization&#8221; to Hortonworks than to Cloudera and IBM, and higher marks yet to Pentaho. Very odd.</li>
<li>I&#8217;m not sure what Forrester is calling a &#8220;Distributed EDW file store connector&#8221;, but it sounds like something that Cloudera has provided via partnership to a number of analytic DBMS vendors.</li>
<li>Forrester&#8217;s &#8220;Strategy&#8221; rankings seem to correlate to a metric of &#8220;We&#8217;re a large enough vendor to go in N directions at once&#8221;, for various values of N.</li>
<li>Forrester is correct to rank Cloudera&#8217;s &#8220;Adoption&#8221; as being stronger than EMC/Greenplum&#8217;s or MapR&#8217;s. But Hortonworks&#8217; strong mark for &#8220;Adoption&#8221; baffles me.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/02/06/comments-on-the-2012-forrester-wave-enterprise-hadoop-solutions/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Notes on the Oracle Big Data Appliance</title>
		<link>http://www.dbms2.com/2012/01/10/notes-on-the-oracle-big-data-appliance/</link>
		<comments>http://www.dbms2.com/2012/01/10/notes-on-the-oracle-big-data-appliance/#comments</comments>
		<pubDate>Wed, 11 Jan 2012 01:32:39 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Pricing]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5809</guid>
		<description><![CDATA[Oracle announced its Big Data Appliance. Specs may be found in the Oracle Big Data Appliance press release. Beyond that: The most important software on the Oracle Big Data Appliance is a full set of Cloudera Enterprise code. Oracle will do Tier 1 Cloudera/Hadoop support, while Cloudera handles Tiers 2 and 3. The key spec [...]]]></description>
			<content:encoded><![CDATA[<p>Oracle announced its Big Data Appliance. Specs may be found in <a href="http://www.oracle.com/us/corporate/press/1453721">the Oracle Big Data Appliance press release</a>. Beyond that:</p>
<ul>
<li>The most important software on the Oracle Big Data Appliance is a full set of <a href="http://www.dbms2.com/2012/01/10/a-couple-of-links-explaining-cloudera-manager/">Cloudera Enterprise</a> code. Oracle will do Tier 1 Cloudera/Hadoop support, while Cloudera handles Tiers 2 and 3.</li>
<li>The key spec ratios are 1 core/4 GB RAM/3 TB raw disk. That&#8217;s reasonably in line with <a href="http://www.dbms2.com/2011/06/04/hardware-for-hadoop/">Cloudera figures I published in June, 2011</a>.</li>
<li>This is really Oracle&#8217;s <a href="http://www.dbms2.com/2012/01/08/big-data-terminology-and-positioning/">multi-structured big data appliance</a>. Oracle&#8217;s relational big data appliance is Exadata, which has been out for years and has comparable capacity to Oracle&#8217;s new &#8220;Big Data Appliance.&#8221; (<a href="http://www.eweek.com/c/a/IT-Infrastructure/Oracle-Launches-ClouderaPowered-Big-Data-Appliance-172364/">Chris Preimesberger</a> made a similar point.)</li>
<li>The Oracle Big Data Appliance list price is $450,000 for 18 12-core servers, plus $54,000/year maintenance.
<ul>
<li>That&#8217;s around $25,000 per server (and associated storage).</li>
<li>That&#8217;s also around $2,000/core.</li>
<li>That&#8217;s also around $500/TB of spinning disk, before <a href="http://www.dbms2.com/2011/07/06/hadoop-hardware-and-compression/">compression</a>.</li>
<li>None of those per-unit figures sounds ridiculous &#8230;</li>
<li>&#8230; but because of Oracle&#8217;s appliance configuration there&#8217;s indeed a hefty minimum initial purchase.</li>
</ul>
</li>
</ul>
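<p>As a rough check on those per-unit figures, here is a minimal Python sketch of the arithmetic. The list price, server count and core count come from the bullets above; the raw-disk total is an assumption, derived by applying the quoted 3 TB-per-core ratio to all 216 cores. Under that assumption the per-server and per-core numbers match, while the per-TB number lands nearer $700 than $500, so the per-TB figure is best read as an order-of-magnitude estimate.</p>

```python
# Back-of-envelope check of the per-unit list prices quoted above.
# The price, server count and core count are taken from the post.
# TB_PER_CORE applies the quoted "1 core / 3 TB raw disk" ratio to every
# core; that is an assumption about how the raw disk should be counted.

LIST_PRICE_USD = 450_000
SERVERS = 18
CORES_PER_SERVER = 12
TB_PER_CORE = 3


def per_unit_prices(list_price, servers, cores_per_server, tb_per_core):
    """Return list price per server, per core, and per raw TB."""
    cores = servers * cores_per_server
    raw_tb = cores * tb_per_core
    return {
        "per_server": list_price / servers,
        "per_core": list_price / cores,
        "per_raw_tb": list_price / raw_tb,
    }


prices = per_unit_prices(LIST_PRICE_USD, SERVERS, CORES_PER_SERVER, TB_PER_CORE)
for name, value in prices.items():
    print(f"{name}: ${value:,.0f}")
```

<p>Changing <code>TB_PER_CORE</code> or the price (for example, adding a year of maintenance) shows how sensitive the per-TB figure is to what gets counted.</p>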
<p><a href="http://www.zdnet.com/blog/btl/oracle-rolls-out-big-data-play-with-aggressive-price-cloudera/66529"><span id="more-5809"></span>Peter Goldmacher</a> argues that, because of size and price point, the Oracle Big Data Appliance is targeted at high-end deployments rather than starter/test/development set-ups. To a first approximation, that makes sense, in that:</p>
<ul>
<li>The Oracle Big Data Appliance is in the petabyte range for data capacity, and &#8230;</li>
<li>&#8230; <a href="http://www.dbms2.com/2011/07/06/petabyte-hadoop-clusters/">the number of petabyte-scale Hadoop deployments is in the low tens</a>, and &#8230;</li>
<li>&#8230; many of those aren&#8217;t at Oracle shops anyway.</li>
</ul>
<p>Surely the Oracle Big Data Appliance isn&#8217;t designed for the 4-8 node play-with-Hadoop crowd.</p>
<p>On the other hand, if you&#8217;re at a big, committed Oracle shop, and you want to do your first serious Hadoop deployment, why not go with the Oracle Big Data Appliance? You probably could save money with an alternative approach &#8212; but if your employers are committed to Oracle, saving money is surely not their greatest concern. Overpay by a bit; make your management happy with the Oracle logo; get Hadoop on your resume; prosper. That seems like a winning plan all the way around.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/01/10/notes-on-the-oracle-big-data-appliance/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>A couple of links explaining Cloudera Manager</title>
		<link>http://www.dbms2.com/2012/01/10/a-couple-of-links-explaining-cloudera-manager/</link>
		<comments>http://www.dbms2.com/2012/01/10/a-couple-of-links-explaining-cloudera-manager/#comments</comments>
		<pubDate>Tue, 10 Jan 2012 22:23:22 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Oracle]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5798</guid>
		<description><![CDATA[Predictably, I wasn&#8217;t pre-briefed on the details of Oracle&#8217;s Big Data Appliance announcement today, and an inquiry to partner Cloudera doesn&#8217;t happen to have been immediately answered.* But anyhow, it&#8217;s clear from coverage by Larry Dignan and Derrick Harris that Oracle&#8217;s Big Data Appliance includes: Some version of Cloudera Manager (I&#8217;m guessing more or less [...]]]></description>
			<content:encoded><![CDATA[<p>Predictably, I wasn&#8217;t pre-briefed on the details of Oracle&#8217;s Big Data Appliance announcement today, and an inquiry to partner Cloudera doesn&#8217;t happen to have been immediately answered.* But anyhow, it&#8217;s clear from coverage by <a href="http://www.zdnet.com/blog/btl/oracle-rolls-out-big-data-play-with-aggressive-price-cloudera/66529">Larry Dignan</a> and <a href="http://gigaom.com/cloud/cloudera-brings-the-hadoop-to-oracles-big-data-appliance/">Derrick Harris</a> that Oracle&#8217;s Big Data Appliance includes:</p>
<ul>
<li>Some version of Cloudera Manager (I&#8217;m guessing more or less the best one).*</li>
<li>Some version of Apache Hadoop (I&#8217;m guessing the same distribution that Cloudera prefers to use).*</li>
<li>Some kind of support.</li>
</ul>
<p>In other words, it&#8217;s a lot like getting Cloudera Enterprise,* plus some hardware, plus some other stuff.</p>
<p><em>*Edit: About 2 minutes after I posted this, I got email from Cloudera CEO Mike Olson. Yes, the Oracle Big Data Appliance bundles Cloudera Enterprise.</em></p>
<p>That raises a question that recurs anyway: <strong>What exactly is Cloudera Manager?</strong> <span id="more-5798"></span>When asked, I&#8217;ve always tended to mumble something like: <strong>Um, it&#8217;s management stuff.</strong> There&#8217;s an overview on <a href="http://www.cloudera.com/products-services/tools/">the Cloudera Manager product page</a>, but it doesn&#8217;t really say much, even if you click on the Data Sheet link. More helpful, I think, is <a href="http://www.cloudera.com/blog/2011/12/cloudera-manager-3-7-released/">a December post on Cloudera&#8217;s busy blog</a>. Technically, the post is about the new features in the Cloudera Manager 3.7 point release, but more generally it helps to explain what Cloudera Manager does, in areas such as (and these bullet points are all direct quotes):</p>
<ul>
<li> Automated Hadoop Deployment</li>
<li> Centralized Management</li>
<li> Configuration Management</li>
<li> Service Monitoring</li>
<li> Log Search</li>
<li> Events and Alerts</li>
<li> Configuration versioning and Audit trails</li>
<li> Activity Monitoring</li>
<li> Operational Reports</li>
</ul>
<p>Taken together, <strong>those two Cloudera links do a pretty good job of explaining Cloudera Manager, and illustrating why a Hadoop user would want to have either Cloudera Manager or a similar competitive offering.</strong></p>
<p><em>Edit: The day after I originally made this post, Cloudera put up another post <a href="http://www.cloudera.com/blog/2012/01/cloudera-manager-thank-you-customers/">directly explaining what Cloudera Manager is about</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/01/10/a-couple-of-links-explaining-cloudera-manager/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Big data terminology and positioning</title>
		<link>http://www.dbms2.com/2012/01/08/big-data-terminology-and-positioning/</link>
		<comments>http://www.dbms2.com/2012/01/08/big-data-terminology-and-positioning/#comments</comments>
		<pubDate>Mon, 09 Jan 2012 01:35:57 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5768</guid>
		<description><![CDATA[Recently, I observed that Big Data terminology is seriously broken. It is reasonable to reduce the subject to two quasi-dimensions: Bigness &#8212; Volume, Velocity, size Structure &#8212; Variety, Variability, Complexity given that High-velocity &#8220;big data&#8221; problems are usually high-volume as well.* Variety, variability, and complexity all relate to the simply-structured/poly-structured distinction. But the conflation should [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, I observed that <a href="../../../../../2011/09/11/big-data-has-jumped-the-shark/">Big Data terminology is seriously broken</a>. It is reasonable to reduce the subject to two quasi-dimensions:</p>
<ul>
<li><strong>Bigness</strong> &#8212; Volume, Velocity, Size</li>
<li><strong>Structure</strong> &#8212; Variety, Variability, Complexity</li>
</ul>
<p>given that</p>
<ul>
<li>High-velocity &#8220;big data&#8221; problems are usually high-volume as well.*</li>
<li>Variety, variability, and complexity all relate to the <a href="http://www.dbms2.com/2011/05/17/poly-structured-database/">simply-structured/poly-structured</a> distinction.</li>
</ul>
<p>But the conflation should stop there.</p>
<p><em>*Low-volume/high-velocity problems are commonly referred to as <a href="http://www.dbms2.com/2011/08/25/renaming-cep-or-not/">&#8220;event processing&#8221; and/or &#8220;streaming&#8221;</a>.</em></p>
<p>When people claim that bigness and structure are the same issue, they oversimplify into mush. So I think we need four pieces of terminology, reflective of a 2&#215;2 matrix of possibilities. For want of better alternatives, my suggestions are:</p>
<ul>
<li><strong>Relational big data</strong> is data of high volume that fits well into a relational DBMS.</li>
<li><strong>Multi-structured big data</strong> is data of high volume that doesn&#8217;t fit well into a relational DBMS. <em>Alternative: Poly-structured big data.</em></li>
<li><strong>Conventional relational data</strong> is data of not-so-high volume that fits well into a relational DBMS. <em>Alternatives: Ordinary/normal/smaller relational data.</em></li>
<li><strong>Smaller poly-structured data</strong> is data for which <a href="http://www.dbms2.com/2011/07/31/dynamic-fixed-schema-databases/">dynamic schema</a> capabilities are important, but which doesn&#8217;t rise to &#8220;big data&#8221; volume.</li>
</ul>
<p><span id="more-5768"></span>Notes on all this include:</p>
<ul>
<li>&#8220;Relational big data&#8221; is commonly what you need a scalable analytic relational DBMS for. But there are non-analytic use cases as well.</li>
<li>The paradigmatic example of &#8220;multi-structured big data&#8221; is log files. Thus, multi-structured big data is commonly what you need a <a href="http://www.dbms2.com/2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a> for.</li>
<li>One might want to equate non-analytic relational big data technology to &#8220;NewSQL&#8221;. However, I&#8217;m struggling to think of a database size range in which the entire NewSQL industry can match Oracle&#8217;s market share alone.</li>
<li>One might want to equate non-analytic multi-structured big data technology to &#8220;NoSQL&#8221;. However:
<ul>
<li>&#8220;NoSQL&#8221; is also used to encompass not-so-big-data use cases, such as prototyping in MongoDB.</li>
<li><a href="http://www.dbms2.com/2011/10/02/defining-nosql/">&#8220;NoSQL&#8221; has non-ACID/low(er)-data-integrity connotations</a> that aren&#8217;t appropriate for all non-relational systems.</li>
</ul>
</li>
<li>Up to a point, you can analyze relational big data in a conventional relational DBMS, but an analytic RDBMS will usually win on TCO (Total Cost of Ownership). In particular, reasonable thresholds for moving an analytic database off Oracle might be:
<ul>
<li>1-2 terabytes if you&#8217;ve never bought anything past Oracle Standard Edition.</li>
<li>5-10 terabytes if you&#8217;re already paying for Oracle Enterprise Edition.</li>
<li>A lot higher than that if you actually find Oracle Exadata to be cost-effective.</li>
</ul>
</li>
<li>Depending on how big one acknowledges as &#8220;big&#8221;, the market share leader in &#8220;big bit bucket&#8221; use cases is either Splunk or Hadoop.</li>
<li>If we look at multi-structured big data management overall, MarkLogic joins the list of market share contenders, as do various NoSQL alternatives.</li>
<li>It is wrong to say that the large web companies invented &#8220;big data&#8221; technology. But it is more reasonable to say they invented much of &#8220;multi-structured big data&#8221; management. In particular (and this is just a partial list), Google, Amazon, Yahoo, Facebook, et al. can reasonably be credited with Hadoop, Cassandra, HBase and various predecessors to same.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/01/08/big-data-terminology-and-positioning/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Some big-vendor execution questions, and why they matter</title>
		<link>http://www.dbms2.com/2011/11/21/big-vendor-execution-analytics/</link>
		<comments>http://www.dbms2.com/2011/11/21/big-vendor-execution-analytics/#comments</comments>
		<pubDate>Mon, 21 Nov 2011 11:01:20 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Cognos]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[HP and Neoview]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[In-memory DBMS]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Memory-centric data management]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[SAP AG]]></category>
		<category><![CDATA[Vertica Systems]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5704</guid>
		<description><![CDATA[When I drafted a list of key analytics-sector issues in honor of look-ahead season, the first item was &#8220;execution of various big vendors&#8217; ambitious initiatives&#8221;.  By &#8220;execute&#8221; I mean mainly: &#8220;Deliver products that really meet customers&#8217; desires and needs.&#8221; &#8220;Successfully convince them that you&#8217;re doing so &#8230;&#8221; &#8220;&#8230; at an attractive overall cost.&#8221; Vendors mentioned [...]]]></description>
			<content:encoded><![CDATA[<p>When I drafted a list of key analytics-sector issues in honor of <a href="http://www.dbms2.com/2011/11/21/analytic-trends-in-2012-qa/">look-ahead season</a>, the first item was &#8220;execution of various big vendors&#8217; ambitious initiatives&#8221;. By &#8220;execute&#8221; I mean mainly:</p>
<ul>
<li>&#8220;Deliver products that really meet customers&#8217; desires and needs.&#8221;</li>
<li> &#8220;Successfully convince them that you&#8217;re doing so &#8230;&#8221;</li>
<li>&#8220;&#8230; at an attractive overall cost.&#8221;</li>
</ul>
<p>Vendors mentioned here are Oracle, SAP, HP, and IBM. Anybody smaller got left out due to the length of this post. Among the bigger omissions were:</p>
<ul>
<li>salesforce.com (multiple subjects).</li>
<li><a href="http://www.dbms2.com/2011/04/21/sas-hpa-does-make-sense-after-all/">SAS HPA</a>.</li>
<li><a href="http://www.dbms2.com/2011/08/21/hadoop-evolution/">The evolution of Hadoop</a>.</li>
</ul>
<p><span id="more-5704"></span><strong>A (lingering) issue for SAP and Oracle alike</strong></p>
<p>As I noted in January of this year, <a href="http://www.dbms2.com/2011/01/03/the-six-useful-things-you-can-do-with-analytic-technology/">integration of business intelligence into operational apps is making very slow progress</a>. Even so, it&#8217;s a huge part of the apparent strategy at SAP and Oracle alike, as well it should be. Much of the benefit from automating routine desk work has already happened. The areas ripest for exploitation are the ones where analytics are part of the equation.</p>
<p>Given the lack of tangible progress, why do I think this is a genuine area of Oracle and SAP emphasis? Three reasons of many are:</p>
<ul>
<li>Why else did SAP buy Business Objects?</li>
<li>If they&#8217;re not trying to <a href="http://www.dbms2.com/2011/03/30/short-request-and-analytic-processing/">integrate operational apps and analytics</a>, why else does SAP&#8217;s emphasis on HANA make sense?</li>
<li>Without business intelligence in the picture, how does Oracle&#8217;s integrated-stack story promise any direct user benefits?*</li>
</ul>
<p><em>*As opposed to IT concerns &#8212; integration, administration, TCO (Total Cost of Ownership), etc.</em></p>
<p>After so many years of disappointment, I&#8217;m not going to forecast 2012 as a pivotal year for <strong>the integration of business intelligence into operational applications.</strong> But if one of SAP or Oracle ever does get a significant BI/operational app integration advantage over the other, it could be a major competitive advantage in those application market segments that are still up for grabs. It also is an opportunity for both vendors to gain BI market share in their respective application customer bases.</p>
<p><strong>A more urgent issue for SAP</strong></p>
<p>SAP has put huge amounts of credibility on the line for HANA, the integration of two different and not particularly mature in-memory database technologies. So far, it is difficult to find evidence that HANA is robust enough for widespread adoption. Whether or not SAP can fix that is a huge open question, which could have significant impact on the course of several technology areas: applications, business intelligence, in-memory DBMS, and maybe even hardware.</p>
<p>Based on current information, which is admittedly partial, I&#8217;m a short-term pessimist on HANA. Longer-term, I&#8217;m on record as saying that <a href="../../../../../2011/05/23/databases-ram/">traditional databases will eventually wind up in RAM</a>. SAP will surely get that technology right some day, whether or not the way it does so has anything to do with present-day HANA code.</p>
<p><strong>Four more issues for Oracle </strong></p>
<p>Oracle&#8217;s ambitions are near-endless, and so also therefore is its list of execution challenges. Four in the analytics area that I find particularly interesting are:</p>
<ul>
<li><strong>True hybrid columnar DBMS.</strong> <a href="../../../../../2011/09/22/teradata-columnar-compression/">I was guessing that Oracle, like Teradata, would announce true hybrid columnar the week of Oracle OpenWorld</a>. I was wrong. But if Oracle can&#8217;t bring out true hybrid columnar DBMS functionality relatively soon, Exadata will lose credibility as a competitor to more specialized analytic DBMS.</li>
<li><strong>Oracle Exalytics.</strong> With Exalytics in the mix, Oracle&#8217;s technology stack has HANA-like potential. But will Exalytics even ship in 2012? (I think so.) Will it be good for much in the first release? (I&#8217;m skeptical.)</li>
<li><strong>Oracle&#8217;s Big Data Appliance</strong>. I&#8217;m skeptical both about <a href="../../../../../2011/10/20/more-notes-on-oracle-nosql/">Oracle&#8217;s NoSQL product</a> &#8212; <a href="http://www.infoworld.com/d/data-explosion/first-look-oracle-nosql-database-179107">a favorable InfoWorld review</a> notwithstanding &#8212; and <a href="../../../../../2011/09/23/hadoop-appliances/">Hadoop appliances</a>. But if I&#8217;m wrong, and Oracle can successfully embrace/extend the new non-relational paradigms, then it really might regain control over the evolution of data management.</li>
<li><strong><a href="../../../../../2011/10/18/oracle-is-buying-endeca/">Oracle&#8217;s Endeca acquisition</a></strong> &#8212; will Oracle prove me wrong and integrate Endeca effectively into its overall analytic product line? If it does, we might finally see effective text (and eventually speech) navigation of enterprise software. (But as with all Oracle issues cited here, this is something that probably won&#8217;t amount to much in 2012 even if it does later go well.)</li>
</ul>
<p><strong>Three issues for IBM</strong></p>
<p>Like Oracle, IBM is a huge company with many ambitions and hence many execution challenges. The biggest of those is surely: <strong>How effective can IBM be at selling outside its existing customer base?</strong> I don&#8217;t hear as much competitively about IBM DataStage, IBM SPSS or now IBM Netezza as I did when their vendors were independent companies. Even Cognos may not be much of an exception to the rule, although it has its own large customer base outside of IBM&#8217;s traditional one. (To lesser extents, the same is of course true of Netezza and numerous other IBM acquisitions.)</p>
<p>Another general issue for IBM is <strong>substantively integrating its various product lines,</strong> at least to the extent that makes sense. DB2/Netezza integration sounds good, but even that is a matter more of product marketing (the admirable part of that discipline) than of actual technology. Other integrations (e.g. Cognos/DB2 in various bundles) have tended toward the dubious side.*</p>
<p><em>*I&#8217;m still waiting for IBM to get back to me with examples of how Cognos/DB2 joint tuning amounts to anything. It&#8217;s been more than a year, so I&#8217;m glad I didn&#8217;t hold my breath.</em></p>
<p>In a somewhat narrower vein, I wonder: <strong><a href="../../../../../2011/11/10/cep-streaming-catchup/">Will IBM be able to gain traction for InfoSphere Streams</a>? </strong>And if so, when and where will the traction be?</p>
<p><strong>Will HP screw up Vertica?</strong></p>
<p>Vertica has a very attractive product offering. It&#8217;s perhaps <a href="../../../../../2011/06/20/columnar-dbms-vendor-customer-metrics/">the most scalable analytic DBMS outside of Teradata</a>, running on the hardware of your reasonable choice. It&#8217;s also the one I recommend most often to clients in the 1-50 terabyte range.</p>
<p>So far HP doesn&#8217;t seem to have done much to leadfoot Vertica. (About all I&#8217;ve heard from competitors is that Vertica seems to have faded somewhat in the financial services market, and there could be multiple explanations if that is indeed true.) But if HP Vertica does somehow manage to botch things, opportunities will open up for a range of columnar analytic DBMS competitors.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/21/big-vendor-execution-analytics/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Hadapt is moving forward</title>
		<link>http://www.dbms2.com/2011/11/08/hadapt-is-moving-forward/</link>
		<comments>http://www.dbms2.com/2011/11/08/hadapt-is-moving-forward/#comments</comments>
		<pubDate>Tue, 08 Nov 2011 05:40:10 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Hadapt]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[Theory and architecture]]></category>
		<category><![CDATA[Workload management]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5609</guid>
		<description><![CDATA[I&#8217;ve talked with my clients at Hadapt a couple of times recently. News highlights include: The Hadapt 1.0 product is going &#8220;Early Access&#8221; today. General availability of Hadapt 1.0 is targeted for an officially unspecified time frame, but it&#8217;s soon. Hadapt raised a nice round of venture capital. Hadapt added Sharmila Mulligan to the board. [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve talked with my clients at Hadapt a couple of times recently. News highlights include:</p>
<ul>
<li>The Hadapt 1.0 product is going &#8220;Early Access&#8221; today.</li>
<li>General availability of Hadapt 1.0 is targeted for an officially unspecified time frame, but it&#8217;s soon.</li>
<li>Hadapt raised a nice round of venture capital.</li>
<li>Hadapt added Sharmila Mulligan to the board.</li>
<li>Dave Kellogg is in the picture too, albeit not as involved as Sharmila.</li>
<li>Hadapt has moved the company to Cambridge, which is preferable to Yale environs for obvious reasons. (First location = space they&#8217;re borrowing from their investors at Bessemer.)</li>
<li>Headcount is in the low teens, with a target of doubling fast.</li>
</ul>
<p>The <a href="../../../../../2011/07/06/hadapt-update/">Hadapt product story</a> hasn&#8217;t changed significantly from what it was before. Specific points I can add include:   <span id="more-5609"></span></p>
<ul>
<li>With one exception to date, Hadapt beta customers have used PostgreSQL as the underlying DBMS, rather than some faster columnar system.</li>
<li>Sure, you want to process data on the nodes where it resides on the cluster. But if the data is replicated 3X or so, that gives you good flexibility to be adaptive by deciding which of the three copies you&#8217;ll operate against.</li>
<li>In Hadapt Version 1.0, scheduling and workload management are pretty much Hadoop&#8217;s. However &#8230;</li>
<li>&#8230; an improvement in scheduling is being actively researched.</li>
<li>In general, Hadapt&#8217;s design philosophy for executing SQL is to use MapReduce to get data to the proper nodes, while using the underlying DBMS for node-specific operations such as:
<ul>
<li>Initial retrieval from disk.</li>
<li>Joins and aggregations on data residing at (or visiting) a specific node.</li>
</ul>
</li>
</ul>
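<p>To make the replication point concrete, here is a minimal Python sketch of my own. It is not Hadapt code, and the node names, block names, and greedy least-loaded rule are all assumptions for illustration. The idea it shows: when every block lives on three nodes, a scheduler can keep processing local while still choosing the copy on the least busy node.</p>

```python
from collections import Counter

def pick_replica(replica_nodes, load):
    """Return the least-loaded node among those holding a copy of the block.

    Ties are broken by node name so the result is deterministic.
    """
    return min(replica_nodes, key=lambda node: (load[node], node))

def assign_blocks(block_replicas):
    """Greedily assign each block to one of its replica nodes, balancing load.

    block_replicas maps a (hypothetical) block id to the nodes holding a copy.
    """
    load = Counter()      # number of blocks already assigned to each node
    assignment = {}
    for block, nodes in sorted(block_replicas.items()):
        node = pick_replica(nodes, load)
        assignment[block] = node
        load[node] += 1
    return assignment

# Hypothetical cluster: four blocks, each replicated on three of four nodes.
blocks = {
    "b1": ["n1", "n2", "n3"],
    "b2": ["n1", "n2", "n4"],
    "b3": ["n1", "n3", "n4"],
    "b4": ["n2", "n3", "n4"],
}
assignment = assign_blocks(blocks)
print(assignment)
```

<p>Every block is still processed on a node that stores it, but no node ends up with more than one block of work. A real scheduler would of course weigh more than a simple block count.</p>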
<p>A very busy Daniel Abadi also took the time to walk me through how Hadapt does joins. More precisely, what we discussed about joins includes some of the last features being added to Hadapt 1.0; many of the pieces are still missing from early-access Hadapt 1.0, and some may even slip out of the Hadapt 1.0 GA version. As Dan tells it, there are five kinds of joins in Hadapt:</p>
<ul>
<li><strong>Co-partitioned join.</strong> Both tables being joined happen to be partitioned on the join key. Happy happy joy joy. The tables are joined locally on each node, with the results aggregated via MapReduce.</li>
<li><strong>Directed join.</strong> One of the tables being joined happens to be partitioned on the join key. MapReduce distributes the other table along the join key, joins happen locally, and MapReduce does the rest.</li>
<li><strong>Broadcast join.</strong> One of the tables is broadcast in its entirety to every node. Joins then happen locally, and MapReduce does the rest.</li>
<li><strong>Split semijoin. </strong>One of the tables is projected to the join key and a row ID, and then distributed via MapReduce. Joins then happen locally. Later on, the joined rows are completed with the help of a second projection on the first table. MapReduce does the rest.</li>
<li><strong>Distributed/parallel hash join. </strong>Sometimes, Hadapt indeed joins just as Hadoop/Hive would.</li>
</ul>
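<p>The list above can be read as a decision procedure. Here is a hedged Python sketch of how such a choice might look. It is my own illustration rather than Hadapt&#8217;s planner: the <code>Table</code> class, the row-count threshold for broadcasting, and the order of the checks are all assumptions, and I leave out the split semijoin because I don&#8217;t know when Hadapt prefers it.</p>

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Table:
    name: str
    rows: int
    partition_key: Optional[str]  # None if the table is not partitioned on a key

# Hypothetical cutoff: tables at or below this size are cheap to copy everywhere.
BROADCAST_ROW_LIMIT = 100_000

def choose_join(left: Table, right: Table, join_key: str) -> str:
    """Pick a join strategy from partitioning and size (illustrative only)."""
    left_ok = left.partition_key == join_key
    right_ok = right.partition_key == join_key
    if left_ok and right_ok:
        # Both sides already sit on the right nodes: join locally.
        return "co-partitioned"
    if left_ok or right_ok:
        # Redistribute only the side that is not partitioned on the join key.
        return "directed"
    if BROADCAST_ROW_LIMIT >= min(left.rows, right.rows):
        # Ship the small table in its entirety to every node.
        return "broadcast"
    # Neither side is usable in place: repartition both, as Hadoop/Hive would.
    return "distributed hash"

orders = Table("orders", 50_000_000, "customer_id")
customers = Table("customers", 2_000_000, "customer_id")
regions = Table("regions", 200, None)
clicks = Table("clicks", 900_000_000, "session_id")

print(choose_join(orders, customers, "customer_id"))
print(choose_join(orders, clicks, "customer_id"))
print(choose_join(clicks, regions, "region_id"))
```

<p>The point of ordering the checks this way is that each earlier strategy moves less data over the network than the ones after it.</p>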
<p>Highlights of Hadapt&#8217;s performance story include:</p>
<ul>
<li>Dan contends that using a DBMS rather than HDFS (Hadoop Distributed File System) for I/O always gives a performance advantage.</li>
<li>DBMS local-node join performance can be presumed to be superior as well.</li>
<li>Of course, Dan also thinks that using a columnar DBMS would extend Hadapt&#8217;s performance advantage further, but most of the specifics of what Hadapt has told me about why they don&#8217;t routinely use a columnar DBMS yet are NDA.</li>
<li>Even beta Hadapt/PostgreSQL outperforms Hadoop/Hive by almost 10X at Hadapt&#8217;s relatively small number of beta customer sites.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/08/hadapt-is-moving-forward/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>MarkLogic&#8217;s Hadoop connector</title>
		<link>http://www.dbms2.com/2011/11/03/marklogic-hadoop-connector/</link>
		<comments>http://www.dbms2.com/2011/11/03/marklogic-hadoop-connector/#comments</comments>
		<pubDate>Fri, 04 Nov 2011 00:58:06 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Clustering]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Workload management]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5585</guid>
		<description><![CDATA[It&#8217;s time to circle back to a subject I skipped when I otherwise wrote about MarkLogic 5: MarkLogic&#8217;s new Hadoop connector. Most of what&#8217;s confusing about the MarkLogic Hadoop Connector lies in two pairs of options it presents you: Hadoop can talk XQuery to MarkLogic. But alternatively, Hadoop can use a long-established simple(r) Java API [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s time to circle back to a subject I skipped when I otherwise wrote about <a href="http://www.dbms2.com/2011/11/01/marklogic-version-5/">MarkLogic 5</a>: MarkLogic&#8217;s new Hadoop connector.</p>
<p>Most of what&#8217;s confusing about the MarkLogic Hadoop Connector lies in two pairs of options it presents you:</p>
<ul>
<li>Hadoop can talk XQuery to MarkLogic. But alternatively, Hadoop can use a long-established simple(r) Java API for streaming documents into or out of a MarkLogic database.</li>
<li>Hadoop can make requests to MarkLogic in MarkLogic&#8217;s normal mode of operation, namely to address any node in the MarkLogic cluster, which then serves as a &#8220;head&#8221; node for the duration of that particular request. But alternatively, Hadoop can use a long-standing MarkLogic option to circumvent the whole DBMS cluster and only talk to one specific MarkLogic node.</li>
</ul>
<p>Otherwise, the whole thing is just what you would think:</p>
<ul>
<li>Hadoop can read from and write to MarkLogic, in parallel at both ends.</li>
<li>If Hadoop is just writing to MarkLogic, there&#8217;s a good chance the process is properly called &#8220;ETL.&#8221;</li>
<li>If Hadoop is reading a lot from MarkLogic, there&#8217;s a good chance the process is properly called &#8220;batch analytics.&#8221;</li>
</ul>
<p>MarkLogic said that it wrote this Hadoop connector itself.</p>
<p><span id="more-5585"></span>When I realized MarkLogic was claiming the ability to seamlessly integrate short-request and batch analytic processing, I asked about workload management. I gathered that:</p>
<ul>
<li>MarkLogic believes that MarkLogic 5 does a great job of granular workload monitoring.</li>
<li>However, MarkLogic doesn&#8217;t have a strong workload management administrative interface. Rather, you may have to do workload management programmatically.</li>
</ul>
<p>Overall, I think the MarkLogic Hadoop connector could prove pretty useful. The first question I ask somebody who wants to process relational data in Hadoop is &#8220;Why not just an analytic RDBMS?&#8221; But the natural use cases for MarkLogic are often ones in which you might as well do your analytics in Hadoop, including a 4 billion Word/PDF/image document insurance-industry example I recently encountered, and for which <a href="../../../../../2011/10/10/text-data-management-part-2-general-and-short-request/">I favor MarkLogic over both MongoDB and straight Hadoop</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/03/marklogic-hadoop-connector/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>The cool aspects of Odiago WibiData</title>
		<link>http://www.dbms2.com/2011/11/02/5576/</link>
		<comments>http://www.dbms2.com/2011/11/02/5576/#comments</comments>
		<pubDate>Wed, 02 Nov 2011 15:05:01 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Odiago and WibiData]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5576</guid>
		<description><![CDATA[Christophe Bisciglia and Aaron Kimball have a new company. It&#8217;s called Odiago, and is one of my gratifyingly more numerous tiny clients. Odiago&#8217;s product line is called WibiData, after the justly popular We Be Sushi restaurants. We&#8217;ve agreed on a split exclusive de-stealthing launch. You can read about the company/founder/investor stuff on TechCrunch. But this [...]]]></description>
			<content:encoded><![CDATA[<p>Christophe Bisciglia and Aaron Kimball have a new company.</p>
<ul>
<li>It&#8217;s called Odiago, and is one of my gratifyingly more numerous tiny clients.</li>
<li>Odiago&#8217;s product line is called <a href="http://www.wibidata.com/">WibiData</a>, after the justly popular We Be Sushi restaurants.</li>
<li>We&#8217;ve agreed on a split exclusive de-stealthing launch. You can read about the company/founder/investor stuff on <a href="http://techcrunch.com/2011/11/02/cloudera-founder-debuts-big-data-management-and-analysis-platform-wibidata-with-backing-from-eric-schmidt/">TechCrunch</a>. But this is the place for &#8212; well, for the tech crunch.</li>
</ul>
<p><strong>WibiData is designed for management of, <a href="../../../../../2011/03/03/investigative-analytics/">investigative analytics</a> on, and operational analytics on consumer internet data,</strong> the main examples of which are web site traffic and personalization and their analogues for games and/or mobile devices. The core WibiData technology, built on HBase and Hadoop,* is <strong>a data management and analytic execution layer.</strong> That&#8217;s where the secret sauce resides. Also included are:</p>
<ul>
<li>REST APIs for interactive access.</li>
<li>Import/export tools, including JDBC access.</li>
<li>Management tools.</li>
<li>Analytic libraries &#8212; data mining, predictive analytics, machine learning, and so on.</li>
</ul>
<p>The whole thing is in beta, with about three (paying) beta customers.</p>
<p><em>*And Avro and so on.</em></p>
<p>The core ideas of WibiData include:</p>
<ul>
<li><strong>ALL data pertaining to a single user </strong>(or mobile device) <strong>is kept in a single, </strong>possibly very long,<strong> HBase row.</strong></li>
<li>There are two primary operators in WibiData, <strong>Produce </strong>and <strong>Gather.</strong>
<ul>
<li>Produce operates on single rows. It can operate on one row at HBase speed (milliseconds) if you need to inform an interactive user response. Or it can operate on the whole database in batch via Hadoop MapReduce.</li>
<li>It is reasonable to think of Produce as mainly doing two things. One is the aforementioned serving of data out of WibiData into interactive applications. The other is scoring, classifying, recommending, etc. on individual users (i.e. rows), in line with an analytic model.</li>
<li>Gather typically operates on all your rows at once, and emits suitable input for a MapReduce Reduce step. It is reasonable to think of Gather as being a key cog in the training of analytic models.</li>
</ul>
</li>
<li>HBase <strong>schema management is done at the WibiData system level,</strong> not directly in applications. There&#8217;s a WibiData HBase data dictionary, powered by a set of system tables, that specifies cell data types/record types and, in effect, primitive schemas.</li>
</ul>
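<p>For readers who think better in code, here is a rough Python sketch of the Produce/Gather split as I understand it. This is emphatically not the WibiData API, which Odiago has not shown me in this form. The in-memory dict standing in for an HBase table, the function names, and the column names are all assumptions of mine.</p>

```python
from collections import defaultdict

# Hypothetical stand-in for an HBase table: one (possibly long) row per user.
table = {
    "user1": {"country": "US", "page_views": [3, 5, 2]},
    "user2": {"country": "US", "page_views": [1]},
    "user3": {"country": "DE", "page_views": [4, 4]},
}

def produce(row):
    """Produce-style step: look at ONE row and derive new cells for that row."""
    return {"total_views": sum(row["page_views"])}

def gather(row):
    """Gather-style step: emit (key, value) pairs suitable for a reduce step."""
    yield row["country"], sum(row["page_views"])

def reduce_sum(pairs):
    """Plain reduce: add up the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Interactive use: run Produce against a single row on demand.
single = produce(table["user1"])

# Batch use: run the same Produce over every row and write the result back.
for row in table.values():
    row.update(produce(row))

# Gather over all rows, then reduce, e.g. as input to model training.
totals = reduce_sum(pair for row in table.values() for pair in gather(row))

print(single)
print(totals)
```

<p>Note that the same <code>produce</code> function serves both the one-row interactive path and the whole-table batch path, which is the property the bullets above describe.</p>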
<p><span id="more-5576"></span>WibiData-enhanced HBase differs from relational DBMS in most of the ways you would imagine, both good and bad. In particular:</p>
<ul>
<li>Depending on how you look at it, WibiData-enhanced HBase either has no DML (Data Manipulation Language) at all, or else has one that&#8217;s a lot less rich than SQL.</li>
<li>WibiData-enhanced HBase schemas are much more <a href="../../../../../2011/07/31/dynamic-fixed-schema-databases/">dynamic</a> than SQL schemas.</li>
<li>WibiData-enhanced HBase schemas can have nested or recursive data structures, such as array-valued cells.</li>
</ul>
<p>To expand on each of those points in turn:</p>
<p>WibiData&#8217;s underlying one-giant-table philosophy notwithstanding, there are times you manage multiple tables with it. (For example, you ingest data into WibiData however you can, and then run transformations &#8212; typically batch &#8212; until the data is in the preferred structure.) While WibiData does have ways to simulate joins, foreign keys, and so on, there&#8217;s nothing resembling referential integrity or foreign key constraints.</p>
<p><strong>WibiData takes single-table schema flexibility to an extreme.</strong> Not only can different rows in the same table have different associated columns &#8212; something that relational systems can in effect also do via NULL values &#8212; but schemas can even change over the life of a column. If you have an array-valued cell storing the results of a marketing campaign, and you start recording more data partway through the campaign, then different rows in the table will, in the same column, hold different-sized arrays.</p>
<p>That nesting can also get pretty serious; <strong>where you’d have a single value in a relational table, you might have the equivalent of a whole relational table (or at least selection/view) in WibiData-enhanced HBase. </strong>For example, if a user visits the same web page ten times, and each time 50 attributes are recorded (including a timestamp), all 500 data – to use the word “data” in its original “plural of <em>datum</em>” sense – would likely be stored in the same WibiData cell.</p>
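<p>Here is a small Python sketch of that ten-visits example, using plain dicts and lists as a stand-in for a row and its cells. The column name, attribute names, and timestamps are made up for illustration, and none of this reflects how WibiData actually serializes a cell.</p>

```python
def record_visit(row, column, visit):
    """Append one visit record to the list held in a single cell of the row."""
    row.setdefault(column, []).append(visit)

# One user's row. The cell for a page holds a whole list of visit records,
# i.e. roughly what would be a small table in a relational design.
row = {}
for i in range(10):
    visit = {"timestamp": 1320000000 + i}              # hypothetical timestamps
    visit.update({f"attr{j}": j for j in range(49)})   # 49 more attributes
    record_visit(row, "page:/home", visit)

cell = row["page:/home"]
visits = len(cell)                                # visits stored in the one cell
attrs_per_visit = len(cell[0])                    # 49 attributes plus the timestamp
data_count = sum(len(visit) for visit in cell)    # individual values in the cell

print(visits, attrs_per_visit, data_count)
```

<p>Ten visits times 50 attributes gives the 500 values mentioned above, all reachable through a single row key and column.</p>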
<p>That’s about all Odiago is disclosing about WibiData right now. Christophe will also be talking at Hadoop World next week, and presumably can be hit up with any burning questions then.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/02/5576/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
	</channel>
</rss>

