<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; Database diversity</title>
	<atom:link href="http://www.dbms2.com/category/database-theory-practice/database-diversity/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Thu, 09 Feb 2012 09:21:51 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>Soundbites: the Facebook/MySQL/NoSQL/VoltDB/Stonebraker flap, continued</title>
		<link>http://www.dbms2.com/2011/07/15/facebook-mysql-nosql-voltdb-stonebraker/</link>
		<comments>http://www.dbms2.com/2011/07/15/facebook-mysql-nosql-voltdb-stonebraker/#comments</comments>
		<pubDate>Fri, 15 Jul 2011 08:27:18 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Akiban]]></category>
		<category><![CDATA[Cache]]></category>
		<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Clustrix]]></category>
		<category><![CDATA[Couchbase]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[In-memory DBMS]]></category>
		<category><![CDATA[Michael Stonebraker]]></category>
		<category><![CDATA[MongoDB and 10gen]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[ScaleBase]]></category>
		<category><![CDATA[ScaleDB]]></category>
		<category><![CDATA[Schooner Information Technology]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[Tokutek]]></category>
		<category><![CDATA[VoltDB and H-Store]]></category>
		<category><![CDATA[dbShards and CodeFutures]]></category>
		<category><![CDATA[memcached]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4977</guid>
		<description><![CDATA[As a follow-up to the latest Stonebraker kerfuffle, Derrick Harris asked me a bunch of smart followup questions. My responses and afterthoughts include: Facebook et al. are in effect Software as a Service (SaaS) vendors, not enterprise technology users. In particular: They have the technical chops to rewrite their code as  needed. Unlike packaged software [...]]]></description>
			<content:encoded><![CDATA[<p>As a follow-up to the latest <a href="http://www.dbms2.com/2011/07/14/an-odd-claim-attributed-to-mike-stonebraker/">Stonebraker kerfuffle</a>, Derrick Harris asked me a bunch of smart follow-up questions. My responses and afterthoughts include:</p>
<ul>
<li>Facebook et al. are in effect Software as a Service (SaaS) vendors, not enterprise technology users. In particular:
<ul>
<li>They have the technical chops to rewrite their code as needed.</li>
<li>Unlike packaged software vendors, they&#8217;re not answerable to anybody for keeping legacy code alive after a rewrite. That makes migration a lot easier.</li>
<li>If they want to write different parts of their system on different technical underpinnings, nobody can stop them. For example &#8230;</li>
<li>&#8230; <a href="http://www.dbms2.com/2008/07/21/project-cassandra-facebook-open-sourced-quasi-dbms/">Facebook created Cassandra</a>, and is now heavily committed to HBase.</li>
</ul>
</li>
<li>It makes little sense to talk of Facebook&#8217;s use of &#8220;MySQL.&#8221; Better to talk of Facebook&#8217;s use of &#8220;MySQL + memcached + non-transparent sharding.&#8221; That said:
<ul>
<li>It&#8217;s hard to see why somebody today would use MySQL + memcached + non-transparent sharding for a new project. At least one of <a href="http://www.dbms2.com/2011/02/08/couchbase-membase-couchone-couchdb/">Couchbase</a> or <a href="http://www.dbms2.com/2011/02/24/transparent-sharding/">transparently-sharded</a> MySQL is very likely a superior alternative. Other alternatives might be better yet.</li>
<li>As noted above in the example of Facebook, the many major web businesses that are using MySQL + memcached + non-transparent sharding for existing projects can be presumed able to migrate away from that stack as the need arises.</li>
</ul>
</li>
</ul>
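<p>To make &#8220;non-transparent sharding&#8221; concrete, here is a minimal sketch in Python of the general pattern; it is not Facebook&#8217;s actual code. Plain dictionaries stand in for memcached and for the MySQL shards, and the names <code>shard_for</code>, <code>read_user</code>, and <code>write_user</code> are hypothetical. The point is that the application itself picks the shard and maintains the cache, which is the work a transparently-sharded system would take off its hands.</p>

```python
import zlib

NUM_SHARDS = 4

# Stand-ins (assumptions for illustration): one dict per MySQL shard,
# and one dict playing the role of memcached.
shards = [dict() for _ in range(NUM_SHARDS)]
cache = {}


def shard_for(user_id):
    """Pick a shard in application code; the DBMS knows nothing about it."""
    return zlib.crc32(str(user_id).encode("utf-8")) % NUM_SHARDS


def write_user(user_id, record):
    """Write to the owning shard, then invalidate the cached copy."""
    shards[shard_for(user_id)][user_id] = record
    cache.pop(user_id, None)


def read_user(user_id):
    """Cache-aside read: try the cache, fall back to the owning shard."""
    if user_id in cache:
        return cache[user_id]
    record = shards[shard_for(user_id)].get(user_id)
    if record is not None:
        cache[user_id] = record
    return record


write_user(42, {"name": "Ada"})
first = read_user(42)   # cache miss: served from the shard, then cached
second = read_user(42)  # cache hit: served from the cache
```

<p>Changing <code>NUM_SHARDS</code> in a scheme like this would send existing keys to different shards, so resharding becomes an application-level migration. That is one reason the stack is hard to recommend for a new project.</p>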
<p>Continuing with that discussion of DBMS alternatives:</p>
<ul>
<li>If you just want to write to the memcached API anyway, why not go with Couchbase?</li>
<li>If you want to go relational, why not go with MySQL? There are many alternatives for scaling or accelerating MySQL (dbShards, Schooner, Akiban, Tokutek, ScaleBase, ScaleDB, Clustrix, and Xeround come to mind quickly), so there&#8217;s a great chance that one or more will fit your use case. (And if you don&#8217;t get the choice of MySQL flavor right the first time, porting to another one shouldn&#8217;t be all THAT awful.)</li>
<li>If you really, really want to go in-memory, and don&#8217;t mind writing Java stored procedures, and don&#8217;t need to do the kinds of joins it isn&#8217;t good at, but do need to do the kinds of joins it is, VoltDB could indeed be a good alternative.</li>
</ul>
<p>And while we&#8217;re at it &#8212; going <strong>schema-free</strong> often makes a whole lot of sense. I need to write much more about that point, but for now let&#8217;s just say that I look favorably on the Big Four schema-free/NoSQL options of MongoDB, Couchbase, HBase, and Cassandra.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/15/facebook-mysql-nosql-voltdb-stonebraker/feed/</wfw:commentRss>
		<slash:comments>19</slash:comments>
		</item>
		<item>
		<title>Eight kinds of analytic database (Part 2)</title>
		<link>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/</link>
		<comments>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/#comments</comments>
		<pubDate>Tue, 05 Jul 2011 08:18:18 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Archiving and information preservation]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Buying processes]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Complex event processing (CEP)]]></category>
		<category><![CDATA[Data mart outsourcing]]></category>
		<category><![CDATA[Data types]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MOLAP]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Rainstor]]></category>
		<category><![CDATA[SAND Technology]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[SenSage]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4867</guid>
		<description><![CDATA[In Part 1 of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I&#8217;ll cover four more kinds of analytic database &#8212; even newer, for the most part, with a use case/product short list [...]]]></description>
			<content:encoded><![CDATA[<p>In <a href="http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/">Part 1</a> of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I&#8217;ll cover four more kinds of analytic database &#8212; even newer, for the most part, with a use case/product short list match that is even less clear.  <span id="more-4867"></span></p>
<p><strong><em>Bit bucket</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included: </em>Logs, other technical/external</li>
<li><em>Likely use styles:</em> Staging/ETL, investigative</li>
<li><em>Canonical example: </em>Log files in a Hadoop cluster</li>
<li><em>Stresses:</em> TCO, scale-out, transform/big-query performance, ETL functionality</li>
</ul>
<p>With the explosion of <a href="../../../../../2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a> has come the need for a place to put it all, sometimes called the <a href="../../../../../2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a>. This is like the investigative data mart for big databases, but more <a href="../../../../../2011/05/17/poly-structured-database/">poly-structured</a>. In some cases it is focused on data staging and transformation; but it can also be used for analysis in place.</p>
<p>The list of candidate technologies to run your bit bucket starts with Hadoop and Splunk.</p>
<p><strong><em>Archival data store</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included: </em>Operational, CDR (call detail record), security log</li>
<li><em>Likely use styles:</em> Archival, reporting (for compliance), possibly also investigative</li>
<li><em>Examples:</em> Any long-term detailed historical store</li>
<li><em>Stresses: </em>TCO, compression, scale-out, performance (if multi-use)</li>
</ul>
<p>Analytic DBMS vendors have been insulting each other with the claim &#8220;that&#8217;s just an archival data store,&#8221; dating back at least to the first time Greenplum was deployed on an underpowered Sun Thumper system. Perhaps only <a href="../../../../../2010/06/11/rainstor-update/">Rainstor</a> truly embraces the archival positioning, and I&#8217;ve become pretty dubious about their technical claims and their company alike.</p>
<p>Still, there&#8217;s a legitimate need for data stores, especially relational analytic DBMS, that:</p>
<ul>
<li>Store data cheaply, with high rates of compression.</li>
<li>Have decent performance if you do want to query the data.</li>
<li>May have archiving/compliance-specific features as well.</li>
</ul>
<p>Along with Rainstor, SAND and SenSage have at least partially targeted that use case. In addition, appliance vendors such as Teradata and Netezza try to have an archive-oriented product version in their lineups.</p>
<p><strong><em>Outsourced data mart</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All</li>
<li><em>Likely use styles:</em> Traditional BI, investigative analytics, staging/ETL</li>
<li><em>Examples:</em> Advertising tracking, SaaS CRM</li>
<li><em>Stresses:</em> Performance, TCO, reliability, concurrency</li>
</ul>
<p>Much of what happens in analytic database management can also be outsourced. Some applications that run via SaaS (Software as a Service) are analytic. I&#8217;ve had three different clients whose main business is picking marketing targets in various vertical segments; others who wanted to add analytics to what were historically OLTP applications; and others yet who just offered online business intelligence. Also, if your fundamental business is gathering data and reselling it to a variety of user organizations, that&#8217;s an analytic data management challenge. The possibilities expand from there.</p>
<p>Data outsourcers are in the IT business, and so their IT development is &#8212; hopefully! &#8212; more serious and less politically encumbered than at many conventional enterprises. Thus, legacy systems and master data management issues are commonly less prevalent, or at least more aggressively disposed of. The same, up to a point, goes for vendor politics.* <a href="../../../../../2011/06/26/what-to-think-about-before-you-make-a-technology-decision/">Multitenancy</a> is commonly an issue, as is running in the cloud.</p>
<p><em>*Even so, there&#8217;s often That Guy who doesn&#8217;t want to migrate away from Oracle, no matter what.</em></p>
<p>Vertica gets the nod in a number of these cases; it&#8217;s cloud-friendly, and often the problem is naturally columnar. Other columnar products can be good choices too, with added brownie points for Infobright if the shop is MySQL-oriented anyway. Running Netezza or other appliances makes sense mainly if you&#8217;re pretty sure you want to keep operating your own data centers, but some data outsourcers are just fine with that assumption.</p>
<p><strong><em>Operational analytic(s) server</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> Customer-centric, log, financial trade</li>
<li><em>Likely use styles:</em> Advanced operational analytics</li>
<li><em>Examples:</em>
<ul>
<li>Lower latency: Web or call-center personalization, anti-fraud</li>
<li>Higher latency: Customer profiling, Basel 3 risk analysis</li>
</ul>
</li>
<li><em>Stresses:</em> Performance, reliability, analytic functionality, perhaps concurrency</li>
</ul>
<p>Even with eight different choices, I need a &#8220;catch-all&#8221; category; this is it.</p>
<p>Suppose you want to do reasonably sophisticated analytics, then use the results in operations. This is the classical challenge in <a href="../../../../../2011/03/30/short-request-and-analytic-processing/">integrating short-request and analytic processing</a>. There are multiple ways to tackle it, embodying different trade-offs in cost, convenience, or analytic accuracy. If the platform on which you want to run your investigative analytics also has the reliability and concurrency appropriate for mission-critical operations, you&#8217;re set. Otherwise, you may want to pipe <a href="../../../../../2010/11/29/data-that-is-derived-augmented-enhanced-adjusted-or-cooked/">derived data</a> into a more &#8220;industrial-strength&#8221; DBMS, ideally the one that runs your operational apps anyway.</p>
<p>Another option is to integrate a limited amount of analytics immediately into your short-request processing system. For example, as bad as they are at the kinds of queries that require joins, NoSQL systems are often fast at simple aggregations. As MapReduce/NoSQL integrations mature, that option may not require pumping the data anywhere else for deeper analytics; even if it does, at least you&#8217;re starting out with the data in a convenient bit bucket.</p>
<p>Streaming/CEP-centric architectures could come into play as well. And it goes on from there. The possibilities in this last category are just too varied to generalize about.</p>
<p><em>So did I get them all? Or are there yet other analytic data management use cases that don&#8217;t fit into my eight categories?</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Eight kinds of analytic database (Part 1)</title>
		<link>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/</link>
		<comments>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/#comments</comments>
		<pubDate>Tue, 05 Jul 2011 08:17:44 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Benchmarks and POCs]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Buying processes]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[Infobright]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MOLAP]]></category>
		<category><![CDATA[Microsoft and SQL*Server]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[ParAccel]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Pricing]]></category>
		<category><![CDATA[QlikTech and QlikView]]></category>
		<category><![CDATA[SAND Technology]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Sybase]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[Workload management]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4868</guid>
		<description><![CDATA[Analytic data management technology has blossomed, leading to many questions along the lines of &#8220;So which products should I use for which category of problem?&#8221; The old EDW/data mart dichotomy is hopelessly outdated for that purpose, and adding a third category for &#8220;big data&#8221; is little help. Let&#8217;s try eight categories instead. While no categorization [...]]]></description>
			<content:encoded><![CDATA[<p>Analytic data management technology has blossomed, leading to many questions along the lines of &#8220;So which products should I use for which category of problem?&#8221; The old EDW/data mart dichotomy is hopelessly outdated for that purpose, and adding a third category for &#8220;big data&#8221; is little help.</p>
<p>Let&#8217;s try eight categories instead. While <a href="http://www.strategicmessaging.com/no-market-categorization-is-ever-precise/2011/03/01/">no categorization is ever perfect</a>, these each have at least some degree of technical homogeneity. Figuring out which types of analytic database you have or need &#8212; and in most cases you&#8217;ll need several &#8212; is a great early step in your analytic technology planning.  <span id="more-4868"></span></p>
<p><strong><em>Enterprise data warehouse</em></strong> (Full or partial)</p>
<ul>
<li><em>Kinds of data likely to be included:</em> All, but especially operational</li>
<li><em>Likely use styles:</em> All</li>
<li><em>Canonical example:</em> Central EDW for a big enterprise</li>
<li><em>Stresses:</em> Concurrency, reliability, workload management</li>
</ul>
<p>The enterprise data warehouse (EDW) ideal says that you copy all your data into one place, and drive all decision-making from there. <a href="../../../../../2011/06/21/its-official-the-grand-central-edw-will-never-happen/">Full EDWs are pipedreams</a>. Still, a partial EDW makes sense for most large enterprises, and many indeed already have one. The first product lines to consider for classical EDWs are Teradata, DB2, Exadata, and maybe Microsoft SQL Server, especially if you&#8217;re going to stress concurrency and/or operational use cases.</p>
<p><strong><em>Traditional data mart</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All</li>
<li><em>Likely use styles:</em> Business intelligence, budgeting/consolidation, investigative</li>
<li><em>Examples:</em> Reporting servers, planning/consolidation servers, anything MOLAP, etc.</li>
<li><em>Stresses:</em> Performance, concurrency, TCO</li>
</ul>
<p>Whether or not you have something like an enterprise data warehouse, it&#8217;s common to have lighter-weight data marts as well. A traditional data mart might drive reports and dashboards. Or it might be specialized for budgeting, planning, and/or consolidation.  Some <a href="../../../../../2011/03/03/investigative-analytics/">investigative analytics</a> may be in the mix as well.</p>
<p>Any DBMS that can support an EDW can also support a data mart, but it may not be the most cost-effective way to do so. Columnar DBMS might have more attractive performance and TCO (Total Cost of Ownership); the same goes for Netezza. Some of them &#8212; e.g. Sybase IQ and <a href="../../../../../2011/06/20/vertica-release-5/">Vertica</a> &#8212; have excellent track records in concurrent usage as well. <a href="../../../../../2011/05/29/when-to-use-relational-database-management-system/">Ted Codd</a> pushed what amounts to MOLAP (Multidimensional OnLine Analytic Processing) systems for these use cases. But relational DBMS commonly do a better job, which is one reason most major MOLAP products have wound up at RDBMS companies.</p>
<p><strong><em>Investigative data mart &#8212; agile</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All, especially customer-centric</li>
<li><em>Likely use styles</em>: Investigative</li>
<li><em>Canonical example:</em> A few analysts getting a few TB to examine</li>
<li><em>Stresses:</em> Ease of setup/load, ease of admin, price/performance</li>
</ul>
<p>Besides the traditional data mart, there are at least two other kinds. Both are focused on investigative analytics, but they&#8217;re differentiated by database size.</p>
<p>If you have just a few analysts,* looking at no more than a few terabytes of data (perhaps even just some gigabytes) &#8212; and if that data is &#8220;single-subject&#8221; and fairly homogeneous &#8212; your watchwords should be &#8220;cheap&#8221;, &#8220;easy&#8221;, and &#8220;fast&#8221;. You don&#8217;t need to invest in much hardware, in expensive software, or in much administrative effort (the analysts can be their own DBAs), nor should you endure much set-up time. Just grab a product, grab some data, and start running queries (or extracts into the statistical tool of your choice).</p>
<p><em>*If you have dozens or even hundreds of analysts hitting the same database, you&#8217;re probably back to the more concurrency-oriented scenarios outlined above.</em></p>
<p>Infobright is often cost-effective among columnar analytic DBMS. Other vendors might cut you a price break as well. If you have multiple terabytes of data, don&#8217;t rule out Netezza&#8217;s lowest-end products (even if they&#8217;d really rather sell you something bigger). Or, if you&#8217;re in the sub-terabyte range, maybe you can get by with an in-memory BI tool such as QlikView, and not do anything special on the DBMS side at all.</p>
<p><strong><em>Investigative data mart &#8212; big</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All, especially customer-centric, logs, financial trade, scientific</li>
<li><em>Likely use styles</em>: Investigative</li>
<li><em>Canonical example:</em> Single-subject 20 TB &#8211; 20 PB relational database</li>
<li><em>Stresses:</em> Performance, scale-out, analytic functionality</li>
</ul>
<p>But if you&#8217;re looking at tens of terabytes of relational data, or even more, you really do have a &#8220;big data&#8221; problem. Performance and scalability are major challenges, usually best addressed by MPP (Massively Parallel Processing) systems, such as Netezza, Vertica, Aster Data, ParAccel, Teradata, or Greenplum. Performance POCs (Proofs Of Concept) are a big part of the buying process. Vendor price negotiations are crucial too.</p>
<p><em>Actually, in the low tens of terabytes you might be able to get away with a shared-disk system that has excellent compression &#8212; e.g., columnar products like Sybase IQ, Infobright, or SAND, rather than just Vertica and ParAccel.</em></p>
<p>Assuming you have affordable, scalable query performance, the competitive differentiator can switch to additional analytic functionality. Aster, Netezza, ParAccel, Vertica, and Greenplum either offer full <a href="../../../../../2011/02/24/analytic-platforms/">analytic platforms</a>, or seem to be on the path to doing so. Teradata, which now owns Aster Data, offers substantial built-in analytic capability in its traditional products as well, and the same goes for Sybase IQ.</p>
<p><em>Continued in <a href="http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/">Part 2</a>, where we cover some of the more difficult use cases.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>When it&#8217;s still best to use a relational DBMS</title>
		<link>http://www.dbms2.com/2011/05/29/when-to-use-relational-database-management-system/</link>
		<comments>http://www.dbms2.com/2011/05/29/when-to-use-relational-database-management-system/#comments</comments>
		<pubDate>Sun, 29 May 2011 19:56:37 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[MOLAP]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Object]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4569</guid>
		<description><![CDATA[There are plenty of viable alternatives to relational database management systems. For short-request processing, both document stores and fully object-oriented DBMS can make sense. Text search engines have an important role to play. E. F. &#8220;Ted&#8221; Codd himself once suggested that relational DBMS weren&#8217;t best for analytics.* Analysis of machine-generated log data doesn&#8217;t always have [...]]]></description>
			<content:encoded><![CDATA[<p>There are plenty of viable alternatives to relational database management systems. For <a href="../../../../../2011/03/30/short-request-and-analytic-processing/">short-request processing</a>, both <a href="../../../../../2011/02/07/notes-on-document-oriented-nosql/">document stores</a> and <a href="../../../../../2011/05/21/object-oriented-database-management-systems-oodbms/">fully object-oriented DBMS</a> can make sense. Text search engines have an important role to play. E. F. &#8220;Ted&#8221; Codd himself once suggested that <a href="http://www.minet.uni-jena.de/dbis/lehre/ss2005/sem_dwh/lit/Cod93.pdf">relational DBMS weren&#8217;t best for analytics</a>.* Analysis of <a href="../../../../../2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated</a> log data doesn&#8217;t always have a naturally relational aspect. And I could go on with more examples yet.</p>
<p><em>*Actually, he didn&#8217;t admit that what he was advocating was a different kind of DBMS, namely a MOLAP one &#8212; but he was. And he was wrong anyway about the necessity for MOLAP. But let&#8217;s overlook those details. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </em></p>
<p>Nonetheless, relational DBMS dominate the market. As I see it, the reasons for relational dominance cluster into four areas (which of course overlap):</p>
<ul>
<li><strong>Data re-use.</strong> Ted Codd&#8217;s famed original paper referred to <a href="http://www.seas.upenn.edu/%7Ezives/03f/cis550/codd.pdf">shared data banks</a> for a reason.</li>
<li>The benefits of <strong>normalization,</strong> which include:
<ul>
<li>You only have to do the programming work of writing something once &#8230;</li>
<li>&#8230; and you don&#8217;t have to do the programming work of keeping multiple versions of the information consistent.</li>
<li>You only have to do the processing work of writing something once.</li>
<li>You only have to buy storage to hold each fact once.</li>
</ul>
</li>
<li><strong>Separation of concerns.</strong>
<ul>
<li>Different people can worry about programming and &#8220;database stuff.&#8221;</li>
<li>Indeed, even performance optimization can sometimes be separated from programming (i.e., when all you have to do to get speed is implement the correct indexes).</li>
</ul>
</li>
<li><strong>Maturity and momentum,</strong> as reflected in the availability of:
<ul>
<li>People.</li>
<li>A broad variety of mature relational DBMS.</li>
<li>Vast amounts of packaged software that &#8220;talks&#8221; SQL.</li>
</ul>
</li>
</ul>
<p>Generally speaking, I find the reasons for sticking with relational technology compelling in cases such as:  <span id="more-4569"></span></p>
<ul>
<li><strong>You&#8217;re building a low-volume, medium-complexity suite of applications that will evolve over time.</strong> This is the use case for which relational DBMS were invented, and they&#8217;re still great for it.</li>
<li><strong>Your (duplicated) data volumes would be ridiculous if you didn&#8217;t do a reasonable amount of normalization.</strong> Once you need to normalize, you need to do joins &#8212; and if you&#8217;re doing joins, you&#8217;re in relational territory.</li>
<li><strong>You simply don&#8217;t see a cost/benefit advantage to moving away from proven legacy technology.</strong> If you&#8217;re looking for an off-the-shelf answer to your needs &#8212; or if you&#8217;re inventorying your own technological shelves &#8212; relational-oriented technology has overwhelming share.</li>
</ul>
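<p>The link between normalization and joins can be shown with a minimal sketch, here using Python&#8217;s built-in <code>sqlite3</code> module. The <code>customers</code> and <code>orders</code> tables and their contents are hypothetical, invented only for illustration. Each customer fact is stored once, so a single update corrects it everywhere, and a join puts it back next to the orders when you need it.</p>

```python
import sqlite3

# In-memory database; the schema and rows are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'Acme', 'Boston');
    INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 75.0);
    """
)

# The city is stored once, so one update covers every order for the customer.
conn.execute("UPDATE customers SET city = 'Chicago' WHERE id = 1")

# Reassembling the full picture is what requires the join.
rows = conn.execute(
    """
    SELECT o.id, c.name, c.city, o.amount
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    ORDER BY o.id
    """
).fetchall()
```

<p>Had the city been copied onto every order row instead, the update would have had to touch each copy, and the application would own the job of keeping them consistent.</p>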
<p>For many enterprises, that third point alone should be decisive in a large fraction of cases.</p>
<p>But the advantages of relational technology are less clear when you&#8217;re doing <strong>serious engineering of path-breaking new applications, </strong>where by &#8220;serious engineering&#8221; I mean:</p>
<ul>
<li>The problem is big enough that you simply want the best solution, with only loose coupling needed to the rest of your technical environment.</li>
<li>Long-lasting &#8220;strategic&#8221; or legacy technology is not a great concern; you&#8217;re willing to keep &#8220;rebuilding the 747 while it&#8217;s flying&#8221; if that&#8217;s what&#8217;s necessary to get the best possible result.</li>
<li>You have access to sufficient quantities of sufficiently smart people.</li>
</ul>
<p>For example:</p>
<ul>
<li>I recently suggested that <a href="../../../../../2011/05/21/object-oriented-database-management-systems-oodbms/">innovative SaaS vendors could adopt object-oriented database technology.</a></li>
<li>Major web applications are rarely very relational. Until recently, the default approach to scaling out web databases was memcached/sharded MySQL, hardly a whole-hearted adoption of relational technology. Now NoSQL DBMS are vigorous competitors.*</li>
<li>Analytic challenges that amount to teasing out signals from streams of data are sometimes handled non-relationally as well, although it&#8217;s often nice to be able to do a few joins to mix in information from more relationally-structured data.</li>
</ul>
<p>Not coincidentally, in a lot of those cases, throwing performance concerns &#8220;over the wall&#8221; to the database administrator isn&#8217;t going to work.</p>
<p><em>*I do expect the pendulum to swing back a bit as high-performance/highly-scalable MySQL implementations mature, but there are relatively few supporting examples to date.</em></p>
<p>To look at it another way, it&#8217;s right to be skeptical about relational DBMS when you can defeat all of the reasons to favor them. For example:</p>
<ul>
<li>Data re-use may not arise when applications are self-contained and rapidly-changing.</li>
<li>Sometimes you don&#8217;t need to normalize your data.</li>
<li>It&#8217;s not obvious that the relational approach to separation of concerns is the best one. Perhaps you&#8217;d be better off with the people who understand a specific application best being responsible for all the decisions connected with it.</li>
<li>As for that maturity and momentum:
<ul>
<li>People don&#8217;t actually learn much SQL in school.</li>
<li>Are any of the mature relational DBMS what you really want?</li>
<li>Is any of that packaged software out there really helpful for your specific problem?</li>
</ul>
</li>
</ul>
<p>I should probably stop there. But in an appeal to authority, I&#8217;ll close instead with a quote from Codd&#8217;s own OLAP paper:</p>
<blockquote><p>IT should never forget that technology is a means to an end, and not an end in itself. Technologies must be evaluated individually in terms of their ability to satisfy the needs of their respective users. IT should never be reluctant to use the most appropriate interface to satisfy users’ requirements. Attempting to force one technology or tool to satisfy a particular need for which another tool is more effective and efficient is like attempting to drive a screw into a wall with a hammer when a screwdriver is at hand: the screw may eventually enter the wall but at what cost?</p></blockquote>
<p><strong><em>Related link</em></strong></p>
<ul>
<li><a href="../../../../../2008/02/15/database-management-system-choices-overview/">My exchange with Mike Stonebraker highlighting our shared advocacy for database diversity</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/05/29/when-to-use-relational-database-management-system/feed/</wfw:commentRss>
		<slash:comments>17</slash:comments>
		</item>
		<item>
		<title>NoSQL overview</title>
		<link>http://www.dbms2.com/2010/10/11/nosql-overview/</link>
		<comments>http://www.dbms2.com/2010/10/11/nosql-overview/#comments</comments>
		<pubDate>Mon, 11 Oct 2010 04:02:46 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Parallelization]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=3222</guid>
		<description><![CDATA[My NoSQL article is finally posted; I hope it lives up to all the foreshadowing. It is being run online at Intelligent Enterprise/Information Week, as per the link above, where Doug Henschen edited it with an admirably light touch. Below please find three excerpts* that convey the essence of my thinking on NoSQL. For much [...]]]></description>
			<content:encoded><![CDATA[<p>My <a href="http://intelligent-enterprise.informationweek.com/channels/information_management/showArticle.jhtml;jsessionid=1LVTB0M20CB1XQE1GHOSKH4ATMY32JVN?articleID=227701021&amp;pgno=2">NoSQL article</a> is finally posted; I hope it lives up to all the <a href="http://www.dbms2.com/2010/08/26/nosql-hvsp-olrp/">foreshadowing</a>. It is being run online at <em>Intelligent Enterprise/Information Week,</em> as per the link above, where Doug Henschen edited it with an admirably light touch.</p>
<p>Below please find three excerpts* that convey the essence of my thinking on NoSQL. For much more detail, please see the article itself.</p>
<p><em>*Notwithstanding my admiration for Doug&#8217;s editing, the excerpts are taken from my final pre-editing submission, not from the published article itself. </em></p>
<p>My quasi-definition of &#8220;NoSQL&#8221; wound up being:  <span id="more-3222"></span></p>
<blockquote><p>NoSQL DBMS start from three design premises:</p>
<ul>
<li>Transaction semantics are unimportant, and locking is downright annoying.</li>
<li>Joins are also unimportant, especially joins of any complexity.</li>
<li>There are some benefits to having a DBMS even so.</li>
</ul>
<p>NoSQL DBMS further incorporate one or more of three assumptions:</p>
<ul>
<li>The database will be big enough that it should be scaled across multiple servers.</li>
<li>The application should run well if the database is replicated across multiple geographically distributed data centers, even if the connection between them is temporarily lost.</li>
<li>The database should run well if the database is replicated across a host server and a bunch of occasionally-connected mobile devices.</li>
</ul>
<p>In addition, NoSQL advocates commonly favor the idea that a database should have no fixed schema, other than whatever emerges as a byproduct of the application-writing process.</p></blockquote>
<p>I subdivided the space by saying:</p>
<blockquote><p>If not SQL, then what?  A number of possibilities have been tried, with the four main groups being:</p>
<ul>
<li>Simple key-value store.</li>
<li>Quasi-tabular.</li>
<li>Fully SQL/tabular.</li>
<li>Document/object.</li>
</ul>
<p>DBMS based on <a href="../category/datatype/rdf-graph-database/">graphical data models</a> are also sometimes suggested to be part of NoSQL, as are the file systems that underlie many <a href="../category/parallelization/mapreduce/">MapReduce</a> implementations. But as a general rule, those data models are most effective for analytic use cases somewhat apart from the NoSQL mainstream.</p></blockquote>
<p>My conclusion was:</p>
<blockquote><p>So should you adopt NoSQL technology? Key considerations include:</p>
<ul>
<li><strong>Immaturity.</strong> The very term “NoSQL” has only been around since 2009. Most NoSQL “products” are open source projects backed by a company of fewer than 20 employees.</li>
<li><strong>Open source.</strong> Many NoSQL adopters are constrained, by money or ideology, to avoid closed-source products. Conversely, it is difficult to deal with NoSQL products’ immaturity unless you’re comfortable with the rough-and-tumble of open source software development.</li>
<li><strong>Internet orientation.</strong> A large fraction of initial NoSQL implementations are for web or other internet (e.g., mobile application) projects.</li>
<li><strong>Schema mutability.</strong> If you like the idea of being able to have different schemas for different parts of the same “table,” NoSQL may be for you. If you like the database reusability guarantees of the relational model, NoSQL may be a poor fit.</li>
<li><strong>Project size.</strong> For a large (and suitable) project, the advantages of NoSQL technology may be large enough to outweigh its disadvantages. For a small, ultimately disposable project, the disadvantages of NoSQL may be minor. In between those extremes, you may be better off with SQL.</li>
<li><strong>SQL DBMS diversity.</strong> The choice of SQL DBMS goes far beyond the “Big 3-4” of Oracle, IBM DB2, Microsoft SQL Server, and SAP/Sybase Adaptive Server Anywhere. MySQL, PostgreSQL, and other mid-range SQL DBMS – open source or otherwise – might meet your needs. So might some of the scale-out-oriented startups cited above. Or if your needs are more analytic, there’s a whole range of powerful and cost-effective specialized products, from vendors such as <a href="../category/products-and-vendors/netezza/">Netezza</a>, <a href="../category/products-and-vendors/vertica-systems/">Vertica</a>, <a href="../category/products-and-vendors/aster-data-warehouse/">Aster Data</a>, or <a href="../category/products-and-vendors/greenplum/">EMC/Greenplum</a>.</li>
</ul>
<p><em>Bottom line: For cutting-edge applications – often but not only internet-centric – NoSQL technology can make sense today. In other use cases, its drawbacks are likely to outweigh its advantages.</em></p></blockquote>
<p><strong><em>Related link</em></strong></p>
<ul>
<li><a href="http://www.dbms2.com/2010/09/21/acid-compliant-transaction-integrity/">How to tell when you need ACID-compliant transaction integrity</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/10/11/nosql-overview/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
		</item>
		<item>
		<title>Partnering with Cloudera</title>
		<link>http://www.dbms2.com/2010/10/10/partnering-with-cloudera/</link>
		<comments>http://www.dbms2.com/2010/10/10/partnering-with-cloudera/#comments</comments>
		<pubDate>Sun, 10 Oct 2010 16:40:07 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=3177</guid>
		<description><![CDATA[After I criticized the marketing of the Aster/Cloudera partnership, my clients at Aster Data and Cloudera ganged up on me and tried to persuade me I was wrong. Be that as it may, that conversation and others were helpful to me in understanding the core thesis:  There are a lot of big datasets out there, [...]]]></description>
			<content:encoded><![CDATA[<p>After <a href="http://www.strategicmessaging.com/dont-try-to-emulate-the-treaty-of-torsedillas/2010/10/03/">I criticized the marketing of the Aster/Cloudera partnership</a>, my clients at Aster Data and Cloudera ganged up on me and tried to persuade me I was wrong. Be that as it may, that conversation and others were helpful to me in understanding the core thesis:  <span id="more-3177"></span></p>
<ul>
<li>There are a lot of big datasets out there, where &#8220;big&#8221; commonly means &#8220;petabyte scale.&#8221;</li>
<li>Owners of that much data commonly like to store it using free or quasi-free software, especially if the data isn&#8217;t structured in such a way that relational tables are a great fit in the first place. HDFS (Hadoop Distributed File System) is the default choice. (Of course, there always are <a href="http://www.dbms2.com/2010/10/06/ebay-followup-greenplum-out-teradata-10-petabytes-hadoop-has-some-value-and-more/">exceptions</a>.)</li>
<li><a href="http://www.dbms2.com/2009/10/10/enterprises-using-hadoo/">Some kinds of analytics can be done perfectly well in Hadoop</a>.</li>
<li>Some kinds of analytics, of course, cannot be done well in Hadoop, with the most obvious examples being:
<ul>
<li>Queries that involve serious joins.</li>
<li>Anything that requires a lower <a href="http://www.dbms2.com/2009/09/10/analytic-speed-latency/">latency</a> than Hadoop provides.</li>
</ul>
</li>
<li>When doing analytics in Hadoop on data stored in HDFS, you often will want to include data you&#8217;re storing in your relational DBMS.</li>
</ul>
<p>So Cloudera is promising fast, bidirectional connectors between Hadoop/HDFS and various DBMS, such as Aster Data nCluster, and will provide them on a services basis even before the productized versions ship. Here &#8220;fast&#8221; should, and in multiple cases does, mean &#8220;fully parallel,&#8221; with all data-owning nodes on either side (Hadoop or DBMS) more or less equally involved. Indeed, Aster is (I think for the first time) bypassing its <a href="http://www.dbms2.com/2008/09/05/mpp-data-warehouse-nodes/">loader nodes</a>, instead sending Hadoop data straight to its workers.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/10/10/partnering-with-cloudera/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Is the enterprise data warehouse a myth?</title>
		<link>http://www.dbms2.com/2010/04/12/enterprise-data-warehouse-edw-myt/</link>
		<comments>http://www.dbms2.com/2010/04/12/enterprise-data-warehouse-edw-myt/#comments</comments>
		<pubDate>Mon, 12 Apr 2010 11:52:02 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1883</guid>
		<description><![CDATA[An enterprise data warehouse should: Manage data to high standards of accuracy, consistency, cleanliness, clarity, and security. Manage all the data in your organization. Pick ONE. There&#8217;s little to dislike in the enterprise data warehouse dream, as represented (for example) in this 2004 Teradata Magazine article. But in a world where ever more data comes [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">An <strong>enterprise data warehouse</strong> should:</p>
<ul>
<li>Manage data to high standards of <strong>accuracy, consistency, cleanliness, clarity, and security.</strong></li>
<li>Manage <strong>all the data in your organization.</strong></li>
</ul>
<p style="margin-bottom: 0in;"><strong>Pick ONE.<span id="more-1883"></span></strong></p>
<p style="margin-bottom: 0in;">There&#8217;s little to dislike in the enterprise data warehouse dream, as represented (for example) in this <a href="http://www.teradata.com/library/pdf/TD_Mag_1Q_2004_Insert.pdf">2004 <em>Teradata Magazine</em> article</a>. But in a world where ever more data comes in from ever more sources – and is needed <em>ever faster</em> – it simply isn&#8217;t realistic to expect that all an enterprise&#8217;s data will be vetted, organized, and managed to the highest of standards.</p>
<p style="margin-bottom: 0in;">This is a core premise of <a href="http://www.dbms2.com/2010/04/12/greenplumchorus/">Greenplum&#8217;s Enterprise Data Cloud (EDC)/Chorus</a> marketing initiative, and in that respect Greenplum is correct.</p>
<p style="margin-bottom: 0in;">If the EDW is a great idea that can never be 100% implemented, what should you do? At conventional enterprises, the answer is pretty obvious: <strong>Manage some of your data to enterprise data warehouse standards, but not all of it.</strong> Specifically, <strong>your highest-value data should be in something that looks like a classic enterprise data warehouse, and your lower-value data shouldn&#8217;t.</strong></p>
<p style="margin-bottom: 0in;">Of course, if you&#8217;re a data mart outsourcer or other analytic service provider, whose data is about your customers&#8217; businesses rather than your own, and whose business is managing your customers&#8217; data, this may not apply to you. But otherwise it&#8217;s a position with many supporting arguments, including:</p>
<ul>
<li><strong>Financial reporting, compliance, and other legitimate concerns introduce rigidity into data models</strong>. This increases the cost and reduces the speed of getting data into enterprise data warehouses.</li>
<li><strong>Data governance procedures</strong> imposed for any other business purpose have the same effect. What&#8217;s deemed necessary for enterprise data warehouses can be fatal to timely analytics.</li>
<li>The <strong>highest-value data</strong> typically <strong>comes from transactional systems,</strong> such as order entry or sales contact management. So it starts out with a degree of governance that, say, web log files may never enjoy.</li>
<li>In some enterprises, it is affordable or even cost-effective to manage your highest-value data in your favorite big-brand DBMS, but necessary to manage most of your data in something with lower TCO (Total Cost of Ownership). <strong>Big-brand OLTP DBMS are often better</strong> (or at least less bad) <strong>at</strong> managing <strong>enterprise data warehouses than</strong> they are <strong>at</strong> running <strong>data mart workloads.</strong></li>
<li>At certain enterprise and database sizes, it may indeed make sense to run what amounts to an <strong>enterprise data warehouse out of the same database instance that does OLTP,</strong> while putting larger data sets into more cost-effective data marts. A trend to “operational BI” may actually make that option more appealing going forward than it has been in the past.</li>
<li>And finally, there&#8217;s the empirical fact that <strong>not one really large enterprise on the whole planet has a true, perfectly comprehensive enterprise data warehouse.</strong> At least, I&#8217;ve never heard of one.</li>
</ul>
<p><em><strong>Related links</strong></em></p>
<ul>
<li>Even <a href="http://www.dbms2.com/2008/10/23/teradata-appliance-product-lines/">Teradata doesn&#8217;t push an EDW-only strategy</a> any more</li>
<li>I agreed when Greenplum first started pushing the EDC idea that something like it would be <a href="http://www.dbms2.com/2009/06/08/the-future-of-data-marts/">the future of data marts</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/04/12/enterprise-data-warehouse-edw-myt/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>The Naming of the Foo</title>
		<link>http://www.dbms2.com/2010/03/13/the-naming-of-the-foo/</link>
		<comments>http://www.dbms2.com/2010/03/13/the-naming-of-the-foo/#comments</comments>
		<pubDate>Sat, 13 Mar 2010 22:47:06 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1703</guid>
		<description><![CDATA[Let&#8217;s start from some reasonable premises. No technology category name is ever perfect. It&#8217;s particularly hard to describe NoSQL (Not Only SQL) accurately, given the basic confusion as to what NoSQL is all about. That said, it seems pretty clear that NoSQL is about making big websites (and perhaps other cloud-like installations) run and scale. [...]]]></description>
			<content:encoded><![CDATA[<p>Let&#8217;s start from some reasonable premises.<span id="more-1703"></span></p>
<ul>
<li><a href="http://www.strategicmessaging.com/monashs-first-law-of-commercial-semantics-explained/2009/01/09/">No technology category name is 	ever perfect</a>.</li>
<li>It&#8217;s particularly hard to describe 	NoSQL (Not Only SQL) accurately, given <a href="http://www.dbms2.com/2009/11/23/boston-big-data-summit-keynote-outline/">the basic confusion as to 	what NoSQL is all about</a>.</li>
<li>That said, it 	seems pretty clear that NoSQL is about making big websites (and 	perhaps other cloud-like installations) run and scale.</li>
<li>Dwight Merriman (founder/CEO of 	MongoDB vendor 10gen) is heading in the right direction when he says 	that the unifying ideas of NoSQL are that you do away with 	transactions and joins. But if he&#8217;s ever said something like “NoSQL 	is Foo without joins and transactions,” I don&#8217;t know what Foo is.</li>
<li><span style="font-style: normal;">Actually, 	I do know what Foo is – Foo is what happens when lots of people 	want to get small amounts each of information in or out of a 	database at the same time. I just don&#8217;t know what Foo is called.</span></li>
<li>Obviously, Foo is a lot like OLTP 	(OnLine Transaction Processing). However, it would be pretty silly 	for Foo to actually be OLTP, given that one of the core points of 	NoSQL is that you don&#8217;t have transactions.</li>
<li>It&#8217;s not just the “T” part of OLTP that&#8217;s fried. Calling something “OnLine” only makes sense as long as offline is an option, and offline transaction processing has been obsolete for a very long time.*</li>
</ul>
<p style="margin-bottom: 0in;"><em>*Sure, if you strain you can talk yourself into exceptions. But the point stands.</em></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">So we need a name for Foo, where Foo is what happens when</span><span style="font-style: normal;"><strong> lots of people want to get small amounts each of information in or out of a database at the same time.</strong></span><span style="font-style: normal;"> Thus, three major subcategories of more-or-less disk-based Foo are:</span></p>
<ul>
<li><span style="font-style: normal;">No-compromises 	ACID-compliant relational OLTP</span></li>
<li><span style="font-style: normal;">Sharded 	MySQL</span></li>
<li>NoSQL</li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">There may be some more purely memory-centric versions too, but let&#8217;s put those aside for the moment. </span></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">Absent a better idea, I can squeeze Foo into yet another four-letter acronym:</span></p>
<p style="margin-bottom: 0in;"><strong><span style="font-style: normal;">HVSP (High-Volume Simple Processing)</span></strong></p>
<p style="margin-bottom: 0in; font-style: normal;">That&#8217;s as imperfect as any other category name, and an awkward mouthful to boot. So I&#8217;d love to hear a better one; if you have one, please share it! In the meantime, I think “HVSP” has merit because:</p>
<ul>
<li><span style="font-style: normal;">The 	“Processing” part should be noncontroversial.</span></li>
<li>“<span style="font-style: normal;">High-Volume” 	is inherent to the challenge. If RDBMS scale well enough for your 	use case, using something less powerful is probably silly.*  	Similarly, while Oracle shines at high-volume OLTP workloads, there 	are many cheaper DBMS that do a fine job of OLTP at lower volumes.</span></li>
<li>“<span style="font-style: normal;">Simple” 	is the core principle of NoSQL systems, which drop joins and 	transactions as being too much foofarah.  That only makes sense at 	all under the assumption that you have bone-simple queries and 	updates, so that programming around the lack of joins and 	transactions isn&#8217;t all that much of a burden.</span></li>
<li><span style="font-style: normal;">Something 	similar is true of sharded MySQL.</span></li>
<li><span style="font-style: normal;">Less 	obviously, “simple” is a core principle of relational OLTP as 	well. The point of the relational model is to cap the complexity of 	data operations, or more precisely to hide that complexity from 	programmers.</span></li>
<li><span style="font-style: normal;">And 	overloading the word “simple” a bit, it&#8217;s fair to say that if 	you&#8217;re reading or writing one record at a time, you&#8217;re doing 	something relatively simple, at least as opposed to what you do in 	analytic processing. The OLTP vs. OLAP distinction is preserved in 	this name change.</span></li>
<li><span style="font-style: normal;">The whole thing matches my definition above, namely &#8220;what happens when lots of people want to get small amounts each of information in or out of a database at the same time.&#8221;</span></li>
</ul>
<p style="margin-bottom: 0in;"><em>*Assuming, of course, that rows-and-tables are a good metaphor for your data structure in the first place.</em></p>
<p style="margin-bottom: 0in; font-style: normal;">Systems I&#8217;m leaving out of the HVSP and hence also NoSQL categories include:</p>
<ul>
<li><span style="font-style: normal;"><strong>Hadoop 	and other batch-oriented MapReduce.</strong></span><span style="font-style: normal;"> Hadoop isn&#8217;t part of NoSQL. I&#8217;m pretty sure that </span><a href="http://twitter.com/mikeolson/status/10388695185">Cloudera 	CEO Mike Olson</a><span style="font-style: normal;"> agrees with me.</span></li>
<li><span style="font-style: normal;"><span style="font-weight: normal;">More 	generally, </span></span><span style="font-style: normal;"><strong>non-SQL 	data stores that don&#8217;t meet the HVSP criteria.</strong></span><span style="font-style: normal;"> Dave Kellogg stretches things when he claims that <a href="http://www.kellblog.com/2010/03/10/ieee-computer-society-article-on-nosql-an-executive-level-overview/">MarkLogic 	is a NoSQL system</a>. (But then, that was in a post where he 	seemingly praised </span><a href="http://www.dbms2.com/2009/12/11/nosql-q-and-a/">a train wreck of an article</a><span style="font-style: normal;">.)</span></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">But hey – what good is a categorization if it doesn&#8217;t leave some things out?</span></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/03/13/the-naming-of-the-foo/feed/</wfw:commentRss>
		<slash:comments>37</slash:comments>
		</item>
		<item>
		<title>Three broad categories of data</title>
		<link>http://www.dbms2.com/2010/01/17/three-broad-categories-of-data/</link>
		<comments>http://www.dbms2.com/2010/01/17/three-broad-categories-of-data/#comments</comments>
		<pubDate>Sun, 17 Jan 2010 15:31:24 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1421</guid>
		<description><![CDATA[People often try to draw a distinction between: Traditional data of the sort that&#8217;s stored in relational databases, aka “structured.” Everything else, aka “unstructured” or “semi-structured” or “complex.” There are plenty of problems with these formulations, not the least of which is that the supposedly “unstructured” data is the kind that actually tends to have [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">People often try to draw a distinction between:</p>
<ul>
<li>Traditional data of the sort 	that&#8217;s stored in relational databases, aka “structured.”</li>
<li>Everything else, aka 	“unstructured” or “semi-structured” or “complex.”</li>
</ul>
<p style="margin-bottom: 0in;">There are plenty of problems with these formulations, not the least of which is that the supposedly “unstructured” data is the kind that actually tends to have interesting internal structures. But of the many reasons why these distinctions don&#8217;t tend to work very well, I think the most important one is that:</p>
<p><strong>Databases shouldn&#8217;t be divided into just two categories. </strong><span style="font-weight: normal;"> Even as a rough-cut approximation, </span><strong>they should be divided into three,</strong><span style="font-weight: normal;"> namely:</span></p>
<ul>
<li><strong>Human/Tabular</strong> data &#8211; i.e., human-generated data that fits well into relational tables or arrays</li>
<li><strong>Human/Nontabular</strong> data &#8212; i.e., all other data generated by humans</li>
<li><strong>Machine-Generated</strong> data</li>
</ul>
<p style="margin-bottom: 0in;">Even that trichotomy is grossly oversimplified, for reasons such as:</p>
<ul>
<li>These categories overlap.</li>
<li>There are kinds of data that get 	into fuzzy border zones.</li>
<li>Not all data in each category has 	all the same properties.</li>
</ul>
<p style="margin-bottom: 0in;">But at least as a starting point, I think this basic categorization has some value.<span id="more-1421"></span></p>
<p style="margin-bottom: 0in;">By <strong>human-generated data that fits well into relational tables or arrays,</strong> what I really mean is: <strong>the input from most conventional kinds of transactions</strong> – purchase/sale, inventory/manufacturing, employment status change, etc. This is the core data managed by OLTP relational DBMS everywhere. It is also the core data in analytic relational or MOLAP databases. The vast majority of what we think or know about “database management” applies primarily to data of this kind, in large part because of two fundamental properties of this information:</p>
<ul>
<li>It is meaningful to contemplate 	this data as being 100% accurate and complete (even if that goal is 	difficult to achieve in the real world).</li>
<li>This data is precise – i.e., one 	can check predicates against it and (give or take regrettable data 	imperfections) get inarguable yes/no answers.</li>
</ul>
<p style="margin-bottom: 0in;">For most enterprises, this is the most important data they have. It was created as a result of expensive business activities. It deals directly with money, employees, physical goods, and the rest of the things that make an enterprise go. It can be fruitfully analyzed in ever more ways, which is why it should never be thrown out or even entirely relegated to tape, now that data warehouse software, hardware, and storage have become so cheap. (“Disk is the new tape.”) And because of the importance of both preserving and accessing it, it should often be stored in multiple copies – OLTP, data warehouse, data mart, in-memory analytics, near-line quasi-archive, MOLAP cubes (if you must) and so on, plus of course replicas for high throughput and availability.</p>
<p style="margin-bottom: 0in;">But <strong>humans generate many other kinds of data as well,</strong> especially in a form directly suitable for <strong>communication</strong> – text (in many formats), documents (text or otherwise), pictures, videos, etc. <a href="../2005/12/09/relational-dbms-versus-text-data/">Traditional relational databases are a poor home for this kind of data</a> because:</p>
<ul>
<li>This data often deals with 	opinions or aesthetic judgments – there is little concept of 	perfect accuracy.</li>
<li>Similarly, there is little concept 	of perfect completeness.</li>
<li>There&#8217;s also little concept of 	perfectly, unarguably accurate query results – different people 	will have different opinions as to what comprises good results for a 	search.</li>
<li>Queries don&#8217;t lend themselves to 	binary answers; rather, documents can have differing degrees of 	relevancy.</li>
</ul>
<p style="margin-bottom: 0in;">Systems for managing this kind of data are much less advanced than relational database managers. Nobody knows how to get all the information out of a text document, or query all of it if they could, and the story is even worse for non-text examples. The systems that give the best query results aren&#8217;t necessarily the same ones that have the best database administration features. Basically, this area is still a mess, and it&#8217;s a mess that consumes a huge fraction of all the data storage products sold today.</p>
<p style="margin-bottom: 0in;">But give or take questions of storage efficiency and deduplication, if humans created that kind of data, they put a lot of effort into it, so it&#8217;s worth keeping. Besides, compliance regulations commonly mandate that we do so – except, perhaps, when they mandate that we throw it away.</p>
<p style="margin-bottom: 0in;"><strong>Machine-generated data</strong> is a whole other can of worms. Paradigmatic examples of what I mean by <a href="http://www.dbms2.com/2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a> include:</p>
<ul>
<li>Computer, network, and other equipment logs</li>
<li>Satellite and similar telemetry (whether for espionage or science)</li>
<li>Location data such as RFID chip readings, GPS system output, etc.</li>
<li>Temperature and other environmental sensor readings</li>
<li>Sensor readings from factories, pipelines, etc.</li>
<li>Output from many kinds of medical devices, in hospitals and (increasingly) homes alike</li>
</ul>
<p style="margin-bottom: 0in;">Unlike human-generated data, whose growth is constrained by macro factors such as population and total level of economic activity, <strong>machine-generated data will continue to grow as fast as Moore&#8217;s Law lets it. </strong><span style="font-weight: normal;">That fact has two profound consequences:</span></p>
<ul>
<li><strong>It is unrealistic to hope ever to keep most or all machine-generated data,</strong><span style="font-weight: normal;"> whereas I think that&#8217;s exactly what should and will happen with human-generated data</span></li>
<li><span style="font-weight: normal;">Before long, </span><strong>most data (by volume) will be machine-generated</strong></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-weight: normal;">And so it is not really an exaggeration to say that <strong>machine-generated data is the future of data management.</strong></span></p>
<p style="margin-bottom: 0in;"><span style="font-weight: normal;">I&#8217;d like to close this long post by immediately pointing out some of the flaws in this simple trichotomy. One obvious gray area lies in <strong>hybrid human/machine-generated data,</strong> three big examples of which are:</span></p>
<ul>
<li><span style="font-weight: normal;">Web clickstreams</span></li>
<li><span style="font-weight: normal;">Call detail records (CDR)</span></li>
<li><span style="font-weight: normal;">Stock trades</span></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-weight: normal;">In all three cases, we are quickly getting to the point where this data is preserved in its entirety (even if the network event data associated with the web logs is reduced before storage). And in each case it fits pretty well into RDBMS, although Hadoop has a role to play as well. So pretending it&#8217;s purely human-generated probably isn&#8217;t all that misleading.</span></p>
<p style="margin-bottom: 0in;"><span style="font-weight: normal;">Another gray area lies in text that gets linguistically processed – i.e. via <a href="http://www.texttechnologies.com/2007/12/23/text-mining-myths-realities/">text-mining</a> tools – with the output placed into a relational database. I don&#8217;t immediately see a workaround for that flaw in my labeling scheme.  So let&#8217;s just say no taxonomy is perfect.*</span></p>
<p style="margin-bottom: 0in;"><em><span style="font-weight: normal;">*Come to think of it, that&#8217;s one of the problems holding back text-mining technology. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </span></em></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;"><span style="font-weight: normal;">And of course some of the <a href="../2009/12/12/legit-nosql-key-value-store/">NoSQL</a> folks would note that I was oversimplifying when I tied my first category specifically to relational DBMS. So would the folks at <a href="../2010/01/15/intersystems-cache-highlights/">Intersystems</a>.</span></span></p>
<p style="margin-bottom: 0in; font-style: normal; font-weight: normal;">But the biggest oversimplification stems from this:</p>
<p style="margin-bottom: 0in;"><span style="font-weight: normal;">As Mike Stonebraker* and I argued a couple of years ago, <a href="../2008/04/10/my-own-data-management-software-taxonomy/">database management technologies should really be divided into 10+ categories.</a> </span></p>
<p style="margin-bottom: 0in;"><em><span style="font-weight: normal;">*Note: The links to Stonebraker&#8217;s own posts will be broken until Vertica&#8217;s webmaster gets his/her act together. But you can find them under other URLs via web search.</span></em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/01/17/three-broad-categories-of-data/feed/</wfw:commentRss>
		<slash:comments>19</slash:comments>
		</item>
		<item>
		<title>The legit part of the NoSQL idea</title>
		<link>http://www.dbms2.com/2009/12/12/legit-nosql-key-value-store/</link>
		<comments>http://www.dbms2.com/2009/12/12/legit-nosql-key-value-store/#comments</comments>
		<pubDate>Sat, 12 Dec 2009 06:06:52 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1319</guid>
		<description><![CDATA[I&#8217;ve written some snarky things about the “NoSQL” concept – or at least the moniker. (Carl Olofson&#8217;s term &#8220;non-schematic databases&#8221; seems less bad.) Yet I&#8217;m actually favorable about the increasing use of SQL alternatives. Perhaps I should pull those thoughts together. Relational database management systems were invented to let you use one set of data [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">I&#8217;ve written some snarky things about the “<a href="http://www.dbms2.com/2009/12/11/nosql-q-and-a/">NoSQL</a>” concept – or at least the moniker. (Carl Olofson&#8217;s term &#8220;<a href="http://searchsoa.techtarget.com/news/article/0,289142,sid26_gci1376713,00.html">non-schematic databases</a>&#8221; seems less bad.) Yet I&#8217;m actually favorable about the increasing use of <a href="http://www.dbms2.com/2008/02/15/non-relational-database-management/">SQL alternatives</a>. <span style="font-style: normal;">Perhaps I should pull those thoughts together.<span id="more-1319"></span></span></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;"><strong>Relational database management systems were invented to let you use one set of data in multiple ways,</strong></span><span style="font-style: normal;"> including ways that are unforeseen at the time the database is built and the first applications against it are written. In almost all cases, </span><span style="font-style: normal;"><strong>RDBMS are the best way to manage data of that nature.</strong></span><span style="font-style: normal;"> The increasing diversity in kinds of RDBMS – especially on the analytic side – just strengthens the point. Also, </span><span style="font-style: normal;"><strong>RDBMS are more mature than most competing technologies.</strong></span><span style="font-style: normal;"> And so, for multiple reasons, </span><span style="font-style: normal;"><strong>your highest-value data often belongs in an RDBMS.</strong></span></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">The main reason I wrote “often” instead of “always” is that some of your highest-value data is in formats that don&#8217;t fit well into an RDBMS at all. The most obvious example is text. </span><a href="http://www.dbms2.com/2005/12/09/relational-dbms-versus-text-data/">Text data shouldn&#8217;t be shoehorned into the relational model</a>,<span style="font-style: normal;"> and to date it often has been best to </span><span style="font-style: normal;"><strong>manage text entirely outside of RDBMS.</strong></span></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">Even lower-value data often belongs in RDBMS. </span><a href="http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/">eBay</a><span style="font-style: normal;"> has huge volumes of log files stored in RDBMS. </span><a href="http://www.dbms2.com/2009/10/01/yahoos-decapetabyte-data-warehousinghadoop/">Yahoo</a> and <a href="http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/">Facebook</a><span style="font-style: normal;"> both prefer Hadoop over traditional RDBMS – but both are also building capabilities into Hadoop that pretty much will amount to a new RDBMS.</span></p>
<p style="margin-bottom: 0in;"><a href="http://www.dbms2.com/2009/10/03/issues-in-scientific-data-management/">Science</a><span style="font-style: normal;"> provides some pretty compelling use cases for non-SQL-oriented DBMS. So does </span><a href="http://www.dbms2.com/2008/08/16/intersystems-cache-microsoft-sql-serve/">health care</a>. <span style="font-style: normal;">But that&#8217;s not the kind of thing the NoSQL folks seem to focus on. Rather, “NoSQL” seems mainly to encompass three kinds of systems:</span></p>
<ul>
<li><span style="font-style: normal;"><strong>Key-value stores, such as </strong></span><a href="http://www.dbms2.com/2008/07/21/project-cassandra-facebook-open-sourced-quasi-dbms/"><strong>Cassandra</strong></a><span style="font-style: normal;"><strong> or BigTable</strong></span><span style="font-style: normal;">. So far as I can tell, a key-value store is just a substitute for a transaction-processing DBMS, inferior in most ways except scalable performance, where it can shine. As an additional benefit, a key-value store frees you from that pesky SQL programming you never learned in school. What&#8217;s more, if you can&#8217;t stabilize your schema, a key-value store lets you get some level of database programming done anyway.</span></li>
<li><span style="font-style: normal;"><strong>Document managers such as CouchDB or MongoDB. </strong></span><span style="font-style: normal;">I haven&#8217;t figured out how those are different from low-volume distributed file systems, or why anybody should care about them.</span></li>
<li><span style="font-style: normal;"><strong>DBMS imitations built on top of HDFS</strong></span><span style="font-style: normal;"> (Hadoop Distributed File System). For the most part, I think those will wind up talking SQL.</span></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">So it seems that, at least for now, </span><span style="font-style: normal;"><strong>the legit part of the NoSQL movement is the distributed key-value stores.</strong></span><span style="font-style: normal;"> Frankly, even if transactional data is persisted in a key-value store, it should wind up in an RDBMS, whether OLTP or analytic. But even so, the big web companies seem to have demonstrated that key-value stores have very legitimate uses.</span></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/12/12/legit-nosql-key-value-store/feed/</wfw:commentRss>
		<slash:comments>20</slash:comments>
		</item>
	</channel>
</rss>

