<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; Archiving and information preservation</title>
	<atom:link href="http://www.dbms2.com/category/database-management-system/archiving-and-information-preservation/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Thu, 09 Feb 2012 09:21:51 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>Clarifying SAND&#8217;s customer metrics, positioning and technical story</title>
		<link>http://www.dbms2.com/2011/11/12/clarifying-sands-customer-metrics-positioning-and-technical-story/</link>
		<comments>http://www.dbms2.com/2011/11/12/clarifying-sands-customer-metrics-positioning-and-technical-story/#comments</comments>
		<pubDate>Sun, 13 Nov 2011 02:45:36 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Archiving and information preservation]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data mart outsourcing]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[SAND Technology]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Workload management]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5669</guid>
		<description><![CDATA[Talking with my clients at SAND can be confusing. That said: I need to revise my figures for SAND&#8217;s customer count way downward. SAND finally has a reasonably clear positioning. SAND&#8217;s product actually seems to have a lot of features. A few months ago, I wrote: SAND Technology reported &#62;600 total customers, including &#62;100 direct. [...]]]></description>
			<content:encoded><![CDATA[<p>Talking with my clients at SAND can be confusing. That said:</p>
<ul>
<li>I need to revise my figures for SAND&#8217;s customer count way downward.</li>
<li>SAND finally has a reasonably clear positioning.</li>
<li>SAND&#8217;s product actually seems to have a lot of features.</li>
</ul>
<p>A few months ago, I wrote:</p>
<blockquote><p>SAND Technology reported &gt;600 total customers, including &gt;100 direct.</p></blockquote>
<p>Upon talking with the company, I need to revise that figure downward, from &gt; 600 to 15.</p>
<p><span id="more-5669"></span><em>One embarrassing point: SAND is a client, and I view it as part of my job to save clients from that kind of inadvertent misstatement.</em></p>
<p>It turns out that SAND has a very impressive customer &#8212; Dunnhumby, a data mart outsourcer with 200 terabytes of data in SAND, 30 or so incoming data streams, 400 or so nodes &#8230; and 600 or so end customers, all of which SAND was counting as OEM end customers for its DBMS. But I, other industry observers, and other vendors generally don&#8217;t count that way.</p>
<p>Besides Dunnhumby, SAND has 14 other customers on maintenance, with &lt; 1 terabyte of data each. Until recently, SAND had a couple dozen more customers than that, but it <a href="http://www.sand.com/sand-technology-announces-sale-sap-ilm-product-line/">sold its SAP-oriented archiving/near-line storage product line to Informatica</a>.</p>
<p>I still don&#8217;t know where the &#8220;&gt; 100 direct&#8221; part came from.</p>
<p>After the sale of its other product line, SAND is squarely in the market for analytic DBMS. SAND&#8217;s sales efforts seem to be focused on <a href="http://www.dbms2.com/2011/03/03/investigative-analytics/">investigative analytics</a>, although some of its existing users seem to be more focused on <a href="http://www.dbms2.com/2011/11/08/terminology-operational-analytics/">operational analytics</a>. Most specifically, SAND is trying to focus on &#8220;people data&#8221; &#8212; customer loyalty, health care, etc. &#8212; rather than purely <a href="http://www.dbms2.com/2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a>, with the paradigmatic target application being personalized marketing.</p>
<p>SAND technical highlights include:</p>
<ul>
<li>SAND sells a columnar analytic DBMS.</li>
<li>The SAND DBMS operates on bitmaps, with heavy use of run-length encoding on the bitmaps. Bitmaps are used for everything except BLOBs (Binary Large OBjects).</li>
<li>Actual data compression also comes into play, e.g. as result sets are being assembled. This is based on a true global dictionary &#8212; multiple columns are tokenized together.</li>
<li>Indeed, SAND can decompose columns and tokenize their parts (e.g. time stamps).</li>
<li>SAND&#8217;s workload management sees RAM and CPU, but not explicitly I/O.</li>
<li>SAND lets you pin certain tables or even table segments in RAM if you want to.</li>
</ul>
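<p>To make the bitmap point concrete, here is a toy sketch of one bitmap per distinct column value, each collapsed by run-length encoding. It is only an illustration of the general technique, not SAND&#8217;s actual storage format; the function names and the sample column are my own assumptions.</p>

```python
from itertools import groupby


def build_bitmaps(column):
    """Return one bitmap (list of 0/1) per distinct value in the column."""
    return {
        value: [1 if cell == value else 0 for cell in column]
        for value in sorted(set(column))
    }


def rle_encode(bits):
    """Collapse a bitmap into (bit, run_length) pairs."""
    return [(bit, len(list(run))) for bit, run in groupby(bits)]


def rle_decode(runs):
    """Expand (bit, run_length) pairs back into a bitmap."""
    bits = []
    for bit, length in runs:
        bits.extend([bit] * length)
    return bits


# Hypothetical sorted column; sorted data gives long runs, which is where RLE pays off.
region = ["east"] * 4 + ["north"] * 3 + ["west"] * 5
encoded = {value: rle_encode(bits) for value, bits in build_bitmaps(region).items()}
```

<p>In this example each 12-bit bitmap shrinks to two or three runs. On unsorted, high-cardinality data the runs get short and the benefit fades.</p>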
<p>SAND&#8217;s update story is straightforward &#8212; when data comes in, all the columns and bitmaps are updated as needed. Still, since SAND is columnar, you wouldn&#8217;t expect true updates in place, and you&#8217;d be right. Rather, there&#8217;s a story with MVCC (MultiVersion Concurrency Control) and garbage collection, lock-free. The MVCC is also exploited for a kind of time travel, and further for some kind of virtual data mart capability.</p>
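<p>For readers who want the MVCC idea spelled out, here is a minimal sketch of an append-only versioned store with &#8220;as of&#8221; reads and garbage collection. The class and method names are hypothetical, and it says nothing about how SAND actually implements versioning or achieves lock-freedom.</p>

```python
class VersionedStore:
    """Toy append-only store: every write adds a version, nothing is overwritten."""

    def __init__(self):
        self.clock = 0
        self.versions = {}  # key -> list of (commit_time, value), oldest first

    def write(self, key, value):
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))
        return self.clock

    def read(self, key, as_of=None):
        """Return the newest value committed at or before as_of (default: now)."""
        as_of = self.clock if as_of is None else as_of
        visible = [
            value
            for commit_time, value in self.versions.get(key, [])
            if as_of >= commit_time
        ]
        return visible[-1] if visible else None

    def collect_garbage(self, oldest_snapshot):
        """Drop versions that no snapshot at or after oldest_snapshot can see."""
        for key, history in self.versions.items():
            keep_from = 0
            for index, (commit_time, _) in enumerate(history):
                if oldest_snapshot >= commit_time:
                    keep_from = index
            self.versions[key] = history[keep_from:]


store = VersionedStore()
t1 = store.write("price", 10)
t2 = store.write("price", 12)
t3 = store.write("price", 15)
current = store.read("price")                 # latest version
time_travel = store.read("price", as_of=t1)   # the value as it stood at t1
```

<p>The &#8220;time travel&#8221; read is just a read against an older timestamp. Garbage collection decides how far back such reads remain possible.</p>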
<p>SAND&#8217;s parallelization story is a bit complicated.</p>
<ul>
<li>SAND has, or at least has the potential for, <a href="../../../../../2008/09/05/mpp-data-warehouse-nodes/">node specialization</a>, with database and storage nodes being different.</li>
<li>In principle, disks are specific to storage nodes, and it&#8217;s a configuration option as to whether a database node sees one, some, or all storage nodes.</li>
<li>In practice, only Dunnhumby among SAND&#8217;s customers operates on other than a shared-disk basis. Dunnhumby&#8217;s configuration is mixed/matched among various SAND sharing options.</li>
</ul>
<p>SAND is proud of its PMML (Predictive Modeling Markup Language) scoring capabilities, but otherwise hasn&#8217;t shipped much in the way of <a href="../../../../../2011/02/24/analytic-platforms/">analytic platform</a> capabilities. That said, work is underway on a user-defined table function capability that can also query external tables, fire off MapReduce jobs, and so on, under the code name UQL.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/12/clarifying-sands-customer-metrics-positioning-and-technical-story/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Text data management, Part 1: Confusion</title>
		<link>http://www.dbms2.com/2011/10/10/text-data-management-confusion/</link>
		<comments>http://www.dbms2.com/2011/10/10/text-data-management-confusion/#comments</comments>
		<pubDate>Tue, 11 Oct 2011 01:58:03 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Archiving and information preservation]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5421</guid>
		<description><![CDATA[This is Part 1 of a three post series. The posts cover: Confusion about text data management. Choices for text data management (general and short-request). Choices for text data management (analytic). There&#8217;s much confusion about the management of text data, among technology users, vendors, and investors alike. Reasons seems to include: The terminology around text [...]]]></description>
			<content:encoded><![CDATA[<p><em>This is Part 1 of a three-post series. The posts cover:</em></p>
<ol>
<li> <em><a href="http://www.dbms2.com/2011/10/10/text-data-management-confusion/">Confusion about text data management</a>.</em></li>
<li><em><a href="http://www.dbms2.com/2011/10/10/text-data-management-part-2-general-and-short-request/">Choices for text data management (general and short-request)</a>.</em></li>
<li><em><a href="http://www.dbms2.com/2011/10/10/text-data-management-part-3-analytic-and-progressively-enhanced/">Choices for text data management (analytic)</a>.</em></li>
</ol>
<p>There&#8217;s much confusion about the management of text data, among technology users, vendors, and investors alike. Reasons seem to include:</p>
<ul>
<li>The terminology around text data is inaccurate.</li>
<li>Data volume estimates for text are misleading.</li>
<li>Multiple different technologies are in the mix, including:
<ul>
<li>Enterprise text search.</li>
<li>Text analytics &#8212; <a href="http://www.texttechnologies.com/category/software-as-a-service-saas/category/text-mining/">text mining</a>, sentiment analysis, etc.</li>
<li>Document stores &#8212; e.g. document-oriented NoSQL, or MarkLogic.</li>
<li>Log management and parsing &#8212; e.g. Splunk.</li>
<li>Text archiving &#8212; e.g., various specialty email archiving products I couldn&#8217;t even name.</li>
<li>Public web search &#8212; Google et al.</li>
</ul>
</li>
<li>Text search vendors have disappointed, especially technically.</li>
<li>Text analytics vendors have disappointed, especially financially.</li>
<li>Other analytic technology vendors ignore <a href="http://www.texttechnologies.com/2010/12/01/state-of-the-art-text-analytics-mining-applications/">what the text analytic vendors actually have accomplished</a>, and reinvent inferior wheels rather than OEM the state of the art.</li>
</ul>
<p>Above all: <a href="http://www.dbms2.com/2011/10/10/text-data-management-part-2-general-and-short-request/">The use cases for text data vary greatly</a>, just as the use cases for simply-structured databases do.</p>
<p>There are probably fewer people now than there were six years ago who need to be told that <a href="http://www.dbms2.com/2005/12/09/relational-dbms-versus-text-data/">text and relational database management are very different things</a>. Other misconceptions, however, appear to be on the rise. Specific points that are commonly overlooked include: <span id="more-5421"></span></p>
<ul>
<li><strong> The terms &#8220;unstructured&#8221; or &#8220;semi-structured&#8221; data are inherently misleading</strong>. That&#8217;s why <a href="../../../../../2011/05/17/poly-structured-database/">I favor &#8220;multi-structured&#8221; or &#8220;poly-structured&#8221; instead</a>. (&#8220;Multi-structured&#8221; seems to be winning; e.g., it&#8217;s been adopted by Teradata and Teradata/Aster.)</li>
<li>The &#8220;social media&#8221; text data any one enterprise brings in house isn&#8217;t all that much. For example, <a href="../../../../../2011/04/14/attensity-update/">Attensity serves many different enterprises&#8217; social media needs from a single 20-terabyte data store</a>, and reports that no single enterprise has required as much as 1 terabyte of text yet. <strong>Text data may consume a lot of storage </strong>on spinning disks somewhere,<strong> but it&#8217;s not that big a factor in future DBMS industry growth.</strong> (That 20 terabyte figure does seem low.)</li>
<li><strong>Structured databases are typically worth a lot more per bit than other kinds.</strong> The most valuable electronic data, per-bit, is probably records of significant economic transactions &#8212; purchases, sales, money transfers, etc. The least valuable may be sensor log files, whose contents consist mainly of &#8220;Nothing going on here; ping you again in a minute.&#8221; Email logs, web interaction data and many other kinds fall somewhere in between. Highly valuable documents &#8212; such as signed contracts &#8212; generally persist in paper as well as electronic forms. <strong>Investors commonly overlook this point.</strong></li>
<li><strong>The enterprise text search industry is screwed up.</strong>
<ul>
<li>FAST was a goofy company before it was acquired for far too much money by Microsoft.</li>
<li>Autonomy was a goofy company before it was acquired for far too much money by HP.</li>
<li>Google&#8217;s enterprise efforts are quiet.</li>
<li>The integration of text search and relational DBMS &#8212; e.g. at Oracle &#8212; has languished, with poor performance and evident lack of management attention.</li>
<li>Smaller text search vendors don&#8217;t seem to be getting a lot of traction &#8212; e.g., <a href="http://www.texttechnologies.com/category/vendors/coveo/">Coveo</a> has a decent reputation, but when&#8217;s the last time you heard much about them? What has Attivio actually accomplished?</li>
</ul>
</li>
<li><strong>Text analytics is a small business</strong>. Add up the revenue for Attensity, Clarabridge, Lexalytics, Temis, and all the others, and you might poke above $100 million, especially now that Attensity has had a three-way merger. Then again, you might not.</li>
<li>Even so, <strong>the text analytics vendors have developed sophisticated technology.</strong> In particular, you can use it to get a pretty good idea as to what people are writing about you, individually or as groups.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/10/10/text-data-management-confusion/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Teradata Columnar and Teradata 14 compression</title>
		<link>http://www.dbms2.com/2011/09/22/teradata-columnar-compression/</link>
		<comments>http://www.dbms2.com/2011/09/22/teradata-columnar-compression/#comments</comments>
		<pubDate>Thu, 22 Sep 2011 05:25:42 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Archiving and information preservation]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Rainstor]]></category>
		<category><![CDATA[Teradata]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5296</guid>
		<description><![CDATA[Teradata is pre-announcing Teradata 14, for delivery by the end of this year, where by &#8220;Teradata 14&#8221; I mean the latest version of the DBMS that drives the classic Teradata product line. Teradata 14&#8217;s flagship feature is Teradata Columnar, a hybrid-columnar offering that follows in the footsteps of Greenplum (now part of EMC) and Aster [...]]]></description>
			<content:encoded><![CDATA[<p>Teradata is pre-announcing Teradata 14, for delivery by the end of this year, where by &#8220;Teradata 14&#8221; I mean the latest version of the DBMS that drives the classic Teradata product line. Teradata 14&#8217;s flagship feature is Teradata Columnar, a hybrid-columnar offering that follows in the footsteps of <a href="../../../../../2009/10/14/greenplum-hybrid-columnar/">Greenplum</a> (now part of EMC) and <a href="../../../../../2010/09/15/aster-data-ncluster-version-4-6/">Aster Data</a> (now part of Teradata).</p>
<p>The basic idea of Teradata Columnar is:</p>
<ul>
<li>Each table can be stored in Teradata in row format, column format, or a mix.</li>
<li>You can do almost anything with a Teradata columnar table that you can do with a row-based one.</li>
<li>If you choose column storage, you also get some new compression choices.</li>
</ul>
<p><span id="more-5296"></span>The &#8220;mix&#8221; option is like Vertica&#8217;s <a href="../../../../../2009/08/04/flexstore-and-the-rest-of-vertica-35/">FlexStore</a>, in that different columns (e.g. different components of a street address) can be grouped into a mini-row, even if you otherwise choose to store that table in a columnar way. Teradata does not at this time offer the Greenplum or Aster way of mixing rows and columns, whereby some of the rows in a table can be stored in a column-store way, while other rows are stored in entire-row row-store solidarity.</p>
<p>Thus, Teradata Columnar gives you many of the basic I/O and compression benefits of columnar DBMS, along with all the usual Teradata goodness of concurrency, workload management, system management, SQL support, and so on. By way of comparison:</p>
<ul>
<li>Similar things are true of Greenplum&#8217;s offering (except for the parts about concurrency, advanced workload management, and so on).</li>
<li>Aster doesn&#8217;t have columnar compression.</li>
<li>Oracle has <a href="../../../../../2011/02/06/columnar-compression-database-storage/">columnar compression but no true columnar storage</a>.*</li>
</ul>
<p>Also, as I noted above, Teradata mixes rows and columns in a different way than Aster or EMC Greenplum do.</p>
<p><em>*However, I won&#8217;t be surprised if Oracle soon announces true hybrid-columnar as well. I originally heard about Teradata Columnar and Oracle&#8217;s efforts to develop true hybrid-columnar storage the same week, 23 months ago.</em></p>
<p>Going hybrid-columnar is a big deal. Aster Data, for example, told me that a considerable fraction of all its workloads ran faster with columnar than row-based storage.* And it&#8217;s of extra importance to a vendor that, like Teradata, needs to play catch-up in the compression derby.</p>
<p><em>*Anything in which the queries eliminated more than half or so of the columns (60%, if I recall correctly, but it was definitely an approximate figure). That pretty much means any query except full and near-full table scans.</em></p>
<p>Teradata&#8217;s columnar compression story is pretty complicated. To quote from a forthcoming press release:</p>
<blockquote><p>Teradata automatically chooses from among six types of compression: run length, dictionary, trim, delta on mean, null and UTF8. based on the column demographics.</p></blockquote>
<p>The trickiest words in that are &#8220;automatic&#8221; and &#8220;dictionary&#8221;. Teradata divides column-store data into &#8220;column containers&#8221; of, say, 8 KB. (Current thinking is 8 KB default, 65 KB maximum, but that could change by the time of product release.) By default, Teradata software decides separately for each column container which compression algorithm(s) to use. It can even change its mind dynamically over time, as the contents of the container change.</p>
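<p>The &#8220;decide separately for each container&#8221; idea can be sketched as a cost comparison. The code below is a toy under loud assumptions: it considers only run-length and dictionary encoding plus an uncompressed fallback (not the six types in the press release), and the 32-bit widths and cost formulas are invented for illustration rather than taken from Teradata.</p>

```python
from itertools import groupby
from math import ceil, log2

VALUE_BITS = 32  # assumed width of an uncompressed value


def raw_bits(values):
    return len(values) * VALUE_BITS


def run_length_bits(values):
    # Each run stores one value plus an assumed 32-bit run length.
    runs = sum(1 for _ in groupby(values))
    return runs * (VALUE_BITS + 32)


def dictionary_bits(values):
    # Dictionary entries at full width, plus one fixed-width token per value.
    distinct = max(1, len(set(values)))
    token_bits = max(1, ceil(log2(distinct)))
    return distinct * VALUE_BITS + len(values) * token_bits


def choose_encoding(values):
    """Pick the cheapest candidate encoding for one container."""
    costs = {
        "raw": raw_bits(values),
        "run_length": run_length_bits(values),
        "dictionary": dictionary_bits(values),
    }
    return min(costs, key=costs.get)


# Three hypothetical containers from the same column, with different demographics.
constant_container = [7] * 100
cycling_container = [1, 2, 3, 4] * 25
unique_container = list(range(100))
choices = [
    choose_encoding(container)
    for container in (constant_container, cycling_container, unique_container)
]
```

<p>Rerunning <code>choose_encoding</code> after a container&#8217;s contents change is the toy analogue of the software changing its mind over time.</p>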
<p>What I find weird about Teradata&#8217;s columnar dictionary compression is that the dictionary is container-specific. One benefit versus having a more global dictionary is that, since you compress fewer items, compression tokens can each be shorter. (The length of a typical token is a lot like the log of the cardinality of the dictionary.) Another benefit is that smaller dictionaries are faster to search. The obvious offsetting drawback is that a larger and more global dictionary has the potential to compress various items that wind up being left uncompressed in this smaller-scale scheme.</p>
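<p>The token-length arithmetic is easy to check. This sketch assumes fixed-width tokens, and both cardinalities are hypothetical figures chosen for illustration.</p>

```python
from math import ceil, log2


def token_bits(cardinality):
    """Bits needed for a fixed-width token that can address every dictionary entry."""
    return max(1, ceil(log2(cardinality)))


container_distinct = 200       # hypothetical distinct values in one container
global_distinct = 1_000_000    # hypothetical distinct values across the whole column

container_token = token_bits(container_distinct)
global_token = token_bits(global_distinct)
```

<p>Under these assumptions a container-local token needs 8 bits and a global one needs 20, so the local scheme spends well under half as much per compressed value, at the price of storing a separate small dictionary in every container.</p>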
<p>Other notes about Teradata compression include:</p>
<ul>
<li>Teradata has for a while had a more manual form of dictionary compression.</li>
<li>Teradata also has block-level compression.</li>
<li>You can do block-level compression even on top of the columnar compression described above.</li>
<li>The Teradata/Rainstor partnership for archiving-level compression that Rainstor made so much fuss about doesn&#8217;t seem to actually be happening; Teradata seems content with the other compression choices it offers.</li>
</ul>
<p>And finally, Teradata 14 extends <a href="../../../../../2008/10/14/teradata-virtual-storage/">Teradata Virtual Storage</a> with a feature called Compress on Cold. The idea is that &#8220;cold&#8221; data can safely get (extra) compression &#8212; that block-level stuff &#8212; automatically. If the data heats up again (e.g. by becoming relevant for a while to the latest year-over-year comparisons) it can be just as automatically removed from compression. Teradata thinks this is significantly better than the alternative of making manual compression choices based on not-so-granular range partitions.</p>
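<p>A rough sketch of the Compress on Cold idea follows. The <code>Block</code> class, the read counter and the threshold are all hypothetical, and <code>zlib</code> merely stands in for whatever block-level compression Teradata really uses.</p>

```python
import zlib


class Block:
    def __init__(self, payload):
        self.payload = payload   # bytes, possibly compressed
        self.compressed = False
        self.recent_reads = 0    # reads seen in the current observation window

    def read(self):
        self.recent_reads += 1
        return zlib.decompress(self.payload) if self.compressed else self.payload


def rebalance(blocks, hot_threshold):
    """Compress blocks that went cold and decompress blocks that heated up."""
    for block in blocks:
        is_hot = block.recent_reads >= hot_threshold
        if is_hot and block.compressed:
            block.payload = zlib.decompress(block.payload)
            block.compressed = False
        elif not is_hot and not block.compressed:
            block.payload = zlib.compress(block.payload)
            block.compressed = True
        block.recent_reads = 0  # start a new observation window


old_data = b"2009-Q3 sales " * 200
old = Block(old_data)
new = Block(b"2011-Q3 sales " * 200)
for _ in range(5):
    new.read()
rebalance([old, new], hot_threshold=3)  # old goes cold, new stays hot
```

<p>If <code>old</code> later gets read often enough within one window, the next <code>rebalance</code> call decompresses it again, with no manual partition-level decision involved.</p>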
<p>Unsurprisingly, Teradata lacks some features and benefits found in certain columnar-first analytic DBMS. One biggie is that, absent clever workarounds such as Vertica&#8217;s in-memory write-optimized store, columnar DBMS have a single-row-update performance problem, because you are putting the information in many places on disk rather than just one. I generally take it for granted that a columnar-first vendor has such a workaround. Row-based vendors gone columnar, however, are a different story. Teradata et al. are also likely to decompress data and reassemble it into full rows as soon as it hits RAM, which obviates the potential benefit that you have less data per row clogging up cache.*<em> (Edit: As per Todd Walter&#8217;s comments below, this is not accurate &#8212; and that&#8217;s a potentially important feature.)</em></p>
<p><em>*Late decompression actually depends on columnar compression, not columnar storage, and hence can also be enjoyed by row-based DBMS such as </em><a href="../../../../../2010/06/21/netezza-ibm-db2-compression/"><em>DB2</em></a><em>. </em></p>
<p>To use Teradata Columnar, you need to be using round-robin data distribution rather than, say, hash. Teradata jargon for this is NoPI, where the &#8220;PI&#8221; stands for Primary Index.* Drawbacks to that include:</p>
<ul>
<li>You don&#8217;t get the hash distribution benefit of saving a data redistribution step on joins whose join key happens to be the same as the hash key.</li>
<li>In Teradata-land, NoPI implies append-only, so you get the garbage collection/compactification that implies.</li>
</ul>
<p>However, that&#8217;s a physical append-only; you can still do logical updates.</p>
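<p>The first drawback can be illustrated with a toy placement model. Everything here is an assumption for illustration: the table contents, the four-node cluster, and the use of Python&#8217;s built-in <code>hash</code> on small integers (a real system would use its own stable hash function).</p>

```python
def hash_distribute(rows, key, node_count):
    """Place each row on the node chosen by hashing its distribution key."""
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        nodes[hash(row[key]) % node_count].append(row)
    return nodes


def round_robin_distribute(rows, node_count):
    """Deal rows out to nodes in turn, ignoring their contents."""
    nodes = [[] for _ in range(node_count)]
    for position, row in enumerate(rows):
        nodes[position % node_count].append(row)
    return nodes


def join_is_local(left_nodes, right_nodes, key):
    """True if every matching pair of rows already sits on the same node."""
    for node_id, left_rows in enumerate(left_nodes):
        left_keys = {row[key] for row in left_rows}
        for other_id, right_rows in enumerate(right_nodes):
            if other_id != node_id and left_keys.intersection(
                row[key] for row in right_rows
            ):
                return False
    return True


customers = [{"customer_id": i} for i in range(8)]
orders = [{"customer_id": (i * 3) % 8, "order_id": 100 + i} for i in range(20)]

local_with_hash = join_is_local(
    hash_distribute(customers, "customer_id", 4),
    hash_distribute(orders, "customer_id", 4),
    "customer_id",
)
local_with_round_robin = join_is_local(
    round_robin_distribute(customers, 4),
    round_robin_distribute(orders, 4),
    "customer_id",
)
```

<p>With both tables hashed on the join key, matching rows are co-located and the join needs no redistribution. With round-robin placement they are scattered, so rows have to move before the join can run.</p>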
<p><em>*PI is not to be confused with PPI, which stands for Partitioned Primary Index, and is Teradata&#8217;s name for range (or case-statement-based) partitioning. PPI works just fine with Teradata Columnar. As of Teradata 14, you can do PPI up to 62 levels deep.</em></p>
<p>The Teradata folks also sent along a slide deck laying out parts of the <a href="http://www.monash.com/uploads/Teradata-Columnar-September-2011.ppt">Teradata Columnar</a> story. But it&#8217;s not one of the better Teradata decks I&#8217;ve ever posted.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/09/22/teradata-columnar-compression/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Eight kinds of analytic database (Part 2)</title>
		<link>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/</link>
		<comments>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/#comments</comments>
		<pubDate>Tue, 05 Jul 2011 08:18:18 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Archiving and information preservation]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Buying processes]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Complex event processing (CEP)]]></category>
		<category><![CDATA[Data mart outsourcing]]></category>
		<category><![CDATA[Data types]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MOLAP]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Rainstor]]></category>
		<category><![CDATA[SAND Technology]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[SenSage]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4867</guid>
		<description><![CDATA[In Part 1 of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I&#8217;ll cover four more kinds of analytic database &#8212; even newer, for the most part, with a use case/product short list [...]]]></description>
			<content:encoded><![CDATA[<p>In <a href="http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/">Part 1</a> of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I&#8217;ll cover four more kinds of analytic database &#8212; even newer, for the most part, with a use case/product short list match that is even less clear.  <span id="more-4867"></span></p>
<p><strong><em>Bit bucket</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included: </em>Logs, other technical/external</li>
<li><em>Likely use styles:</em> Staging/ETL, investigative</li>
<li><em>Canonical example: </em>Log files in a Hadoop cluster</li>
<li><em>Stresses:</em> TCO, scale-out, transform/big-query performance, ETL functionality</li>
</ul>
<p>With the explosion of <a href="../../../../../2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a> has come the need for a place to put it all, sometimes called the <a href="../../../../../2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a>. This is like the investigative data mart for big databases, but more <a href="../../../../../2011/05/17/poly-structured-database/">poly-structured</a>. In some cases it is focused on data staging and transformation; but it can also be used for analysis in place.</p>
<p>The list of candidate technologies to run your bit bucket starts with Hadoop and Splunk.</p>
<p><strong><em>Archival data store</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included: </em>Operational, CDR (call detail record), security log</li>
<li><em>Likely use styles:</em> Archival, reporting (for compliance), possibly also investigative</li>
<li><em>Examples:</em> Any long-term detailed historical store</li>
<li><em>Stresses: </em>TCO, compression, scale-out, performance (if multi-use)</li>
</ul>
<p>Analytic DBMS vendors have been insulting each other with the claim &#8220;that&#8217;s just an archival data store,&#8221; dating back at least to the first time Greenplum was deployed on an underpowered Sun Thumper system. Perhaps only <a href="../../../../../2010/06/11/rainstor-update/">Rainstor</a> truly embraces the archival positioning, and I&#8217;ve become pretty dubious about their technical claims and their company alike.</p>
<p>Still, there&#8217;s a legitimate need for data stores &#8212; especially relational analytic DBMS &#8212; that:</p>
<ul>
<li>Store data cheaply, with high rates of compression.</li>
<li>Have decent performance if you do want to query the data.</li>
<li>May have archiving/compliance-specific features as well.</li>
</ul>
<p>Along with Rainstor, SAND and SenSage have at least partially targeted that use case. In addition, appliance vendors such as Teradata and Netezza try to have an archive-oriented product version in their lineups.</p>
<p><strong><em>Outsourced data mart</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All</li>
<li><em>Likely use styles:</em> Traditional BI, investigative analytics, staging/ETL</li>
<li><em>Examples:</em> Advertising tracking, SaaS CRM</li>
<li><em>Stresses:</em> Performance, TCO, reliability, concurrency</li>
</ul>
<p>Much of what happens in analytic database management can also be outsourced. Some applications that run via SaaS (Software as a Service) are analytic. I&#8217;ve had three different clients whose main business is picking marketing targets in various vertical segments; others who wanted to add analytics to what were historically OLTP applications; and others yet who just offered online business intelligence. Also, if your fundamental business is gathering data and reselling it to a variety of user organizations, that&#8217;s an analytic data management challenge. The possibilities expand from there.</p>
<p>Data outsourcers are in the IT business, and so their IT development is &#8212; hopefully! &#8212; more serious and less politically encumbered than at many conventional enterprises. Thus, legacy systems and master data management issues are commonly less prevalent, or at least more aggressively disposed of. The same, up to a point, goes for vendor politics.* <a href="../../../../../2011/06/26/what-to-think-about-before-you-make-a-technology-decision/">Multitenancy</a> is commonly an issue, as is running in the cloud.</p>
<p><em>*Even so, there&#8217;s often That Guy who doesn&#8217;t want to migrate away from Oracle, no matter what.</em></p>
<p>Vertica gets the nod in a number of these cases; it&#8217;s cloud-friendly, and often the problem is naturally columnar. Other columnar products can be good choices too, with added brownie points for Infobright if the shop is MySQL-oriented anyway. Running Netezza or other appliances makes sense mainly if you&#8217;re pretty sure you want to keep operating your own data centers, but some data outsourcers are just fine with that assumption.</p>
<p><strong><em>Operational analytic(s) server</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> Customer-centric, log, financial trade</li>
<li><em>Likely use styles:</em> Advanced operational analytics</li>
<li><em>Examples:</em>
<ul>
<li>Lower latency: Web or call-center personalization, anti-fraud</li>
<li>Higher latency: Customer profiling, Basel 3 risk analysis</li>
</ul>
</li>
<li><em>Stresses:</em> Performance, reliability, analytic functionality, perhaps concurrency</li>
</ul>
<p>Even with eight different choices, I need a &#8220;catch-all&#8221; category; this is it.</p>
<p>Suppose you want to do reasonably sophisticated analytics, then use the results in operations. This is the classical challenge in <a href="../../../../../2011/03/30/short-request-and-analytic-processing/">integrating short-request and analytic processing</a>. There are multiple ways to tackle it, embodying different trade-offs in cost, convenience, or analytic accuracy. If the platform on which you want to run your investigative analytics also has the reliability and concurrency appropriate for mission-critical operations, you&#8217;re set. Otherwise, you may want to pipe <a href="../../../../../2010/11/29/data-that-is-derived-augmented-enhanced-adjusted-or-cooked/">derived data</a> into a more &#8220;industrial-strength&#8221; DBMS, ideally the one that runs your operational apps anyway.</p>
<p>Another option is to integrate a limited amount of analytics immediately into your short-request processing system. For example, as bad as they are at the kinds of queries that require joins, NoSQL systems are often fast at simple aggregations. As MapReduce/NoSQL integrations mature, that option may not require pumping the data anywhere else for deeper analytics; even if it does, at least you&#8217;re starting out with the data in a convenient bit bucket.</p>
<p>Streaming/CEP-centric architectures could come into play as well. And it goes on from there. The possibilities in this last category are just too varied to generalize about.</p>
<p><em>So did I get them all? Or are there yet other analytic data management use cases that don&#8217;t fit into my eight categories?</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Rainstor update</title>
		<link>http://www.dbms2.com/2010/06/11/rainstor-update/</link>
		<comments>http://www.dbms2.com/2010/06/11/rainstor-update/#comments</comments>
		<pubDate>Fri, 11 Jun 2010 10:54:09 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Archiving and information preservation]]></category>
		<category><![CDATA[Rainstor]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2252</guid>
		<description><![CDATA[I was tired and cranky when I talked with my former clients at Rainstor (formerly Clearpace) yesterday, so our call was shorter than it otherwise might have been. Anyhow, there&#8217;s a new version called Rainstor 4, the two main themes of which are: Compliance-specific features. Bottleneck Whack-A-Mole. The point is that Rainstor is focusing its [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">I was tired and cranky when I talked with my former clients at <a href="http://www.dbms2.com/2009/12/11/rainstor-clearpace/">Rainstor (formerly Clearpace)</a> yesterday, so our call was shorter than it otherwise might have been. Anyhow, there&#8217;s a new version called Rainstor 4, the two main themes of which are:</p>
<ul>
<li>Compliance-specific features.</li>
<li><a href="http://www.dbms2.com/2009/08/21/bottleneck-whack-a-mole/">Bottleneck Whack-A-Mole</a>.</li>
</ul>
<p style="margin-bottom: 0in;">The point is that Rainstor is focusing its efforts on enterprises that:  <span id="more-2252"></span></p>
<ul>
<li>Have a compliance mandate to keep detailed information, either now or coming down the pike.</li>
<li>Would like to query the information, either as part of the compliance mandate or for the usual business reasons one does analysis (or for that matter pinpoint lookup of historical information).</li>
<li>Might want to delete the information as soon as the compliance mandate runs out. (That&#8217;s a new feature. Frankly, I think the clients demanding it are being foolish. Information is valuable and <a href="http://www.dbms2.com/2010/04/04/the-retention-of-everything/">should never be thrown away</a> if one can afford to keep it.)</li>
<li>Might want to annotate the information, even though it is being preserved immutably. (Also a new feature. I think that one is smart.)</li>
</ul>
<p style="margin-bottom: 0in;">“Application retirement” was mentioned only in the context of Rainstor&#8217;s flagship Informatica partnership, and even then mainly for clients who had a compliance reason to keep old application data around. “Cloud” and “private cloud” get mentioned, but they don&#8217;t seem to be as central as Rainstor was previously hoping they would be. (This is one area we could and probably should have touched on more had I been more awake.)</p>
<p style="margin-bottom: 0in;">One thing that hasn&#8217;t changed:  “<a href="http://www.dbms2.com/2008/12/16/database-archiving-and-information-preservation/">Information preservation</a>,” which I coined for Rainstor at our first meeting, is still the company catchphrase.</p>
<p style="margin-bottom: 0in;">So far as I could tell, the big point on Rainstor 4 Bottleneck Whack-A-Mole is this: When you load data into Rainstor (bulk or otherwise), it likes to do some metadata analysis first. (I imagine this is related to the sophisticated <a href="http://www.dbms2.com/2009/05/14/the-secret-sauce-to-clearpaces-compression/">Rainstor compression scheme</a>.) Well, that isn&#8217;t much of a performance hit for schemas with small numbers of tables, but it is a bigger deal for more complex schemas. The Rainstor 4 fix is to remember/persist some of that analysis from one database update to the next. Sounds obvious, but so do a lot of bottleneck fixes once they are made.</p>
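<p style="margin-bottom: 0in;">To make the remember/persist idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: I don&#8217;t know how Rainstor implements this, and the <code>MetadataCache</code> class, its schema fingerprint, and the stand-in &#8220;analysis&#8221; are all hypothetical. The only point is that work keyed on an unchanged table definition can be skipped on the next load.</p>

```python
import hashlib
import json


class MetadataCache:
    """Hypothetical cache of per-table analysis, keyed by a schema fingerprint.

    Not RainStor's design; just an illustration of reusing analysis
    results from one load to the next.
    """

    def __init__(self):
        self._store = {}        # fingerprint -> analysis result
        self.analyses_run = 0   # how often the expensive step actually ran

    @staticmethod
    def fingerprint(table_name, columns):
        # Stable hash of the table name plus its sorted (column, type) pairs.
        payload = json.dumps([table_name, sorted(columns.items())])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def analyze(self, table_name, columns):
        key = self.fingerprint(table_name, columns)
        if key not in self._store:
            self.analyses_run += 1
            # Stand-in for the expensive pre-load metadata analysis.
            self._store[key] = {"table": table_name, "column_count": len(columns)}
        return self._store[key]

    def dump(self):
        # Serialize the cache so it can survive until the next load.
        return json.dumps(self._store)

    def restore(self, text):
        # Reload results saved by an earlier load.
        self._store = json.loads(text)
```

<p style="margin-bottom: 0in;">In this sketch, a second load of an unchanged table costs a hash lookup rather than a fresh analysis, while a changed table definition produces a new fingerprint and is analyzed again. With hundreds of tables, that is where the savings would come from.</p>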
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/06/11/rainstor-update/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>I&#8217;ll be speaking in Washington, DC on May 6</title>
		<link>http://www.dbms2.com/2010/04/18/washington-dc-may-2010-big-data-summi/</link>
		<comments>http://www.dbms2.com/2010/04/18/washington-dc-may-2010-big-data-summi/#comments</comments>
		<pubDate>Sun, 18 Apr 2010 21:48:15 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Archiving and information preservation]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Liberty and privacy]]></category>
		<category><![CDATA[Presentations]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1937</guid>
		<description><![CDATA[My clients at Aster Data are putting on a sequence of conferences called &#8220;Big Data Summit(s)&#8221;, and wanted me to keynote one. I agreed to the one in Washington, DC, on May 6, on the condition that I would be allowed to start with the same liberty and privacy themes I started my New England [...]]]></description>
			<content:encoded><![CDATA[<p>My clients at Aster Data are putting on a sequence of conferences called &#8220;Big Data Summit(s)&#8221;, and wanted me to keynote one. I agreed to the one <a href="http://bigdatasummit.com/2010/dc/">in Washington, DC, on May 6</a>, on the condition that I would be allowed to start with the same liberty and privacy themes I started my <a href="http://www.dbms2.com/2010/01/31/data-based-snooping-threat-libert/">New England Database Summit keynote</a> with. Since I already knew Aster to be one of the multiple companies in this industry that is responsibly concerned about the liberty and privacy threats we&#8217;re all helping cause, I expected them to agree to that condition immediately, and indeed they did.</p>
<p>On a rough-draft basis, my talk concept is:</p>
<p style="margin-bottom: 0in;"><strong>Implications of New Analytic Technology in four areas:</strong></p>
<ul>
<li><strong>Liberty &amp; privacy</strong></li>
<li><strong>Data acquisition &amp; retention</strong></li>
<li><strong>Data exploration</strong></li>
<li><strong>Operationalized analytics</strong></li>
</ul>
<p>I haven&#8217;t done any work yet on the talk besides coming up with that snippet, and probably won&#8217;t until the week before I give it. Suggestions are welcome.</p>
<p>If anybody actually has a link to a clear discussion of legislative and regulatory data retention requirements, that would be cool. I know they&#8217;ve exploded, but I don&#8217;t have the details.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/04/18/washington-dc-may-2010-big-data-summi/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>The retention of everything</title>
		<link>http://www.dbms2.com/2010/04/04/the-retention-of-everything/</link>
		<comments>http://www.dbms2.com/2010/04/04/the-retention-of-everything/#comments</comments>
		<pubDate>Sun, 04 Apr 2010 07:25:37 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Archiving and information preservation]]></category>
		<category><![CDATA[Liberty and privacy]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1833</guid>
		<description><![CDATA[I&#8217;d like to reemphasize a point I&#8217;ve been making for a while about data retention: As costs go down, the wisdom of keeping detailed data goes up. I’d go so far as to say that every piece of data generated by a human being should be preserved and kept online, legal and privacy considerations permitting.* [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;d like to reemphasize a point I&#8217;ve been making for a while about <a href="http://www.dbms2.com/2009/12/07/data-warehouse-volume-growth/">data retention</a>:<span id="more-1833"></span></p>
<blockquote>
<p style="margin-bottom: 0in;">As costs go down, the wisdom of keeping <strong>detailed data</strong> goes up. I’d go so far as to say that <strong>every piece of data generated by a human being should be preserved and kept online,</strong> legal and privacy considerations permitting.* Most forms of capital-, labor-, and/or location-based competitive advantage are being commoditized and/or globalized away. But information remains a unique corporate asset. Don’t discard it lightly.</p>
<p style="margin-bottom: 0in;"><em>*Unless there’s an explicit law mandating data destruction, legal considerations </em>should <em>permit. The idea “Let’s destroy something of irreplaceable value today, against the possibility we might be brought to judgment tomorrow” is both morally and pragmatically weird. Privacy, however, may be a different matter.</em></p>
</blockquote>
<p style="margin-bottom: 0in;">That applies to the structured/tabular kinds of data I tend to focus on in this blog. It applies even more to anything like a document (or email, instant message, whatever) that somebody has taken the trouble to put into words. A top document-oriented archiving analyst (and my good friend), David Ferris, <a href="http://www.ferris.com/2008/04/02/expect-to-archive-everything/">quite</a> <a href="http://www.ferris.com/2008/03/30/you-dear-reader-are-immortal/">agrees</a>. As David puts it:</p>
<blockquote><p>I think we’ll end up archiving everything, except egregious garbage like spam:</p>
<ul>
<li> It’s too hard to get users to conform to policy.</li>
<li> Automated methods of capturing a human-understandable policy, for example “tax records,” are too hard to implement through automatic filters. The filters are too inaccurate.</li>
<li> It’s impractical to get users to classify everything, and automatic classification is too crude.</li>
<li> You never know what you might want later. Stuff you think you won’t want now may end up being very useful.</li>
<li> The cost of storage is trivial when looked at on a per-user basis.</li>
</ul>
</blockquote>
<p>In particular, I think information destruction is a crude instrument for the protection of privacy, wasteful at best, and likely to be vigorously resisted by governments and large businesses.  For example:</p>
<ul>
<li>Businesses are increasingly subject to retention-oriented compliance regulation. Your lawyers may want you to destroy information that could be used to sue you, but governments won&#8217;t let you.</li>
<li>Information about individuals&#8217; web surfing is being retained, under law, so that they may be fingered later for pornography consumption or illegal file sharing. I deplore some of the ways <a href="http://www.monashreport.com/2006/06/06/freedom-even-without-data-privacy/">web-surfing data can be and is being used</a>, and want <a href="http://www.dbms2.com/2010/04/04/privacy-liberty-continued/">laws passed to rein them in</a>. But the retention will happen.</li>
<li>Marketers want all that data. Duh.</li>
<li><a href="http://blogs.computerworld.com/node/1002">Electronic health records</a> are coming &#8212; slowly, but they&#8217;ll get here some day.</li>
</ul>
<p>Besides, <a href="http://www.dbms2.com/category/database-management-system/archiving-and-information-preservation/">archiving technologies</a> are getting ever more cost-effective.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/04/04/the-retention-of-everything/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Notes on RainStor, the company formerly known as Clearpace</title>
		<link>http://www.dbms2.com/2009/12/11/rainstor-clearpace/</link>
		<comments>http://www.dbms2.com/2009/12/11/rainstor-clearpace/#comments</comments>
		<pubDate>Sat, 12 Dec 2009 00:15:02 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Archiving and information preservation]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Rainstor]]></category>
		<category><![CDATA[SenSage]]></category>
		<category><![CDATA[Telecommunications]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1295</guid>
		<description><![CDATA[Information preservation* DBMS vendor Clearpace officially changed its name to RainStor this week. RainStor is also relocating its CEO John Bantleman and more generally its headquarters to San Francisco. This all led to a visit with John and his colleague Ramon Chen, highlights of which included: RainStor expects to finish the year with &#62; 50 [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;"><a href="http://www.dbms2.com/2008/12/16/database-archiving-and-information-preservation/">Information preservation</a>* DBMS vendor Clearpace officially changed its name to RainStor this week. RainStor is also relocating its CEO John Bantleman and more generally its headquarters to San Francisco. This all led to a visit with John and his colleague Ramon Chen, highlights of which included:<span id="more-1295"></span><!--more--></p>
<ul>
<li>RainStor expects to finish the 	year with &gt; 50 users (overwhelmingly via partners)</li>
<li>A big market for RainStor (at 	least in terms of signed partnerships and large deal activity) is 	retention of telecom records, for compliance purposes, typically for 	a 1-3 year period. This includes:
<ul>
<li>CDRs (Call Detail Records)</li>
<li>Mobile phone records including 	CDRs and missed calls</li>
<li>SMS (Short Message Service), 	including the complete text of same</li>
</ul>
</li>
<li>RainStor thinks a number of larger telcos have the need to store a billion records per day each. (I&#8217;m not sure how many subscribers such a telco would have to have.)</li>
<li>John further thinks that, for the same query performance, RainStor can handle such a database on 4 blades. More precisely, he says that&#8217;s what happened in a test conducted by a major technology firm. In the same test case, SenSage required 40 blades, and Oracle required 80 or more cores on a pair of big SMP machines. John further says that the Oracle solution required a new table and new tablespace every day, while RainStor&#8217;s took 3 days for initial installation and required no DBA afterwards. However, I&#8217;m in no position to verify this report independently.</li>
<li>In a different kind of proof point, so extreme it gives even the RainStor folks pause, a user has retired 300 different applications and put their databases onto a single 2-core box. (Presumably, this is via RainStor&#8217;s OEM relationship with Informatica.)</li>
<li>Coming Very Soon are some services tying RainStor&#8217;s DBMS to obvious-suspect SaaS offerings. The core positioning is “SaaS data escrow”; i.e., RainStor will help you ensure that, in a worst-case scenario, there&#8217;s a nice safe copy of your data you can get at. RainStor also encourages you to do basic reporting and BI against the RainStor copy of the data, if you choose.</li>
<li>The idea I&#8217;ve been pushing lately of taking a heterogeneous replication offering like Continuent&#8217;s and having it feed an archiving store like RainStor&#8217;s has hit a rather basic snag. RainStor doesn&#8217;t actually consume change data capture kinds of information directly, at least as of yet, because of difficulties fitting such a stream into its guaranteed-data-immutability model.</li>
</ul>
<p><em>*I coined that category description for John in the tea room of the Park Lane Hotel. He&#8217;s subsequently embraced it enthusiastically, and I kind of like it myself. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </em></p>
<p style="margin-bottom: 0in;">
<p style="margin-bottom: 0in;"><em><strong>Related links</strong></em></p>
<ul>
<li>
<p style="margin-bottom: 0in;">RainStor&#8217;s approach to 	compression, as described by <a href="http://www.dbms2.com/2009/05/14/the-secret-sauce-to-clearpaces-compression/">me</a> and by <a href="http://www.rainstor.com/news-blog/blog/rainstors-secret-sauce-data-and-pattern-deduplication">RainStor itself</a></p>
</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/12/11/rainstor-clearpace/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Boston Big Data Summit keynote outline</title>
		<link>http://www.dbms2.com/2009/11/23/boston-big-data-summit-keynote-outline/</link>
		<comments>http://www.dbms2.com/2009/11/23/boston-big-data-summit-keynote-outline/#comments</comments>
		<pubDate>Mon, 23 Nov 2009 06:25:50 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Archiving and information preservation]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[DBMS product categories]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Humor]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Presentations]]></category>
		<category><![CDATA[Pricing]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Storage]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Theory and architecture]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1227</guid>
		<description><![CDATA[Last month, Bob Zurek asked me to give a talk on “Big Data”, where “big” is anything from a few terabytes on up, then moderate a panel on cloud computing. We agreed that I could talk just from notes, without slides. So, since I have them typed up, I&#8217;m posting them below. The top two [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Last month, Bob Zurek asked me to give a talk on <a href="http://www.dbms2.com/2009/10/09/presentations-upcoming/">“Big Data”, where “big” is anything from a few terabytes on up</a>, then moderate a panel on cloud computing. We agreed that I could talk just from notes, without slides. So, since I have them typed up, I&#8217;m posting them below.</p>
<p><span id="more-1227"></span></p>
<p style="margin-bottom: 0in;">The top two points from Q&amp;A probably were:</p>
<ul>
<li><strong>Big Data and the cloud actually 	have relatively little to do with each other,</strong> <a href="http://www.dbms2.com/2009/10/30/aster-data-application-server-ncluster/">a few exceptions</a> notwithstanding, especially if the data is in a shared-nothing DBMS 	(as opposed to, say, a MapReduce-oriented file cluster). Two 	principal reasons are:
<ul>
<li>Redistributing data from node to 	node is a little slow, undermining some of the elasticity benefits 	of the cloud.</li>
<li><a href="http://www.dbms2.com/2009/05/29/sneakernet-to-the-cloud/">Getting data into the cloud in the 	first place is a lot slow</a>.</li>
</ul>
</li>
<li><strong>The NoSQL movement is a lot like 	the Ron Paul campaign</strong> &#8212; it consists of people who are dissatisfied 	with the status quo, whose dissatisfaction has a lot to do with 	insufficient liberty and/or excessive expenditure, and who otherwise 	don&#8217;t have a whole lot in common with each other.</li>
</ul>
<p style="margin-bottom: 0in;">Anyhow, here are my notes for the talk, edited in just a couple of places for readability or linkage.</p>
<p style="margin-bottom: 0in;">
<p style="margin-bottom: 0in;">
<p style="margin-bottom: 0in;"><strong>Quick introduction</strong></p>
<ul>
<li>Big Data vs. cloud</li>
<li>How big is Big Data?</li>
<li>At the low end of that range, 	there&#8217;s little you can&#8217;t do with conventional technology if you 	have:
<ul>
<li>An unlimited budget for hardware</li>
<li>An unlimited budget for software</li>
<li>An unlimited budget for people, 	especially Oracle DBAs</li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;"><strong>Big Data in OLTP</strong></p>
<ul>
<li>Hard-core OLTP
<ul>
<li>Focus of DBMS technology for a long time</li>
<li>Big budgets because each 	transaction has significant value</li>
<li>Tough to get users to change 	technologies</li>
</ul>
</li>
<li>Lighter-weight OLTP
<ul>
<li>Classic example = web companies
<ul>
<li>Big ones &#8212;  retail-oriented ones 	(eBay, Amazon) partially excepted &#8212; <a href="http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/">rolled their own technology 	stacks</a></li>
<li>Reluctant to give money to anybody
<ul>
<li>Open source, etc.</li>
</ul>
</li>
</ul>
</li>
<li>Difficulty finding market
<ul>
<li>Product vs. feature
<ul>
<li>Clustering/HA/DR/whatever</li>
<li>Ditto cloud enablement</li>
</ul>
</li>
<li>True products haven&#8217;t found much 	traction yet</li>
</ul>
</li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;"><strong>Analytic Big Data use cases</strong></p>
<ul>
<li>Kinds of data for analytics
<ul>
<li>More of same != big</li>
<li>More detail and/or new kinds
<ul>
<li>Complete data sets</li>
<li>Transactions</li>
<li>Call details</li>
<li>Tick/trade history</li>
<li>Web clickstreams</li>
<li>Network event logs</li>
<li>Other machine-generated data</li>
<li>CAM bottom line
<ul>
<li>Anything human-generated should 	and will be retained in its entirety</li>
<li>Quantities of machine-generated 	data retained should and will grow roughly in line w/ computing cost 	reductions (Moore&#8217;s Law, etc.)</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
<li>Analytic uses of Big Data
<ul>
<li>Analytics is mainly about three 	things
<ul>
<li>Problem detection</li>
<li>Customer relationship improvement
<ul>
<li>(Those overlap when the customer 	relationship is bad)</li>
</ul>
</li>
<li>Financial statements on steroids</li>
</ul>
</li>
</ul>
<ul>
<li>Main kinds of analytics
<ul>
<li>What BI vendors traditionally sell
<ul>
<li>General reporting and dashboards</li>
<li>Ad-hoc query (now driven from 	those reports and dashboards)</li>
<li>Planning (allegedly integrated 	with BI)</li>
</ul>
</li>
<li>Research
<ul>
<li>Ad hoc relational query (worth 	mentioning twice because it drives so much of the market)</li>
<li>Data mining</li>
<li>Most web search and web mining</li>
</ul>
</li>
<li>Operational/near-real-time</li>
<li>Archiving/compliance</li>
</ul>
</li>
<li>What gets Big?
<ul>
<li>Mainly research and archiving</li>
<li>But when reporting or operational 	get Big, you have really interesting computing problems</li>
</ul>
</li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;"><strong>Technology issues and trends</strong></p>
<ul>
<li>Moore&#8217;s Law
<ul>
<li>CPUs &#8212; All about cores, hence 	parallelism is key</li>
<li>RAM</li>
<li>SSDs – hence replace disks</li>
<li>Sensors – hence generate lots 	more data</li>
</ul>
</li>
<li>Kryder&#8217;s Law
<ul>
<li>But <a href="http://www.dbms2.com/2005/11/13/breaking-the-disk-speed-barrier/">rotational speeds up only 	12.5X since Eisenhower Administration</a></li>
<li>Hence solid-state memory (or RAM) 	will soon take over</li>
</ul>
</li>
<li>In the meantime, I/O bottlenecks have had to be beaten
<ul>
<li>Hence sequential scans</li>
<li>Hence <a href="http://www.dbms2.com/2007/03/26/index-light-mpp-data-warehouse-appliances/">index-light</a> architectures</li>
<li>Hence columnar</li>
</ul>
</li>
<li>DBMS “overhead”
<ul>
<li>Raw license and maintenance fees – 	software increasing fraction of total</li>
<li>OLTP vestiges – locking and all 	that</li>
<li>DBAs
<ul>
<li>People costs = huge fraction of 	total</li>
<li>Index-lightness addresses</li>
<li>So does appliance</li>
</ul>
</li>
<li>Many people don&#8217;t really know how to 	write SQL</li>
</ul>
</li>
<li>Configuration
<ul>
<li>Appliance/tightly-balanced
<ul>
<li>Netezza</li>
<li>Teradata earlier</li>
<li>Greenplum/Sun</li>
<li>Oracle</li>
<li>IBM</li>
<li>Microsoft/Madison</li>
</ul>
</li>
<li>Commodity/do what you want
<ul>
<li>Vertica</li>
<li>Greenplum now</li>
<li>Infobright, Aster and others</li>
<li>MapReduce-oriented file systems</li>
</ul>
</li>
<li><a href="http://www.dbms2.com/2009/10/25/data-warehouse-balanced-hardware-configuration/">Extreme rigidity is silly</a>
<ul>
<li><a href="http://www.dbms2.com/2009/10/25/teradata-hardware-strategy-and-tactics/">Teradata, Oracle have both 	signaled moving to more modularity</a></li>
<li>Big driver of that = heterogeneous 	storage
<ul>
<li>Cheap disk</li>
<li>Expensive disk</li>
<li>Solid-state</li>
<li>RAM</li>
</ul>
</li>
</ul>
<ul>
<li>CPU/storage ratio is even more of a 	driver</li>
</ul>
</li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;"><strong>Theoretically defensible ways to segment the market</strong></p>
<ul>
<li><a href="http://www.dbms2.com/2009/09/10/analytic-speed-latency/">Latency requirements</a>
<ul>
<li>High availability and low latency 	go together</li>
</ul>
</li>
<li>Query types
<ul>
<li>Simultaneous users for same</li>
</ul>
</li>
<li>Database size</li>
<li>Budget</li>
</ul>
<p style="margin-bottom: 0in;"><strong>Actual segments right now</strong></p>
<ul>
<li><a href="http://www.dbms2.com/2009/08/24/teradatas-active-enterprise-data-warehouse-story/">Utter ADW/EDW</a></li>
<li>Data mart
<ul>
<li>Size</li>
<li>Naturally columnar vs. naturally 	row-based</li>
</ul>
</li>
<li>Operational/frontline</li>
<li>Less dramatic/smaller EDW</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/11/23/boston-big-data-summit-keynote-outline/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Notes on the Oracle Database 11g Release 2 white paper</title>
		<link>http://www.dbms2.com/2009/09/21/notes-on-the-oracle-database-11g-release-2-white-paper/</link>
		<comments>http://www.dbms2.com/2009/09/21/notes-on-the-oracle-database-11g-release-2-white-paper/#comments</comments>
		<pubDate>Mon, 21 Sep 2009 17:12:33 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Archiving and information preservation]]></category>
		<category><![CDATA[Cache]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[Memory-centric data management]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Oracle TimesTen]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Storage]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=923</guid>
		<description><![CDATA[The Oracle Database 11g Release 2 white paper I cited a couple of weeks ago has evidently been edited, given that a phrase I quoted last month is no longer to be found. Anyhow, here are some quotes from and comments on what evidently is the latest version. The In-Memory Database Cache (IMDB Cache) option [...]]]></description>
			<content:encoded><![CDATA[<p>The <a href="http://www.oracle.com/technology/products/database/oracle11g/pdf/oracle-database-11g-release2-overview.pdf">Oracle Database 11g Release 2 white paper</a> I cited <a href="http://www.dbms2.com/2009/09/03/oracle-11g-exadata-hybrid-columnar-compression/">a couple of weeks ago</a> has evidently been edited, given that a phrase I quoted last month is no longer to be found. Anyhow, here are some quotes from and comments on what evidently is the latest version.<span id="more-923"></span></p>
<blockquote><p>The In-Memory Database Cache (IMDB Cache) option of Oracle Database 11g Release 2, allows data to be cached and processed in the memory of the applications themselves, off-loading the data processing to middle tier resources. Any network latency between the middle tier and the back-end database is removed from the transaction path, with the result that individual transactions can often be executed up to 10 times faster. This is particularly useful where very high rates of transaction processing is required, such as those found under market trading systems, Telco switching systems, and Real Time manufacturing environments. All data in the middle tier is fully protected through local recovery, and asynchronous posting to the back end Oracle Database. With Oracle Database 11g Release 2, the ability to transparently deploy IMDB Cache with existing Oracle applications becomes much easier – with common data types, SQL and PL/SQL support, and native support for the Oracle Call Interface (OCI).</p></blockquote>
<p>At a guess, this sounds like it&#8217;s based on Oracle&#8217;s TimesTen acquisition.</p>
<blockquote><p>Oracle Database 11g Release 2 adds further optimizations, including capabilities to automatically determine the most optimal degree of parallelization for a query, based on available resources. With this comes automated parallel statement queuing, where the database determines that, based on current resource availability, it is more effective to queue a query for later execution once required resources have freed up.</p></blockquote>
<p>Sounds like a kind of automatic workload management &#8212; i.e., the kind of optimization vendors of mature products get around to putting into their systems. It does not sound like query pipelining, however.</p>
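<p>For readers who want the queuing idea spelled out, here is a toy Python sketch of admission control over a fixed pool of parallel workers. It is not Oracle&#8217;s algorithm, which the white paper doesn&#8217;t describe; the <code>StatementQueue</code> class, its slot counts, and the strict first-in-first-out rule are all assumptions made for illustration.</p>

```python
from collections import deque


class StatementQueue:
    """Toy admission control: run a query now if enough workers are free,
    otherwise hold it in arrival order. Hypothetical, not Oracle's logic."""

    def __init__(self, total_slots):
        self.total = total_slots
        self.free = total_slots
        self.running = {}        # query id -> slots held
        self.waiting = deque()   # (query id, slots) in arrival order

    def submit(self, query_id, wanted_slots):
        # Cap the request at what the system could ever grant.
        slots = min(wanted_slots, self.total)
        # Never jump ahead of statements that are already waiting.
        if not self.waiting and slots <= self.free:
            self._start(query_id, slots)
            return "running"
        self.waiting.append((query_id, slots))
        return "queued"

    def finish(self, query_id):
        # Release the finished query's slots, then admit waiters in order.
        self.free += self.running.pop(query_id)
        started = []
        while self.waiting and self.waiting[0][1] <= self.free:
            waiting_id, slots = self.waiting.popleft()
            self._start(waiting_id, slots)
            started.append(waiting_id)
        return started

    def _start(self, query_id, slots):
        self.free -= slots
        self.running[query_id] = slots
```

<p>With 8 slots, a query asking for 6 runs at once, and a later query asking for 4 waits until the first one finishes rather than starting with too few workers. That is the trade the quoted passage describes: a short wait in exchange for running at the intended degree of parallelism.</p>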
<blockquote><p>Oracle Database 11g Release 2 will automatically distribute a large compressed table (or a smaller non-compressed table), into the available memory across all the servers in the Grid, and will then localize parallel query processing to the data in memory on the individual nodes. This dramatically improves query performance, and is especially useful where large tables can be entirely compressed into the available memory using compression capabilities.</p></blockquote>
<p>So Oracle caches compressed data. Not stated is which compression techniques are covered.</p>
<blockquote><p>Each Exadata Storage Server stores up to 7 Terabyte [sic] of uncompressed user data, and also comes enabled with 384 GB of solid-state Flash cache. This Flash Cache automatically caches active data of the magnetic disks in the Oracle Exadata Storage Server, delivering a 10x performance gain for read and write operations under OLTP applications.</p></blockquote>
<p>Sounds like the Flash memory is positioned for OLTP use.</p>
<blockquote><p>In the past, Database Administrators and System Administrators have spent a great deal of time determining to how best place data across these disk arrays, to get maximum performance and availability. The best procedure for data placement is to simply Stripe And Mirror Everything; stripe data blocks equally across all disks in an array, and then mirror the blocks on at least two disks. This approach provides the perfect balance between performance, disk utilization, and ease of use.</p></blockquote>
<p>This is a big part of what could be called the &#8220;Administering Oracle doesn&#8217;t suck nearly as badly as it used to&#8221; pitch. (Mitchell Kertzman, who was Sybase CEO after the mid-1990s meltdown, told me his motto was &#8220;We suck less every day.&#8221; But I digress &#8230;)</p>
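<p>For concreteness, the &#8220;Stripe And Mirror Everything&#8221; rule quoted above can be sketched in a few lines of Python. This is a simplified model under my own assumptions (round-robin striping, with each mirror copy on the next disk in the array). It is not how ASM actually chooses extents, and the function names are made up for illustration.</p>

```python
def place_block(block_no, num_disks, copies=2):
    """Return the disks holding each copy of a block.

    Assumed policy: the primary copy is striped round-robin, and each
    mirror copy goes on the next disk in the array.
    """
    if copies not in range(1, num_disks + 1):
        raise ValueError("copies must be between 1 and num_disks")
    primary = block_no % num_disks
    return [(primary + i) % num_disks for i in range(copies)]


def layout(num_blocks, num_disks, copies=2):
    """Build the per-disk block lists for a whole set of blocks."""
    disks = [[] for _ in range(num_disks)]
    for block_no in range(num_blocks):
        for disk in place_block(block_no, num_disks, copies):
            disks[disk].append(block_no)
    return disks
```

<p>With 8 blocks on 4 disks, every disk ends up holding 4 block copies, and no block has both copies on the same disk. That even spread is the point of the rule.</p>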
<blockquote><p>Automatic Storage Management (ASM), a feature of Oracle Database 11g automates the striping and mirroring of database without the need to purchase third party volume management software. As data volumes increase, additional disks can be added, and ASM will automatically restripe and rebalance the data across available disks to ensure optimal performance. Similarly, disks that report errors can be removed from the disk array, and ASM will re-adjust accordingly.</p></blockquote>
<p>I.e., you can add disks without taking the system down. Online expansion of that kind is becoming a pretty standard feature for serious parallel DBMS.</p>
<blockquote><p>Oracle Database 11g Release 2 improves ASM in significant areas. New intelligent data placement capabilities store infrequently accessed data on the inner rings of the physical disks, while frequently accessed data is placed on the outer rings, offering better performance optimization.</p></blockquote>
<p>Also pretty standard.</p>
<blockquote><p>Oracle has been enhancing partitioning capabilities for over ten years. Oracle Partitioning, an option of Oracle Database 11g Release 2, allows very large tables (and their associated indexes) to be partitioned into smaller, more manageable units, providing a “divide and conquer” approach to very large database management. Partitioning also improves performance, as the optimizer will prune queries to only use the relevant partitions of a table or index in a lookup. Oracle Database 11g Release 2 provides multiple methods for partitioning data, and also allows different levels of partitioning on the same table, so that a single partitioning strategy can be used to improve both performance and manageability.</p></blockquote>
<p>Even better might be a system that doesn&#8217;t lean heavily on complex partitioning to achieve good performance.</p>
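<p>For readers who haven&#8217;t run into partition pruning, here is a minimal Python sketch of the idea, assuming monthly range partitions on an order date as in the white paper&#8217;s own example. The class and method names are hypothetical, and a real optimizer prunes from partition metadata rather than from a Python dictionary.</p>

```python
from datetime import date


def month_key(d):
    """Partition key for a date: (year, month)."""
    return (d.year, d.month)


class MonthlyPartitionedTable:
    """Toy table split into one partition per calendar month (illustrative only)."""

    def __init__(self):
        self.partitions = {}  # (year, month) -> list of (order_date, row)

    def insert(self, order_date, row):
        self.partitions.setdefault(month_key(order_date), []).append((order_date, row))

    def scan(self, start, end):
        """Return matching rows plus the list of partitions actually read."""
        lo, hi = month_key(start), month_key(end)
        # Pruning step: skip every partition whose month lies outside the range.
        touched = [k for k in sorted(self.partitions) if lo <= k <= hi]
        rows = [row for k in touched
                for d, row in self.partitions[k] if start <= d <= end]
        return rows, touched
```

<p>A date-range query over a table holding seven years of orders then reads only the months it needs, which is where the performance claim comes from.</p>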
<blockquote><p>Oracle Partitioning can also manage the lifecycle of information. Typically, all databases have active data – the information being processed this month or quarter, and historical data that is primarily read-only. Organizations can take advantage of the inherent lifecycle of data to implement a multi-tiered storage solution and lower their overall storage costs. For example, a large table within an order-entry system could contain all the orders processed in the last 7 years. Oracle Partitioning can be used to set up monthly partitions, with the current last four months of order data partitioned onto a high-end storage array, with all the other partitions placed on a lower-cost storage solution, often 2-3 times less cost than the high end storage environment.</p></blockquote>
<p>This is becoming a standard feature for any parallel DBMS that can support multiple kinds of storage in one system.</p>
<blockquote><p>Oracle Database 11g also provides advanced compression techniques to further reduce storage requirements. Using Oracle Advanced Compression, an option to Oracle Database 11g, all data in a table can be compressed using a continuous table compression capability that achieves a 2-4 times compression ratio with little performance impact on OLTP or Data Warehousing workloads. This compression technology replaces duplicate values in a table with a single value, and continuously adapts to data changes over time, so compression ratios are always maintained.</p></blockquote>
<p>Sounds like dictionary/token compression.</p>
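<p>In case the term is unfamiliar, here is a minimal Python sketch of dictionary/token compression in general. It is not Oracle&#8217;s implementation, which the white paper doesn&#8217;t describe in detail; the function names and the single table-wide dictionary are my assumptions.</p>

```python
def dictionary_encode(values):
    """Store each distinct value once and replace occurrences with small integer tokens."""
    dictionary = []   # token -> value
    index = {}        # value -> token
    tokens = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        tokens.append(index[v])
    return dictionary, tokens


def dictionary_decode(dictionary, tokens):
    """Rebuild the original column from the dictionary and token list."""
    return [dictionary[t] for t in tokens]
```

<p>A column such as NY, CA, NY, NY, CA, TX becomes the dictionary NY, CA, TX plus the tokens 0, 1, 0, 0, 1, 2. The more repetition in the data, the better the ratio, which is one reason quoted compression figures vary so much by data set.</p>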
<blockquote><p>With Oracle Database 11g Release 2, the Exadata Storage Servers in the Sun Oracle Database Machine also enable new hybrid columnar compression technology that provides up to a 10 times compression ratio, with corresponding improvements in query performance. And, for pure historical data, a new archival level of hybrid columnar compression can be used that provides up to 50 times compression ratios.</p></blockquote>
<p>I thought they said 40X before. But even if my memory isn&#8217;t playing tricks regarding that, single-point compression ratio estimates are always very approximate.</p>
<blockquote><p>Any hardware component in an Oracle Grid can be dynamically added or removed as required. Disks can be added or removed online with ASM, with the data automatically rebalanced across the new disk infrastructure. Additional servers can also be easily added or removed to a Real Application Cluster with users connected to these nodes rebalanced across the infrastructure. This ability to migrate users from one server to another in a RAC cluster also enables rolling patching of the database software. If a patch needs to be applied, then a server can be removed from the cluster, patched, and then put back into the cluster. The same operation can be repeated for the next server in the cluster, and so on.</p></blockquote>
<p>Nice. And the paper goes on in that vein for quite a while.</p>
<blockquote><p>Oracle Total Recall, an option to Oracle Database 11g, provides a solution for the retention of historical information. With Oracle Total Recall, all changes made to data are kept to provide a complete change history of information. This means that auditors can not only see who did what when, but they can also see what the actual information was at the time – something that previously has only be [sic] available by building into the application, or by expensive backup retention policies.</p></blockquote>
<p>Timestamping/time-travel/whatever is increasingly becoming a standard feature as well, especially given the number of PostgreSQL-based DBMS on the market.</p>
<blockquote><p>New internal control requirements found in regulations can be difficult and expensive to implement in an environment with multiple applications. Oracle Database Vault, an option to Oracle Database 11g, allows access controls to be transparently applied underneath existing applications. Users can be prevented from accessing specific application data, or from accessing the database outside of normal hours; separation-of-duty requirements can be enforced for different Database Administrators without a costly least privilege exercise. And Oracle Advanced Security, an option to Oracle Database 11g, can be used to transparently encrypt data at all levels – data in transit on the network; data at rest on physical storage and in backups. Similarly, the Data Masking pack can be used to obfuscate data as it moves from production to development, reducing the potential violation of privacy regulations or risking sensitive data leaks.</p></blockquote>
<p>Oracle is the gold standard in database security.</p>
<blockquote><p>Oracle’s self-management approach takes two tacks. Firstly, wherever possible, repeatable, labor intensive and error prone tasks that can be fully automated in the database have been. For example, Storage Management, Memory Management, Statistics collection, Backup and Recovery, and SQL Tuning have all been automated. Secondly, where operations cannot be fully automated, intelligent advisors are built into the database to mentor Database Administrators on how to get the best out of their systems. Advisors are provided for Configuration Management, Patching, Indexing, Partitioning, Performance Diagnostics, Data Recovery, and, new in Oracle Database 11g Release 2, Compression and Maximum Availability.</p></blockquote>
<p>And boy are they needed.</p>
<blockquote><p>Recent studies performed by an independent research company shows that Database Administrators can expect to spend 26% less time managing their 11g environments over their 10g environments, and as much as 50% when compared to older Oracle9i deployments.</p></blockquote>
<p>50% of way too much is still way too much.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/09/21/notes-on-the-oracle-database-11g-release-2-white-paper/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
	</channel>
</rss>

