<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; Database compression</title>
	<atom:link href="http://www.dbms2.com/category/database-theory-practice/database-compression/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Thu, 02 Sep 2010 09:06:44 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>More on temp space, compression, and &#8220;random&#8221; I/O</title>
		<link>http://www.dbms2.com/2010/08/18/more-on-temp-space-compression-and-random-io/</link>
		<comments>http://www.dbms2.com/2010/08/18/more-on-temp-space-compression-and-random-io/#comments</comments>
		<pubDate>Wed, 18 Aug 2010 05:44:59 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[Vertica Systems]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2805</guid>
		<description><![CDATA[My PhD was in a probability-related area of mathematics (game theory), so I tend to squirm when something is described as &#8220;random&#8221; that clearly is not. That said, a comment by Shilpa Lawande on our recent Flash/temp space discussion suggests the following way of framing a key point:

You really, really want to have multiple data [...]]]></description>
			<content:encoded><![CDATA[<p>My PhD was in a probability-related area of mathematics (game theory), so I tend to squirm when something is described as &#8220;random&#8221; that clearly is not. That said, <a href="http://www.dbms2.com/2010/08/16/vertica-flash-temp-space/#comment-181134" >a comment by Shilpa Lawande</a> on our recent <a href="http://www.dbms2.com/2010/08/16/vertica-flash-temp-space/" >Flash/temp space discussion</a> suggests the following way of framing a key point:</p>
<ul>
<li>You really, really want to have multiple data streams coming out of temp space, as close to simultaneously as possible.</li>
<li>The storage performance characteristics of such a workload are more reminiscent of &#8220;random&#8221; than &#8220;sequential&#8221; I/O.</li>
</ul>
<p>If everybody else is cool with it too, I can live with that. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>Meanwhile, I talked again with Tim Vincent of IBM this afternoon. Tim endorsed the temp space/Flash fit, but with a different emphasis, which upon review I find I don&#8217;t really understand. The idea is:</p>
<ul>
<li>Analytic DBMS processing generally stresses reads over writes.</li>
<li>Temp space is an exception &#8212; read and write use of temp space is pretty balanced. (You spool data out once, you read it back in once, and that&#8217;s the end of that; next time it will be overwritten.)</li>
</ul>
<p>My problem with that is: Flash typically has lower write than read IOPS (I/O operations per second), so being (relatively) write-intensive would, to a first approximation, seem if anything to disfavor a workload for Flash.</p>
<p>On the plus side, I was reminded of something I should have noted when I wrote about <a href="http://www.dbms2.com/2010/06/21/netezza-ibm-db2-compression/" >DB2 compression</a> before:</p>
<p>Much like Vertica, <strong>DB2 operates on compressed data all the way through, including in temp space. </strong></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/08/18/more-on-temp-space-compression-and-random-io/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Vertica&#8217;s innovative architecture for Flash, plus more about temp space than you perhaps wanted to know</title>
		<link>http://www.dbms2.com/2010/08/16/vertica-flash-temp-space/</link>
		<comments>http://www.dbms2.com/2010/08/16/vertica-flash-temp-space/#comments</comments>
		<pubDate>Mon, 16 Aug 2010 08:07:33 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Vertica Systems]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2788</guid>
		<description><![CDATA[Vertica is announcing:

Technology it already has 	released*, but has not published any reference architectures 	for
A 	Barney partnership**

In other words, Vertica has succumbed to the common delusion that it&#8217;s a good idea to put out half-baked press releases the week of TDWI conferences. But if we look past that kind of all-too-common nonsense, Vertica is highlighting [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Vertica is announcing:</p>
<ul>
<li>Technology it has already released*, but for which it has not published any reference architectures</li>
<li><span style="font-style: normal;">A 	<a href="http://www.strategicmessaging.com/barney-partnerships/2010/08/12/" onclick="javascript:pageTracker._trackPageview('/www.strategicmessaging.com');">Barney</a> partnership**</span></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">In other words, Vertica has succumbed to the common delusion that it&#8217;s a good idea to put out half-baked press releases the week of TDWI conferences. </span>But if we look past that kind of all-too-common nonsense, <span style="font-weight: normal;">Vertica is highlighting an interesting technical story, about </span><strong>how the analytic DBMS industry can exploit solid-state memory technology.</strong></p>
<p style="margin-bottom: 0in;"><em>*Upgrades to <a href="../2009/08/04/flexstore-and-the-rest-of-vertica-35/">Vertica FlexStore</a> to handle Flash memory, actually released as part of <a href="../2010/02/22/vertica-4/">Vertica 4.0</a></em></p>
<p style="margin-bottom: 0in;"><em>** With Fusion I/O</em></p>
<p style="margin-bottom: 0in;">To set the context, let&#8217;s recall a few points I&#8217;ve noted in the past:</p>
<ul>
<li><a href="../2010/01/31/flash-pcmsolid-state-memory-disk/">Solid-state 	memory&#8217;s price/throughput tradeoffs obviously make it the future of 	database storage</a>.</li>
<li><a href="../2010/06/25/flash-is-coming-well/">The 	Flash future is coming soon</a>, in part because Flash&#8217;s propensity 	to wear out is overstated. This is especially true in the case of 	modern analytic DBMS, which tend to write to blocks all at once, and 	most particularly the case for append-only systems such as Vertica.</li>
<li><a href="../2010/08/12/teradata-future-product-strategy/">Being 	able to intelligently split databases among various cost tiers of 	storage – e.g. Flash and disk – makes a whole lot of sense</a>.</li>
</ul>
<p style="margin-bottom: 0in;">Taken together, those points tell us:</p>
<p style="margin-bottom: 0in;"><strong>For optimal price/performance, analytic DBMS should support databases that run part on Flash, part on disk.</strong></p>
<p style="margin-bottom: 0in;">While all this is a future for some other analytic DBMS vendors, Vertica is shipping it today.* What&#8217;s more, three aspects of Vertica&#8217;s architecture make it particularly well-suited for hybrid Flash/disk storage, in each case for a similar reason – you can get most of the performance benefit of all-Flash for a relatively low actual investment in Flash chips:  <span id="more-2788"></span></p>
<ul>
<li><strong>Vertica lets you split tables 	by column, </strong><span style="font-weight: normal;">and Vertica 	FlexStore is versatile enough to let you put only the most-used 	columns in Flash. (Vertica offers a figure that 85% of usage calls 	on only 15% of columns, but I don&#8217;t know how rigorously grounded 	those numbers are.)</span></li>
<li>To the extent that Vertica data is<span style="font-weight: normal;"> <a href="../2008/09/24/vertica-finally-spells-out-its-compression-claims/">more </a></span><a href="../2008/09/24/vertica-finally-spells-out-its-compression-claims/">compressed</a> than many of Vertica&#8217;s competitors&#8217; (which it probably is, debates 	over the magnitude of Vertica&#8217;s advantage notwithstanding), the 	total storage-hardware cost of sticking stuff in Flash is less when 	you use Vertica than with other systems.</li>
<li>Vertica has <span style="font-weight: normal;">relatively 	less need for </span><strong>temp space</strong> than some other systems. 	(Vertica uses figures of &lt;20% of total storage, vs. 30%+ for some 	other systems.) If you want to use Flash for temp space, so as to 	accelerate your toughest queries, that can save you some cash …</li>
<li>… and by the way, <strong>temp space 	is an especially good use of Flash, </strong>because <strong>temp space is 	accessed in a less sequential manner than data storage is.</strong></li>
</ul>
<p style="margin-bottom: 0in;">The least obvious of those points are about temp space; I only understood the particulars when Vertica development chief Shilpa Lawande explained them to me Thursday.</p>
<p style="margin-bottom: 0in;"><em>* At least in theory; customer adoption may be a different matter.</em></p>
<p style="margin-bottom: 0in;">But before drilling down on temp space, let me first note that there&#8217;s one offsetting factor to all those “We need somewhat less Flash than the other guys” Vertica advantages. Like all serious databases, a Vertica installation keeps two or more copies of all data, so that there&#8217;s no single point of failure in storage. In a flexible system like Vertica, you can put one copy on Flash and one on disk. But if you do that in Vertica, you forgo fully exploiting one possible benefit of Vertica&#8217;s architecture – the ability to store different copies of a column in different orders, which are beneficial for accelerating different groups of queries.*</p>
<p style="margin-bottom: 0in;"><em>*More precisely, you don&#8217;t get the full benefits of Flash acceleration for every query touching those columns.</em></p>
<p style="margin-bottom: 0in;">OK. Back to temp space. There are four kinds of things you can put in storage if you&#8217;re running a database management system:</p>
<ul>
<li>The <strong>software</strong> itself.</li>
<li><span style="font-weight: normal;">Persistent </span><strong>data. </strong><span style="font-weight: normal;">(I.e., tables, 	if the DBMS you&#8217;re running is relational.)</span></li>
<li><strong>Metadata,</strong> especially the 	kind that lets you find data &#8211;<strong> indexes,</strong> zone maps, catalogs, 	etc.</li>
<li><strong>Temporary data constructs</strong> built as part of, say, a sort-merge join. These, by definition, are what populate temp space.</li>
</ul>
<p style="margin-bottom: 0in;">Just to be clear, those constructs are NOT temporary tables of the sort created by, say, MicroStrategy; such tables are handled like any other data. Rather, they are ephemeral creations and, so far as I can tell, not tables at all.</p>
<p style="margin-bottom: 0in;">Vertica offered two theories as to why its DBMS requires less temp space than competitors do:</p>
<ul>
<li>To the extent data is decompressed before being operated on in memory by the DBMS, that decompression would of course apply to temp space as well. Vertica prides itself on <strong>keeping data compressed</strong> all the way through, and seems to get away with smaller temp space allocations as a result.</li>
<li>Since Vertica can store columns in 	expedient sort orders, it does less sorting overall, and sorting is 	a big use of temp space.</li>
</ul>
<p style="margin-bottom: 0in;">Obviously, no matter which DBMS you use, the amount of temp space you need is surely workload-dependent. Even so, Vertica&#8217;s claim to something of an advantage seems legit.</p>
<p style="margin-bottom: 0in;"><em>Truth be told, I&#8217;m not convinced the savings involved are great enough to </em>matter<em> a whole lot – but it&#8217;s a fun subject to think through. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </em></p>
<p style="margin-bottom: 0in;">And finally: One of my biggest surprises since starting to look at analytic-DBMS-on-Flash has been the centrality of temp space. Talking to Vertica Thursday, I finally uncovered a key reason why: <strong>Temp space tends to be accessed via multiple streams of data at once.</strong> I&#8217;m still struggling with WHY that is true, with two reasons suggested being:</p>
<ul>
<li>Temp space can be accessed by 	multiple operations at once. (But isn&#8217;t that also true of the rest 	of storage?)</li>
<li>Merge sorts, a common use of temp 	space, read multiple streams of data. (Couldn&#8217;t you tweak your 	software to make that not be true?)</li>
</ul>
<p style="margin-bottom: 0in;">But if we grant that temp space naturally is accessed in multiple places at once – well, that&#8217;s a lot like random I/O, and <a href="../2005/11/13/breaking-the-disk-speed-barrier/">if you&#8217;re doing a lot of random reads, you&#8217;d love to use something other than spinning disk</a>.</p>
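<p style="margin-bottom: 0in;">To make the merge-sort point concrete, here is a minimal Python sketch of an external-sort style merge. It is an illustration under stated assumptions, not a description of how Vertica or any other DBMS is implemented: the function names <code>spill_sorted_runs</code> and <code>merge_runs</code> are hypothetical, and in-memory lists stand in for runs spooled out to temp space.</p>

```python
import heapq
from typing import Iterable, Iterator, List


def spill_sorted_runs(values: Iterable[int], run_size: int) -> List[List[int]]:
    """Cut the input into fixed-size chunks and sort each one.

    Each sorted run stands in for a file spooled out to temp space.
    """
    items = list(values)
    return [sorted(items[i:i + run_size]) for i in range(0, len(items), run_size)]


def merge_runs(runs: List[List[int]]) -> Iterator[int]:
    """Merge all runs at once.

    heapq.merge keeps one cursor per run and always pulls the smallest head,
    so reads hop between runs instead of walking a single run end to end.
    """
    return heapq.merge(*runs)


runs = spill_sorted_runs([9, 4, 7, 1, 8, 2, 6, 3, 5], run_size=3)
print(runs)                    # [[4, 7, 9], [1, 2, 8], [3, 5, 6]]
print(list(merge_runs(runs)))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

<p style="margin-bottom: 0in;">In the toy run above, the merged output draws from the second run, then the third, then the first, and so on. One could merge only two runs at a time to make each pass look more sequential, but that generally means more passes over the data, which is the usual reason for merging many runs at once.</p>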
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/08/16/vertica-flash-temp-space/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>The Netezza and IBM DB2 approaches to compression</title>
		<link>http://www.dbms2.com/2010/06/21/netezza-ibm-db2-compression/</link>
		<comments>http://www.dbms2.com/2010/06/21/netezza-ibm-db2-compression/#comments</comments>
		<pubDate>Mon, 21 Jun 2010 12:05:47 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[Microsoft and SQL*Server]]></category>
		<category><![CDATA[Netezza]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2320</guid>
		<description><![CDATA[Thursday, I spent 3 ½ hours talking with 10 of Netezza&#8217;s more senior engineers. Friday, I talked for 1 ½ hours with IBM Fellow and DB2 Chief Architect Tim Vincent, and we agreed we needed at least 2 hours more. In both cases, the compression part of the discussion seems like a good candidate to [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Thursday, <a href="http://www.dbms2.com/2010/06/21/netezza-database-software-technology-overview/" >I spent 3 ½ hours talking with 10 of Netezza&#8217;s more senior engineers</a>. Friday, I talked for 1 ½ hours with IBM Fellow and DB2 Chief Architect Tim Vincent, and we agreed we needed at least 2 hours more. In both cases, the compression part of the discussion seems like a good candidate to split out into a separate post. So here goes.</p>
<p style="margin-bottom: 0in;">When you sell a row-based DBMS, as Netezza and IBM do, there are a couple of approaches you can take to compression. First, you can compress the blocks of rows that your DBMS naturally stores. Second, you can compress the data in a column-aware way. Both Netezza and IBM have chosen completely column-oriented compression, with no block-based techniques entering the picture to my knowledge. But that&#8217;s about as far as the similarity between Netezza and IBM compression goes.  <span id="more-2320"></span></p>
<p style="margin-bottom: 0in;"><strong>IBM&#8217;s basic DB2 compression strategy</strong> is remarkably simple. In every table (not column) – or in each range partition in a range-partitioned table &#8212; <strong>the 4096 most common* values are identified; these are all encoded into 12-bit strings</strong>. And that&#8217;s that. This has been happening since DB2 9.1, released 4 ½ years ago. DB2&#8217;s compression persists through logs, buffer pools (i.e., RAM cache), and so on. In DB2 9.7, the most recent release, IBM extended the use of the compression to a few areas it hadn&#8217;t stretched before, such as log-based replication, native XML, or CLOBs (Character Large OBjects) that happen not to be too big.</p>
<p style="margin-bottom: 0in;"><em>*Actually, I&#8217;d presume it&#8217;s not exactly the “most common”; there surely is some minimum length of a value to be encoded, or some bias toward length. Also, the determination of what to encode is probably a little imprecise. E.g., I forgot to ask whether the choice of values ever changes as data gets updated.</em></p>
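<p style="margin-bottom: 0in;">As a rough illustration of the idea, the Python sketch below builds a dictionary of the most frequent values and swaps them for small integer codes. It is a simplification under loud assumptions: it handles only whole values in a single list, it picks purely by frequency, and the names <code>build_dictionary</code>, <code>encode</code> and <code>decode</code> are hypothetical. Nothing here is IBM&#8217;s actual implementation.</p>

```python
from collections import Counter
from typing import Dict, List, Union

DICT_SIZE = 4096  # 2**12 entries, so every code fits in 12 bits

Token = Union[int, str]  # int = dictionary code, str = value left as-is


def build_dictionary(values: List[str], size: int = DICT_SIZE) -> Dict[str, int]:
    """Map the most frequent values to small integer codes."""
    most_common = Counter(values).most_common(size)
    return {value: code for code, (value, _count) in enumerate(most_common)}


def encode(values: List[str], dictionary: Dict[str, int]) -> List[Token]:
    """Replace each value that has a code; pass the rest through unchanged."""
    return [dictionary.get(v, v) for v in values]


def decode(tokens: List[Token], dictionary: Dict[str, int]) -> List[str]:
    """Invert encode(): codes become values again, raw strings stay put."""
    reverse = {code: value for value, code in dictionary.items()}
    return [reverse[t] if isinstance(t, int) else t for t in tokens]


# A tiny dictionary (size=2) makes the pass-through case visible.
column = ["Boston", "Austin", "Boston", "Denver", "Boston", "Austin"]
dictionary = build_dictionary(column, size=2)
tokens = encode(column, dictionary)
print(tokens)  # [0, 1, 0, 'Denver', 0, 1]
assert decode(tokens, dictionary) == column
```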
<p style="margin-bottom: 0in;">The sophisticated part of DB2&#8217;s simple compression strategy is its breadth of applicability; DB2 compression can apply to:</p>
<ul>
<li>Values in columns (numeric, 	character, whatever)</li>
<li>Substrings of values in columns</li>
<li>Groups of columns (e.g., 	city/state/zip code)</li>
</ul>
<p style="margin-bottom: 0in;">Except for the 4096 values limit, that sounds at least as flexible as the <a href="http://www.dbms2.com/2009/05/14/the-secret-sauce-to-clearpaces-compression/" >Rainstor/Clearpace compression approach</a>.</p>
<p style="margin-bottom: 0in;"><strong>Netezza,</strong> unlike IBM, takes a grab-bag approach to compression – try out a bunch of techniques, see which work best, and incorporate those in the product. <a href="http://www.enzeecommunity.com/blogs/nzblog/2008/05/15/issue-19-the-compress-engine-the-netezza-philosophy" onclick="javascript:pageTracker._trackPageview('/www.enzeecommunity.com');">Netezza first introduced compression a couple of years ago,</a> for numeric columns only, especially integer.  Techniques  used in Netezza numeric compression include but are not limited to:</p>
<ul>
<li>Delta compression, wherein you 	store the increment between a value and its predecessor rather than 	a whole new value.</li>
<li>Ways of indicating that a value or 	increment was just the same as in the row before.</li>
</ul>
<p style="margin-bottom: 0in;">This was via something called Compress Engine,* now being renamed to Compress Engine 1. Netezza&#8217;s new Compress Engine 2 improves on what Netezza did in Compress Engine 1 for numeric data, most notably by trimming away excess field length. (Netezza says it got 28% better compression on a test data set with almost no character strings, primarily from that enhancement.) Further, Netezza Compress Engine 2 adds new compression techniques, allowing it to handle VARCHAR – i.e. character strings &#8212; as well.</p>
<p style="margin-bottom: 0in;"><em>*Fortunately, the original name or at least description of “Compiled Tables” is retreating ever more from view.</em></p>
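<p style="margin-bottom: 0in;">For readers who want the delta technique from the list above spelled out, here is a minimal Python sketch. It shows the generic textbook method, not Netezza&#8217;s FPGA implementation; <code>delta_encode</code> and <code>delta_decode</code> are hypothetical names, and the sample order IDs are made up.</p>

```python
from typing import List


def delta_encode(values: List[int]) -> List[int]:
    """Keep the first value, then store each value minus its predecessor."""
    if not values:
        return []
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas


def delta_decode(deltas: List[int]) -> List[int]:
    """Rebuild the original values with a running sum."""
    values: List[int] = []
    total = 0
    for d in deltas:
        total += d
        values.append(total)
    return values


order_ids = [1000, 1001, 1002, 1002, 1005]
encoded = delta_encode(order_ids)
print(encoded)  # [1000, 1, 1, 0, 3]
assert delta_decode(encoded) == order_ids
```

<p style="margin-bottom: 0in;">The small increments need far fewer bits than the full values, and a delta of 0 is one simple way to say that a value is the same as in the row before.</p>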
<p style="margin-bottom: 0in;">Netezza&#8217;s Compress Engine 2 has two ways to compress character fields/text strings – <strong>prefix compression </strong><span style="font-weight: normal;">and </span><strong>Huffman coding.</strong> By way of contrast, Netezza tested suffix compression and decided it wasn&#8217;t beneficial enough to bother messing with.</p>
<ul>
<li>The idea behind prefix compression 	is that if two strings start with the same characters, for the 	second one you only have to record the part that&#8217;s different. Prefix 	compression has a lot of the same merits as delta compression; like 	delta compression, it works best on sorted columns. (An example of 	where prefix compression makes obvious sense is URLs, which tend to 	all start in similar ways.)</li>
<li>In Netezza&#8217;s version of Huffman coding, the alphabet is encoded symbol-by-symbol, with more common characters getting codes of shorter length. These codes are chosen on a column-by-column basis. (I presume the “/” character gets a shorter code in a URL column than it would, for example, in one that stored addresses.)</li>
</ul>
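<p style="margin-bottom: 0in;">Both ideas can be sketched in a few lines of Python. This is a generic illustration, assuming a sorted list of strings for prefix compression and plain character frequencies for Huffman coding. The function names are hypothetical and the sample URLs are made up; Netezza&#8217;s actual encodings were not described at this level of detail.</p>

```python
import heapq
from collections import Counter
from typing import Dict, List, Tuple


def shared_prefix_len(a: str, b: str) -> int:
    """Count how many leading characters two strings have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_encode(sorted_values: List[str]) -> List[Tuple[int, str]]:
    """Store (length shared with the previous string, remaining suffix)."""
    out: List[Tuple[int, str]] = []
    prev = ""
    for s in sorted_values:
        k = shared_prefix_len(prev, s)
        out.append((k, s[k:]))
        prev = s
    return out


def prefix_decode(pairs: List[Tuple[int, str]]) -> List[str]:
    """Rebuild each string from the previous one plus its stored suffix."""
    out: List[str] = []
    prev = ""
    for k, suffix in pairs:
        prev = prev[:k] + suffix
        out.append(prev)
    return out


def huffman_code_lengths(text: str) -> Dict[str, int]:
    """Return the Huffman code length, in bits, for each symbol in text."""
    counts = Counter(text)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (weight, tie breaker, symbols in this subtree).
    heap = [(w, i, [sym]) for i, (sym, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in counts}
    tie = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for sym in s1 + s2:
            lengths[sym] += 1  # merged subtrees sit one level deeper
        heapq.heappush(heap, (w1 + w2, tie, s1 + s2))
        tie += 1
    return lengths


urls = ["http://a.com/x", "http://a.com/y", "http://b.com/"]
print(prefix_encode(urls))
# [(0, 'http://a.com/x'), (13, 'y'), (7, 'b.com/')]
print(huffman_code_lengths("aaaabbc"))  # {'a': 1, 'b': 2, 'c': 2}
```

<p style="margin-bottom: 0in;">Running <code>huffman_code_lengths</code> once per column is what lets a character such as “/” get a short code in one column and a longer one in another.</p>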
<p style="margin-bottom: 0in;">While I didn&#8217;t ask explicitly, it seems pretty obvious that Compress Engine 2&#8217;s functionality is a strict superset of Compress Engine 1&#8217;s. <a href="http://www.dbms2.com/2010/06/21/netezza-silicon-balance/" >Netezza is going to run Compress Engines 1 and 2 side by side</a>, but expects pages to move from Compress Engine 1&#8217;s purview to Compress Engine 2&#8217;s as part of the new “table grooming” process.</p>
<p><em><strong>Related links</strong></em></p>
<ul>
<li>IBM kindly permitted me to post some of <a href="http://www.monash.com/uploads/ibm-db2-compression-june-2010.pdf" onclick="javascript:pageTracker._trackPageview('/www.monash.com');">its slides in the area of compression</a></li>
<li><a href="http://msdn.microsoft.com/en-us/library/cc280464.aspx" onclick="javascript:pageTracker._trackPageview('/msdn.microsoft.com');">Microsoft SQL Server seems to rely on prefix and dictionary compression</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/06/21/netezza-ibm-db2-compression/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Netezza&#8217;s silicon balance</title>
		<link>http://www.dbms2.com/2010/06/21/netezza-silicon-balance/</link>
		<comments>http://www.dbms2.com/2010/06/21/netezza-silicon-balance/#comments</comments>
		<pubDate>Mon, 21 Jun 2010 12:00:12 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2321</guid>
		<description><![CDATA[As I&#8217;ve mentioned in a couple of other posts, Netezza is stressing that the most recent wave of its technology is software-only, with no hardware upgrades made or needed. In other words, Netezza boxes already have all the silicon they need. But of course, there are really at least three major aspects to the Netezza [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">As I&#8217;ve mentioned in a couple of other posts, Netezza is stressing that the most recent <a href="http://www.dbms2.com/2010/06/21/netezza-database-software-technology-overview/" >wave</a> of its technology is software-only, with no hardware upgrades made or needed. In other words, Netezza boxes already have all the silicon they need. But of course, there are really at least three major aspects to the Netezza silicon story – FPGA (Field-Programmable Gate Array), CPU, and RAM.</p>
<ul>
<li>Netezza planned to be “generous” 	in its original TwinFin FPGA capacity, anticipating software 	upgrades like the ones it&#8217;s introducing now. It is satisfied that 	this strategy worked. More on this below.</li>
<li>The same surely applies to CPU.</li>
<li>What&#8217;s more, I get the sense that 	the CPU turned out in practice to be even more over-provisioned than 	they anticipated …</li>
<li>… at least when one just 	considers Netezza&#8217;s base NPS software.</li>
<li>However, I suspect that if the 	advanced analytics capability takes off, Netezza will determine that 	more CPU is always better.</li>
<li>And by the way, NEC is making 	versions of Netezza appliances with more advanced chips than Netezza 	is. So if anybody should really, really need more CPU in their 	Netezza boxes, there&#8217;s a very straightforward way to make that 	happen. (And if there were nontrivial demand for that, appropriate 	support plans could surely be structured.)</li>
<li>Everybody needs to be careful 	about RAM. Netezza is surely no exception.</li>
</ul>
<p style="margin-bottom: 0in;">The major parts of Netezza&#8217;s FPGA software are:</p>
<ul>
<li><strong>Compress Engine 2.</strong> This is 	Netezza&#8217;s new way of doing compression.</li>
<li><strong>Compress Engine 1.</strong> This is 	Netezza&#8217;s old way of doing compression. It is being kept around so 	that existing Netezza tables don&#8217;t suddenly have to be changed or 	reloaded.</li>
<li><strong>Project Engine.</strong> Guess what 	this does.</li>
<li><strong>Restrict Engine.</strong> Ditto.</li>
<li><strong>Visibility Engine.</strong> This <a href="http://www.dbms2.com/2006/09/27/logless-lockless-netezza-more-carefully-explained/" >enforces ACID</a> and handles row-level security. It is “sort of a corner of” the Restrict Engine. (Actually, Netezza seems to waver as to whether to describe “Restrict” and “Visibility” as being two engines or one.)</li>
<li>Miscellaneous plumbing.</li>
</ul>
<p style="margin-bottom: 0in;">If I understood correctly, each Netezza FPGA has two copies of each engine, running in parallel.</p>
<p style="margin-bottom: 0in;"><em><strong>Related link</strong></em></p>
<ul>
<li>An August, 2009 post on <a href="http://www.dbms2.com/2009/08/08/netezza-fpga/" >what Netezza does in its FPGA</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/06/21/netezza-silicon-balance/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The underlying technology of QlikView</title>
		<link>http://www.dbms2.com/2010/06/12/the-underlying-technology-of-qlikview/</link>
		<comments>http://www.dbms2.com/2010/06/12/the-underlying-technology-of-qlikview/#comments</comments>
		<pubDate>Sat, 12 Jun 2010 10:00:49 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Memory-centric data management]]></category>
		<category><![CDATA[QlikTech and QlikView]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2282</guid>
		<description><![CDATA[QlikTech* finally decided both to become a client and, surely not coincidentally, to give me more technical detail about QlikView than it had when last we talked a couple of years ago. Indeed, I got to spend a couple of hours on the phone not just with Anthony Deighton, but also with QlikTech&#8217;s Hakan Wolge, [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">QlikTech* finally decided both to become a client and, surely not coincidentally, to give me more technical detail about QlikView than it had <a href="../2008/08/04/qliktech-qlikview-update/">when last we talked a couple of years ago</a>. Indeed, I got to spend a couple of hours on the phone not just with Anthony Deighton, but also with QlikTech&#8217;s Hakan Wolge, who wrote 70-80% of the code in QlikView 1.0, and remains in effect QlikTech&#8217;s chief architect to this day.</p>
<p style="margin-bottom: 0in;"><em>*Or, as it now appears to be called, Qlik Technologies.</em></p>
<p style="margin-bottom: 0in;">Let&#8217;s start with some quick reminders:</p>
<ul>
<li>QlikTech makes QlikView, a widely popular business intelligence (BI) tool suite.</li>
<li>QlikView is distinguished by <strong>the 	flexibility of navigation</strong> through its user interface.</li>
<li>To support this flexibility, 	<strong>QlikView preloads all data you might want to query into memory.</strong></li>
</ul>
<p style="margin-bottom: 0in;">Let&#8217;s also dispose of one confusion right up front, namely QlikTech&#8217;s use of the word <strong>associative:  <span id="more-2282"></span><br />
</strong></p>
<ul>
<li>Notwithstanding QlikT<span style="font-style: normal;">ech&#8217;s 	repeated use of phrases like “</span><em><span style="font-style: normal;">QlikView&#8217;s</span></em><span style="font-style: normal;"> unique, patented in-</span><em><span style="font-style: normal;">memory 	associative</span></em><span style="font-style: normal;"> technology,” </span><span style="font-style: normal;"><strong>there is 	nothing “associative” about QlikView&#8217;s data structures.</strong></span><span style="font-style: normal;"> </span></li>
<li><span style="font-style: normal;">Rather, 	“associative” is a term that can reasonably be used to describe 	the functionality of QlikView&#8217;s user interface. In particular, 	QlikView can “associate” over fields that have the same name, in 	that it makes it easy for users to join across them.</span></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">With that out of the way, let&#8217;s turn to some highlights of QlikView&#8217;s underlying technology.</span></p>
<ul>
<li><span style="font-style: normal;"><span style="font-weight: normal;">For the most part, QlikView&#8217;s in-memory 	data structures are quite simple. In particular:</span></span>
<ul>
<li><span style="font-style: normal;"><span style="font-weight: normal;">QlikView 	data is stored in a </span></span><strong><span style="font-style: normal;">straightforward 	tabular format.</span></strong></li>
<li><span style="font-style: normal;">QlikView 	data is compressed via what QlikTech calls a “symbol table,” but 	I generally call </span><span style="font-style: normal;"><strong>“dictionary” </strong></span><span style="font-style: normal;"><span style="font-weight: normal;">or</span></span><span style="font-style: normal;"><strong> “token” compression.</strong></span></li>
<li><span style="font-style: normal;">QlikView 	typically gets at its data via </span><span style="font-style: normal;"><strong>scans.</strong></span><span style="font-style: normal;"> There is very little in the way of precomputed aggregates, indexes, 	and the like. Of course, if the selection happens to be in line with 	the order in which the records are sorted, you can get great 	selectivity in a scan.</span></li>
<li><span style="font-style: normal;">One 	advantage of doing token compression is that all the fields in a 	column wind up being the same length. Thus, QlikView holds its data 	in nice </span><span style="font-style: normal;"><strong>arrays,</strong></span><span style="font-style: normal;"> so the addresses of individual rows can often be easily calculated.</span></li>
</ul>
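<p style="margin-bottom: 0in;">A minimal Python sketch of token compression into a fixed-width array follows. It is a guess at the general shape, not QlikView&#8217;s actual data structure: the names <code>tokenize_column</code>, <code>row_offset</code> and <code>scan_equals</code> are hypothetical, and the 16-bit code width is an assumption chosen for the example.</p>

```python
from array import array
from typing import Dict, List, Tuple


def tokenize_column(values: List[str]) -> Tuple[List[str], array]:
    """Return (symbol table, fixed-width codes) for one column."""
    symbols: List[str] = []
    index: Dict[str, int] = {}
    codes = array("H")  # one unsigned 16-bit code per row (assumed width)
    for v in values:
        if v not in index:
            index[v] = len(symbols)
            symbols.append(v)
        codes.append(index[v])
    return symbols, codes


def row_offset(row: int, codes: array) -> int:
    """Byte offset of a row: all codes share one width, so it is a multiply."""
    return row * codes.itemsize


def scan_equals(symbols: List[str], codes: array, wanted: str) -> List[int]:
    """Scan the code array for rows holding one value; no index involved."""
    if wanted not in symbols:
        return []
    target = symbols.index(wanted)
    return [row for row, code in enumerate(codes) if code == target]


regions = ["East", "West", "East", "North", "West"]
symbols, codes = tokenize_column(regions)
print(symbols)                              # ['East', 'West', 'North']
print(list(codes))                          # [0, 1, 0, 2, 1]
print(scan_equals(symbols, codes, "West"))  # [1, 4]
```

<p style="margin-bottom: 0in;">The scan compares small integers rather than strings, and the variable-length values live only once, in the symbol table.</p>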
</li>
<li><span style="font-style: normal;">To 	get its UI flexibil</span><span style="font-style: normal;"><span style="font-weight: normal;">ity, 	QlikView implicitly assumes a </span></span><span style="font-style: normal;"><strong>star/snowflake 	schema.</strong></span><span style="font-style: normal;"> That is, there 	should be no more and no less than </span><span style="font-style: normal;"><strong>one 	possible join path between any pair of tables.</strong></span><span style="font-style: normal;"> In some cases, this means one will want to rename fields as part of 	QlikView load scripts. For example,</span>
<ul>
<li><span style="font-style: normal;">If 	two keys are meant to be joined on, you might want to give them the 	same name.</span></li>
<li><span style="font-style: normal;">If 	two columns have the same name and mean different things (e.g., 	different kinds of dates), you can give them different names.</span></li>
<li><span style="font-style: normal;">You 	can mark which columns you do or don&#8217;t want to have “qualified” 	names – i.e., table-specific modifications that force the names to 	be unique.</span></li>
</ul>
</li>
<li><span style="font-style: normal;">QlikView 	is designed for </span><span style="font-style: normal;"><strong>gigabytes-scale 	databases.</strong></span><span style="font-style: normal;"> (More 	precisely, it&#8217;s constrained by how much RAM you can address in a 	single box, and that&#8217;s how the numbers currently work out.) In 	particular:</span>
<ul>
<li><span style="font-style: normal;">QlikTech 	recommends 2-4 gigabytes of compressed data per core. QlikTech says 	10X is a good rule of thumb for compression, although it sounded 	like that&#8217;s a little (not a lot) on the high side when compared 	simply to raw data.</span></li>
<li><span style="font-style: normal;">QlikTech 	further recommends RAM amounting to another 10% of data size be set 	aside for each concurrent user (e.g., for cache). However, Hakan 	said that&#8217;s really too pessimistic, and in most cases 5% would 	suffice.</span></li>
<li><span style="font-style: normal;">Bottom 	line: QlikView “comfortably” handles databases with </span><span style="font-style: normal;"><strong>10-20 	gigabytes of compressed data,</strong></span><span style="font-style: normal;"> at whatever product of record count and record length you like. 	(E.g., 1 billion relatively narrow records.) That&#8217;s on the order of </span><span style="font-style: normal;"><strong>100 gigabytes of raw 	data.</strong></span></li>
<li><span style="font-style: normal;">Indeed, 	several QlikView customers manage several billion records each.</span></li>
</ul>
</li>
<li><span style="font-style: normal;">The 	main ingredient of the performance secret sauce in QlikView is that </span><span style="font-style: normal;"><strong>selections are compiled 	straight into machine code.</strong></span><span style="font-style: normal;"> (QlikTech gave me the impression that this post is the first time 	that will be publicly revealed.) Notes on that include:</span>
<ul>
<li><span style="font-style: normal;">In 	the old days, QlikTech thought compilation gave a 10X performance 	benefit vs. interpreted code. However, 5X might be a more up-to-date 	figure.</span></li>
<li><span style="font-style: normal;">It&#8217;s 	not just code; part of the compilation is to create temporary lookup 	tables.</span></li>
<li><span style="font-style: normal;">A 	single calculation can use multiple cores. QlikTech thinks it&#8217;s done 	a very solid job of engineering efficient multicore parallelism. </span><span style="font-style: normal;"><em>(Note: So far as I could tell, Hakan was using 	“calculation” to refer both to queries and, well, calculations.)</em></span></li>
<li><span style="font-style: normal;">There&#8217;s 	a good reason QlikView runs only on Intel-compatible processors. A 	port would be painful.</span></li>
</ul>
</li>
<li><span style="font-style: normal;">In 	QlikView&#8217;s world, one set of users accesses one set of applications 	against one database on one machine. However, different subsets (or 	copies of the same subset) of the same underlying database(s) can of 	course be run on different machines.</span></li>
<li><span style="font-style: normal;">Naturally, 	QlikView caches results and tries to re-use them. One smart thing 	about QlikView&#8217;s caching algorithm is that it takes into account the 	cost of generating the calculated results. This has the happy effect 	that large result sets, which are often the ones most likely to be 	useful in a subsequent calculation, are the ones most likely to be 	retained. </span></li>
<li><span style="font-style: normal;">One 	thing I unfortunately forgot to ask about is loading QlikView data 	into memory, something that has at times been <a href="../2007/09/27/a-negative-take-on-qlikview/">problematic</a>.</span></li>
</ul>
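<p style="margin-bottom: 0in;">To make the token-compression point above concrete, here is a minimal Python sketch. It is only an illustration of dictionary encoding in general, not QlikTech's actual implementation; the function names (<em>token_encode</em>, <em>scan_equals</em>, <em>row_offset</em>) and the sample city values are hypothetical. Each distinct value goes into a symbol table once, and the column itself becomes an array of fixed-width integer tokens, so a row's position is a simple multiplication and a scan compares small integers rather than strings.</p>
<pre>
```python
from array import array


def token_encode(values):
    """Dictionary-encode a column; return (symbols, tokens)."""
    symbols = []          # distinct values, in first-seen order
    index = {}            # value to token lookup
    tokens = array("I")   # fixed-width unsigned ints, one per row
    for v in values:
        if v not in index:
            index[v] = len(symbols)
            symbols.append(v)
        tokens.append(index[v])
    return symbols, tokens


def row_offset(row, tokens):
    """Byte offset of a row: every token has the same width."""
    return row * tokens.itemsize


def scan_equals(symbols, tokens, wanted):
    """Full scan for rows equal to `wanted`, comparing tokens only."""
    if wanted not in symbols:
        return []
    target = symbols.index(wanted)
    return [row for row, t in enumerate(tokens) if t == target]


# Hypothetical sample column
symbols, tokens = token_encode(
    ["Boston", "Austin", "Boston", "San Diego", "Austin"]
)
matches = scan_equals(symbols, tokens, "Austin")   # rows 1 and 4
```
</pre>
<p style="margin-bottom: 0in;">The sketch deliberately has no index at all: selection is a plain scan over the token array, which is the access pattern described in the bullets above.</p>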
<p style="margin-bottom: 0in;"><span style="font-style: normal;">One last thing: QlikTech is going public. That means there is a <a href="http://sec.gov/Archives/edgar/data/1305294/000095012310031429/b80142sv1.htm" onclick="javascript:pageTracker._trackPageview('/sec.gov');">QlikTech S-1</a>, from which I learned, among other things, that QlikTech now seems to be called Qlik Technologies. Dave Kellogg offers an outstanding <a href="http://www.kellblog.com/2010/04/12/thoughts-on-the-qlik-technologies-qliktech-ipo/" onclick="javascript:pageTracker._trackPageview('/www.kellblog.com');">overview of the information in QlikTech&#8217;s filing(s)</a>. The points I&#8217;d add to Dave&#8217;s are primarily from the QlikTech balance sheet:</span></p>
<ul>
<li><span style="font-style: normal;">Deferred 	revenue, which Dave calls out as high in absolute terms, is also 	growing faster than revenue (or any major component of revenue).</span></li>
<li><span style="font-style: normal;">Accounts 	receivable are also growing faster than revenue or any major 	component thereof.</span></li>
<li><span style="font-style: normal;">One 	possible explanation is weirdness with international distributors, 	which is at least potentially consistent with what QlikTech says is 	a shift in geographical mix.</span></li>
<li><span style="font-style: normal;">Another 	explanation is increasing deal size/complexity, something that is 	anyway common among enterprise software companies gaining market 	share, and that is also consistent with what QlikTech says is a 	growing fraction of revenue coming from existing customers.</span></li>
</ul>
<p style="margin-bottom: 0in;">
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/06/12/the-underlying-technology-of-qlikview/feed/</wfw:commentRss>
		<slash:comments>20</slash:comments>
		</item>
		<item>
		<title>Ingres VectorWise technical highlights</title>
		<link>http://www.dbms2.com/2010/06/11/ingres-vectorwise-technical-highlights/</link>
		<comments>http://www.dbms2.com/2010/06/11/ingres-vectorwise-technical-highlights/#comments</comments>
		<pubDate>Fri, 11 Jun 2010 11:28:18 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Benchmarks and POCs]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Ingres]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[VectorWise]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2261</guid>
		<description><![CDATA[After working through problems w/ travel, cell phones, and so on, Peter Boncz of VectorWise finally caught up with me for a regrettably brief call. Peter gave me the strong impression that what I&#8217;d written in the past about VectorWise had been and remained accurate, so I focused on filling in the gaps. Highlights included:  [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">After working through problems w/ travel, cell phones, and so on, Peter Boncz of VectorWise finally caught up with me for a regrettably brief call. Peter gave me the strong impression that what <a href="http://www.dbms2.com/2009/08/04/vectorwise-ingres-and-monetdb/" >I&#8217;d written in the past about VectorWise</a> had been and remained accurate, so I focused on filling in the gaps. Highlights included:  <span id="more-2261"></span></p>
<ul>
<li>VectorWise is indeed a 	shared-everything analytic DBMS.</li>
<li>The VectorWise front-end is 	Ingres. Ingres VectorWise supports almost all SQL that Ingres does (there 	are a few edge-case exceptions).</li>
<li>Conversely, Ingres VectorWise 	doesn&#8217;t support any SQL Ingres doesn&#8217;t, most notably SQL-99 	Analytics. Naturally, SQL-99 Analytics is a roadmap item for 	Ingres/VectorWise.</li>
<li>Ingres VectorWise 1.0 is pretty 	purely columnar. There&#8217;s a bit of <a href="http://www.dbms2.com/2009/08/04/pax-analytica-row-and-column-stores-begin-to-come-together/" >PAX</a>, but it&#8217;s mainly 	automagic/under the covers. The one user-controlled exception I 	understood was that one can ensure that composite keys are stored 	together.</li>
<li>The main Ingres VectorWise 	performance secret sauce ingredients we touched on were:
<ul>
<li>Vectorization of operations (hence VectorWise&#8217;s name).</li>
<li>Compression that is tuned for 	speed rather than to minimize storage utilization.</li>
</ul>
</li>
<li>We unfortunately didn&#8217;t have time 	to revisit the other big part of the Ingres VectorWise performance 	story, namely clever design for modern microprocessor architectures. 	High-level generalities about that do pervade <a href="http://www.dbms2.com/2010/06/10/vectorwise-press-release/" >the Ingres 	VectorWise press release</a>,<span style="font-style: normal;"> but – 	well, they&#8217;re very high level.</span></li>
<li>Unlike Vertica but like most other 	columnar DBMS vendors, Ingres VectorWise wants you to store your 	data once. You can index-organize the data. You can also organize 	multiple tables in the same order, to make joins among them fast.</li>
<li>Support for actual join indexes is an Ingres VectorWise roadmap item.</li>
<li>As do ever more analytic DBMS, 	Ingres VectorWise has something akin to <a href="http://www.dbms2.com/2006/09/20/netezza-vs-conventional-data-warehousing-rdbms/" >Netezza zone maps</a>.</li>
<li>When I asked Peter what had changed most from the initial VectorWise development plan, other than the above, he basically said that their performance priorities had shifted a bit. Specifically, he said:
<ul>
<li>They had 	originally been “blinded” (his word) by the TPC-H benchmark, but 	figured out that they were overly focused on it. (<a href="http://www.dbms2.com/2009/06/22/the-tpc-h-benchmark-is-a-blight-upon-the-industry/" >Well, duh</a>.)</li>
<li>They learned 	about the importance of other things such as data loading speeds.</li>
</ul>
</li>
</ul>
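<p style="margin-bottom: 0in;">For readers unfamiliar with the zone-map idea mentioned above, here is a rough Python sketch of the general technique: keep the minimum and maximum value for each block of a column, and skip any block whose range cannot satisfy a range predicate. This is an assumption-laden illustration, not a description of how Ingres VectorWise or Netezza actually does it; the function names and the block size are hypothetical.</p>
<pre>
```python
def build_zone_map(column, block_size):
    """Record (min, max) for each fixed-size block of a column."""
    zones = []
    for start in range(0, len(column), block_size):
        block = column[start:start + block_size]
        zones.append((min(block), max(block)))
    return zones


def blocks_to_scan(zones, low, high):
    """Indexes of blocks whose [min, max] range overlaps [low, high]."""
    return [i for i, (lo, hi) in enumerate(zones)
            if hi >= low and high >= lo]


# Hypothetical, roughly sorted column split into blocks of 3 values
column = [1, 2, 3, 10, 11, 12, 20, 21, 22]
zones = build_zone_map(column, 3)
needed = blocks_to_scan(zones, 11, 20)   # the first block is skipped
```
</pre>
<p style="margin-bottom: 0in;">The benefit depends on the data being at least loosely ordered on the filtered column; on randomly ordered data every block's range tends to overlap the predicate and nothing is skipped.</p>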
<p style="margin-bottom: 0in;">
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/06/11/ingres-vectorwise-technical-highlights/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Algebraix</title>
		<link>http://www.dbms2.com/2010/06/05/algebraix/</link>
		<comments>http://www.dbms2.com/2010/06/05/algebraix/#comments</comments>
		<pubDate>Sat, 05 Jun 2010 08:54:59 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Algebraix]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Infobright]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2227</guid>
		<description><![CDATA[I talked Friday with Chris Piedemonte and Gary Sherman, respectively the Cofounder/CTO and Chief Mathematician of Algebraix, who hooked up together for this project back in 2003 or 2004. (Algebraix is the company formerly known as XSPRADA.) Algebraix makes an analytic DBMS, somewhat based on the ideas of extended set theory, that runs on SMP [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">I talked Friday with Chris Piedemonte and Gary Sherman, respectively the Cofounder/CTO and Chief Mathematician of Algebraix, who hooked up together for this project back in 2003 or 2004. (Algebraix is the company formerly known as XSPRADA.) Algebraix makes an analytic DBMS, somewhat based on the ideas of <a href="../2010/06/05/extended-set-theory/">extended set theory</a>, that runs on SMP (Symmetric MultiProcessing) boxes. Like all analytic DBMS vendors, Algebraix has on some occasions run some queries orders of magnitude faster than they ran on the systems users were looking to replace.</p>
<p style="margin-bottom: 0in;">Algebraix&#8217;s secret sauce is that the DBMS keeps reorganizing and recopying the data on disk, to optimize performance in response to expected query patterns (automatically inferred from queries it&#8217;s seen so far). This sounds a lot like the Infobright story, with some of the more obvious differences being:  <span id="more-2227"></span></p>
<ul>
<li>Infobright has fixed data 	structures, with what serve for indexes and precomputed aggregates 	added on the fly. Algebraix apparently reorganizes everything, 	including data partitions. (I also presume that Algebraix&#8217;s indexes 	and aggregates look rather different than Infobright&#8217;s data packs.)</li>
<li>Infobright compresses data 	aggressively. Algebraix doesn&#8217;t yet compress.</li>
<li>Infobright&#8217;s product is presumably 	much more mature.</li>
</ul>
<p style="margin-bottom: 0in;">So far as I can tell, that&#8217;s about it. Experience teaches that when a small company claims that some big mathematical breakthrough lets it have huge product superiority in DBMS or analytic tools, the claim doesn&#8217;t pan out. Maybe the company has good technology somewhat inspired by the mathematics, but the more breathless claims are always overwrought. Examples include Infobright (<a href="../2007/10/22/infobright-brighthouse-mysql/">rough set theory</a>), CrossZ/QueryObject (fractals), and KXEN (<a href="http://www.monashreport.com/2006/10/04/kxen-and-verix-try-to-disrupt-the-data-mining-market/" onclick="javascript:pageTracker._trackPageview('/www.monashreport.com');">support vector machines</a>). Algebraix (extended set theory) shows every sign of being another such case. An extended discussion about whether or not join operations are commutative – without benefit of any examples of their supposed non-commutativity – did little to convince me otherwise.</p>
<p style="margin-bottom: 0in;">Finally, as per email from diligent PR guy David King,</p>
<ul>
<li>Algebraix has 22 employees in 	Austin and at its HQ in San Diego.</li>
<li>Algebraix&#8217;s biggest customer is 	the Department of Defense (through BAE Systems).</li>
<li>Algebraix&#8217;s product just became 	commercially available on March 1.</li>
<li>Algebraix has completed one round 	of funding for $11 million.</li>
<li>Algebraix is 100% angel-backed.</li>
<li>Algebraix&#8217;s target market is “any 	mid-large sized organization that needs rapid access to analytics on 	large volumes of data.”</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/06/05/algebraix/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>More on Sybase IQ, including Version 15.2</title>
		<link>http://www.dbms2.com/2010/05/23/sybase-iq-15/</link>
		<comments>http://www.dbms2.com/2010/05/23/sybase-iq-15/#comments</comments>
		<pubDate>Sun, 23 May 2010 08:34:28 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Application areas]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data mart outsourcing]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Market share]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Sybase]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2186</guid>
		<description><![CDATA[Back in March, Sybase was kind enough to give me permission to post a slide deck about Sybase IQ. Well, I&#8217;m finally getting around to doing so. Highlights include but are not limited to:

Slide 2 has some market success figures and so on. (&#62;3100 copies at &#62;1800 users, &#62;200 sales last year)
Slides 6-11 give more [...]]]></description>
			<content:encoded><![CDATA[<p>Back in March, Sybase was kind enough to give me permission to post <a href="http://www.monash.com/uploads/Sybase-IQ-slides-March-2010.pdf" onclick="javascript:pageTracker._trackPageview('/www.monash.com');">a slide deck about Sybase IQ</a>. Well, I&#8217;m finally getting around to doing so. Highlights include but are not limited to:</p>
<ul>
<li>Slide 2 has some market success figures and so on. (&gt;3100 copies at &gt;1800 users, &gt;200 sales last year)</li>
<li>Slides 6-11 give more detail on Sybase&#8217;s indexing and data access methods than I put into my recent <a href="http://www.dbms2.com/2010/05/17/technical-basics-of-sybase-iq/" >technical basics of Sybase IQ</a> post.</li>
<li>Slide 16 reminds us that in-database data mining is quite competitive with what <a href="http://www.dbms2.com/2010/05/15/further-clarifying-in-database-mpp-sas/" >SAS has actually delivered with its DBMS partners</a>, even if it doesn&#8217;t have the nice architectural approach of <a href="http://www.dbms2.com/2010/02/22/netezza-twinfin/" >Aster or Netezza</a>. (I.e., Sybase IQ&#8217;s more-than-SQL advanced analytics story relies on C++ UDFs  &#8212; User Defined Functions &#8212; running in-process with the DBMS.) In particular, there&#8217;s a data mining/predictive analytics library &#8212; modeling and scoring both &#8212; licensed from a small third party.</li>
<li>A number of the other later slides also have quite a bit of technical crunch. (More on some of those points below too.)</li>
</ul>
<p>Sybase IQ may have a bit of a funky architecture (e.g., no MPP), but the age of the product and the substantial revenue it generates have allowed Sybase to put in a bunch of product features that newer vendors haven&#8217;t gotten around to yet.</p>
<p>More recently, Sybase volunteered permission for me to preannounce <strong>Sybase IQ Version 15.2</strong> by a few days (it&#8217;s scheduled to come out this week). <span id="more-2186"></span>Sybase IQ seems to be focused in large part on the government/intelligence market, with three major features being:</p>
<ul>
<li>A kind of <strong>data federation,</strong> querying external databases, that makes sense mainly in the context of rigorous security rules. (I find that confusing, since Sybase IQ&#8217;s indexes tend to hold all the information in the database, but I didn&#8217;t push the point.)</li>
<li>An upgrade to Sybase IQ&#8217;s built-in <strong>text indexing.</strong> I doubt anybody would confuse this with best-of-breed text search, but evidently the intelligence community is satisfied with less. Even before 15.2, Sybase IQ could do both LIKE and WHERE CONTAINS searching.</li>
<li>Improved LOB (Large OBject) management.</li>
</ul>
<p>One part of my Sybase IQ conversations I haven&#8217;t blogged yet in much detail is <strong>scale-out, concurrency, </strong>and<strong> &#8220;multiplexing.&#8221;</strong></p>
<ul>
<li>Sybase feels that Sybase IQ&#8217;s competitive sweet spot, especially in terms of performance, is reached when there are 20 or more concurrent queries.</li>
<li>In general, Sybase asserts that a shared-everything architecture is great for concurrency &#8212; just run different queries on different boxes, all against the same data.</li>
<li>The ability to use a bunch of boxes to run Sybase IQ is called &#8220;multiplexing.&#8221; This is a chargeable option, without which one is limited to a single SMP box.</li>
<li>Just under 20% of the top 250 Sybase IQ customers have multi-node scale-out configurations (vs. single-node SMP scale-up). Overall, around 8% do.</li>
<li>Sybase IQ nodes can be heterogeneous (e.g., in compute power).</li>
<li>Sybase IQ nodes can be dedicated to be read-only, or can be read-write. Indeed, Sybase IQ nodes can change roles dynamically, for example becoming write-only during nightly batch load. (I didn&#8217;t clarify whether all this applies just to nodes-as-boxes, or if some parts apply to specific processors or cores within the same box.)</li>
<li>Sybase noted that data mart outsourcers can offer differentiated SLAs (Service Level Agreements) depending upon which nodes they give which customers access to.</li>
<li>Most Sybase IQ installations start at 8 cores or more. The Sybase IQ Small Business Edition, limited to 4 cores, is not a big seller.</li>
<li>Sybase IQ has a straightforward round-robin load-balancing story via third-party technology.</li>
</ul>
<p>Finally, along the way in the discussions I picked up various tidbits about the Sybase IQ user base. Unfortunately, Sybase is pretty vague in discussing database sizes &#8212; are they user data? Are they compressed? What do the numbers mean? With that huge caveat:</p>
<ul>
<li>By some metric or other, a couple of classified customers are approaching petabyte scale.</li>
<li>The largest commercial Sybase IQ customer &#8212; a credit card company &#8212; has a couple hundred terabytes or so.</li>
<li>The largest financial services Sybase IQ databases are 50-70 terabytes. This sounds low, frankly, so maybe those are compressed figures, with user data being 200+ terabytes. But I&#8217;m just speculating there.</li>
<li>Sybase IQ has a little less than 100 customers in the &#8220;data aggregator&#8221; market, which is a lot like what I call &#8220;data mart outsourcer.&#8221;</li>
<li><a href="http://www.dbms2.com/2009/08/25/sybase-iq-technical-highlights/" >Sybase IQ&#8217;s ILM technology</a> is a chargeable option, with Sybase being &#8220;cautious&#8221; about sales. Compliance is a big market driver for it.</li>
<li>Sybase IQ&#8217;s #1 vertical market is financial services. Other biggies are government, telecom, marketing services, and to some extent retail.</li>
<li>As of February, there were 40-45 production users of Sybase IQ 15.0 and 15.1.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/05/23/sybase-iq-15/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Technical basics of Sybase IQ</title>
		<link>http://www.dbms2.com/2010/05/17/technical-basics-of-sybase-iq/</link>
		<comments>http://www.dbms2.com/2010/05/17/technical-basics-of-sybase-iq/#comments</comments>
		<pubDate>Mon, 17 May 2010 05:18:44 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Sybase]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2163</guid>
		<description><![CDATA[The Sybase IQ folks had been rather slow about briefing me, at least with respect to crunch. They finally fixed that in February. Since then, I&#8217;ve been slow about posting based on those briefings. But what with Sybase being acquired by SAP, Sybase having an analyst meeting this week, and other reasons – well, this [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">The Sybase IQ folks had been rather slow about briefing me, at least with respect to crunch. They finally fixed that in February. Since then, I&#8217;ve been slow about posting based on those briefings. But what with <a href="http://www.dbms2.com/2010/05/13/sap-sybase-reactions/" >Sybase being acquired by SAP</a>, Sybase having an analyst meeting this week, and other reasons – well, this seems like a good time to post about Sybase IQ. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p style="margin-bottom: 0in;">For starters, Sybase IQ is not just a bitmapped system, but it&#8217;s also not all that closely akin to C-Store or Vertica. In particular,</p>
<ul>
<li>Sybase IQ stores data in <strong>columns</strong> – like, for example, Vertica.</li>
<li>Sybase IQ relies on <strong>indexes</strong> to retrieve data – unlike, for example, Vertica, in which the 	column pretty much is the index.</li>
<li>However, columns themselves can be 	used as indexes in the usual <a href="http://www.dbms2.com/2007/01/22/are-row-oriented-rdbms-obsolete/" >Vertica</a>-like way.</li>
<li>Most of Sybase IQ&#8217;s indexes are <strong>bitmaps</strong>, or a lot like bitmaps, à la the original IQ product.</li>
<li>Some of Sybase IQ&#8217;s indexes are 	not at all like bitmaps, but more like <strong>B-trees.</strong></li>
<li>In general, 	Sybase recommends that you put multiple indexes on each column 	because &#8212; what the heck – each one of them is pretty small. (In 	particular, the bitmap-like indexes are highly compressible.) 	Together, indexes tend to take up &lt;10% of Sybase IQ storage 	space.</li>
</ul>
<p style="margin-bottom: 0in;"><span id="more-2163"></span>Sybase IQ is not immune to <a href="http://www.dbms2.com/2010/02/25/sybase-adaptive-server-enterprise-as/" >Sybase&#8217;s confusing choices in version numbering</a>. Thus:</p>
<ul>
<li>Sybase IQ Version 15.2 will be 	announced and released soon.</li>
<li>Sybase IQ Version 15.1 was a set 	of “binary replacements” rather than an “upgrade release” 	for Sybase IQ Version 15.0.</li>
<li>Sybase IQ Version 15.0 was 	launched in February, 2009 and released for general availability 	some time thereafter.</li>
<li>The prior version of Sybase IQ was 	12.7.</li>
<li>GA isn&#8217;t always GA, and some 	language localizations and so on weren&#8217;t ready for a while. 	Consequently, a lot of Sybase IQ sales continued to be of Version 	12.7 even in the second half of 2009.</li>
</ul>
<p style="margin-bottom: 0in;">Now let&#8217;s get down to some technical particulars.</p>
<p style="margin-bottom: 0in;"><strong>Sybase IQ columns are always stored in RowID order.</strong> However, RowIDs are logical and not physical, and hence take up little disk space. A small amount of per-page metadata lets you find the specific cell you want. (Cells are commonly fixed-width, in which case finding the cell of choice is a simple calculation.) So RowIDs are not much of an I/O overhead issue, although I&#8217;m not sure at what point they get unpacked and start needing to be carried around as the data travels through silicon.</p>
<p style="margin-bottom: 0in;">Sybase IQ has 9 or so kinds of indexes. <strong>The choice of index has a lot to do with cardinality.</strong> In the extreme low-cardinality case, a simple bitmap might do. With intermediate cardinality, you might go to a modified kind of bitmap – e.g., if there are 2^16 possible values, you can represent a value in 16 bits, and bitmap operations are approximately 16 times as costly as if the number of possible values were only 2^1. For very high cardinality, there&#8217;s a B-tree-like index called “High Group”.</p>
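<p style="margin-bottom: 0in;">The &#8220;16 times as costly&#8221; arithmetic is easier to see in a sketch. The Python below shows a generic bit-sliced bitmap, in which there is one bitmap per bit of the value, so an equality test has to combine as many bitmaps as the value has bits. This is my assumption about the general shape of such an index, not Sybase&#8217;s actual design; the function names are hypothetical, and plain Python integers stand in for bitmaps.</p>
<pre>
```python
def build_bit_slices(values, bits):
    """One bitmap (a Python int) per bit position of the value.

    Bit r of slice b is set when row r's value has bit b set.
    """
    slices = [0] * bits
    for row, v in enumerate(values):
        for b in range(bits):
            if (v >> b) & 1:
                slices[b] = slices[b] | (2 ** row)
    return slices


def rows_equal_to(slices, target, n_rows):
    """Rows whose value equals `target`: one bitmap operation per slice."""
    result = 2 ** n_rows - 1              # start with every row selected
    for b, bitmap in enumerate(slices):
        if (target >> b) & 1:
            result = result & bitmap      # keep rows with bit b set
        else:
            result = result & ~bitmap     # keep rows with bit b clear
    return [r for r in range(n_rows) if (result >> r) & 1]


# Hypothetical column with 2^3 possible values, hence 3 slices
values = [5, 3, 5, 0, 7]
slices = build_bit_slices(values, 3)
hits = rows_equal_to(slices, 5, len(values))   # rows 0 and 2
```
</pre>
<p style="margin-bottom: 0in;">With 2^1 possible values the loop in <em>rows_equal_to</em> touches one bitmap; with 2^16 it touches sixteen, which is where the rough 16X figure comes from.</p>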
<p style="margin-bottom: 0in;"><em>Note: Surely every Sybase index name, at some time, made sense to at least one engineer.</em></p>
<p style="margin-bottom: 0in;">Sybase IQ&#8217;s <strong>execution engine</strong> does seem to rely quite a bit on bitmaps. E.g., intermediate query results are stored as bitmaps, which helps them play nicely with each other and with many of the indexes. Sybase claims that Sybase IQ&#8217;s bitmap orientation often makes WHERE clauses execute very quickly. Sybase IQ reoptimizes queries after WHERE clauses are evaluated. Complex expressions are, when possible, evaluated once per unique value, not once per row.</p>
<p style="margin-bottom: 0in;">Speaking of unique values – Sybase IQ&#8217;s <strong>compression</strong> story doesn&#8217;t currently match that of some other columnar products, but it seems to stack up pretty well against row-based systems. In particular:</p>
<ul>
<li>Sybase says IQ compression is most 	commonly 50-70%.</li>
<li>Sybase further says that, in most 	cases, compression falls into the range 40-85%.</li>
<li>Page-level LZ compression is 	decompressed upon read (duh).</li>
<li>Dictionary/token compression may 	be decompressed later. For example, GROUP BYs are commonly done on 	tokens, and JOINs sometimes are.</li>
</ul>
<p style="margin-bottom: 0in;">Sybase IQ boasts <strong>pipelining,</strong> in two senses. First, IQ tries to read pages for multiple queries at the same time. Second, Sybase IQ tries to <strong>prefetch</strong> pages into cache before they&#8217;re needed. Sybase points out that these prefetched pages have the WHERE clauses already executed, and that no extra baggage is being dragged into cache that doesn&#8217;t need to be there.</p>
<p style="margin-bottom: 0in;">Highlights of Sybase IQ&#8217;s update and load story include:</p>
<ul>
<li>Sybase IQ is optimized for large 	bulk loads. No surprise there.</li>
<li>Sybase IQ has several options for 	microbatching and/or trickle feeds.
<ul>
<li>The coolest is <a href="http://www.dbms2.com/2010/02/05/sybase-aleri-rap/" >Sybase RAP</a>.</li>
<li>More generally, microbatching is 	based on Change Data Capture. Sybase has various ETL/replication 	technologies, creating a confusing array of options in that regard.</li>
<li>Sybase says that one customer is 	microbatching 1000s of rows with 1 minute latency.</li>
</ul>
</li>
<li>There&#8217;s something about 	snapshotting and hence loads not interfering with queries. I&#8217;m not 	clear on the details.</li>
<li>Assuming you have enough 	parallelism, you can dedicate some nodes to queries while others are 	dedicated to load. (Recall that Sybase IQ is shared-disk.)</li>
</ul>
<p style="margin-bottom: 0in;">I&#8217;ve lost track a little bit as to which “advanced analytics” functionality is in Sybase IQ 15.1, which will be in 15.2, and what&#8217;s a future beyond that, which is a great excuse for me to leave it out of what has already become a rather long post. But anyhow, except perhaps for the future stuff and/or some time series functionality, none of it seems terribly advanced. Sybase IQ does have two stored procedure languages, namely the ones for Sybase ASE (T-SQL) and for Sybase Anywhere or Adaptive Server Anywhere or whatever it&#8217;s called this week (Watcom SQL, which Sybase asserts is similar to the ANSI SQL stored procedure language).</p>
<p style="margin-bottom: 0in;">Similarly, I&#8217;ll leave a lot of other stuff out as well, and for now stop here.</p>
<p style="margin-bottom: 0in;"><em><strong>Related links</strong></em></p>
<ul>
<li>I haven&#8217;t repeated every detail 	here from my <a href="../2009/08/25/sybase-iq-technical-highlights/">August, 	2009 technical post about Sybase IQ</a></li>
<li>And here&#8217;s <a href="http://www.dbms2.com/2010/05/23/sybase-iq-15/" >more about Sybase IQ</a>, including some Sybase IQ 15.2 features, some market penetration info, and a slide deck</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/05/17/technical-basics-of-sybase-iq/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Thoughts on IBM&#8217;s anti-Oracle announcements</title>
		<link>http://www.dbms2.com/2010/04/07/ibm-anti-oracle-announcements/</link>
		<comments>http://www.dbms2.com/2010/04/07/ibm-anti-oracle-announcements/#comments</comments>
		<pubDate>Wed, 07 Apr 2010 15:28:15 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Solid-state memory]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1854</guid>
		<description><![CDATA[IBM is putting out a couple of press releases today that are obviously directed competitively at Oracle/Sun, and more specifically at Oracle&#8217;s Exadata-centric strategy. I haven&#8217;t been briefed, so I just have those to go on.
On the whole, the releases look pretty lame. Highlights seem to include:

Maybe a claim of enhanced data compression.
Otherwise, no obvious [...]]]></description>
			<content:encoded><![CDATA[<p>IBM is putting out a couple of press releases today that are obviously directed competitively at Oracle/Sun, and more specifically at Oracle&#8217;s <a href="http://www.dbms2.com/2010/01/22/oracle-database-hardware-strategy/" >Exadata-centric strateg</a>y. I haven&#8217;t been briefed, so I just have those to go on.</p>
<p>On the whole, the releases look pretty lame. Highlights seem to include:</p>
<ul>
<li>Maybe a claim of enhanced data compression.</li>
<li>Otherwise, no obvious new technology except product packaging and bundling.</li>
<li>Aggressive plans to throw capital at the Sun channel to convert it to selling IBM gear. (A figure of $1/2 billion is mentioned, for financing.)</li>
</ul>
<p>Disappointingly, IBM shows a lot of confusion between:</p>
<ul>
<li>Text data</li>
<li>Machine-generated data such as that from sensors</li>
</ul>
<p>While both are highly important, those are <a href="http://www.dbms2.com/2010/01/17/three-broad-categories-of-data/" >very different things</a>. IBM has not in the past shown much impressive technology in either of those two areas, and based on these releases, I presume that trend is continuing.</p>
<p><em>Edits: </em></p>
<p><em>I see from press coverage that at least one new IBM model has some Fusion I/O solid-state memory boards in it. Makes sense.</em></p>
<p><em>A Twitter hashtag has a number of observations from the event. Not much substance I could detect except various kinds of <a href="http://twitter.com/#search?q=%23ibmsmartsys" onclick="javascript:pageTracker._trackPageview('/twitter.com');">Oracle bashing</a>.<br />
</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/04/07/ibm-anti-oracle-announcements/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
	</channel>
</rss>
