<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: One vendor&#8217;s trash is another&#8217;s treasure</title>
	<atom:link href="http://www.dbms2.com/2009/02/02/one-vendors-trash-is-anothers-treasure/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com/2009/02/02/one-vendors-trash-is-anothers-treasure/</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Thu, 09 Feb 2012 16:57:09 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
	<item>
		<title>By: Kevin Closson</title>
		<link>http://www.dbms2.com/2009/02/02/one-vendors-trash-is-anothers-treasure/#comment-109122</link>
		<dc:creator>Kevin Closson</dc:creator>
		<pubDate>Tue, 03 Feb 2009 21:47:14 +0000</pubDate>
		<guid isPermaLink="false">http://www.dbms2.com/?p=676#comment-109122</guid>
		<description>Curt,

  So now that we&#039;ve finished trudging through the terminology, it should make a lot more sense why I said that the effectiveness of Oracle hash partitioning at handling data skew is between 0% and 100%. 

  The moral of the story is hash partitioning is only effective at handling skew based on the data being loaded. For example, it isn&#039;t that good at evenly loading, say, 42 partitions if the partition key is something like a gender column. I know you are perfectly aware of how a hash function works, but I wanted to put that example out for the casual reader. I hope you&#039;ll indulge me on that...


P.S., On a tangent regarding terminology... I recall that in Informix DSA version 6 table partitions were called fragments. The default placement was round-robin. I found it strange then, as I do now, that it was considered a good thing to have a database residing in &quot;round-robin fragmented storage&quot; considering the generally negative connotation of the word fragment in database-land.</description>
		<content:encoded><![CDATA[<p>Curt,</p>
<p>  So now that we&#8217;ve finished trudging through the terminology, it should make a lot more sense why I said that the effectiveness of Oracle hash partitioning at handling data skew is between 0% and 100%. </p>
<p>  The moral of the story is hash partitioning is only effective at handling skew based on the data being loaded. For example, it isn&#8217;t that good at evenly loading, say, 42 partitions if the partition key is something like a gender column. I know you are perfectly aware of how a hash function works, but I wanted to put that example out for the casual reader. I hope you&#8217;ll indulge me on that&#8230;</p>
<p>P.S., On a tangent regarding terminology&#8230; I recall that in Informix DSA version 6 table partitions were called fragments. The default placement was round-robin. I found it strange then, as I do now, that it was considered a good thing to have a database residing in &#8220;round-robin fragmented storage&#8221; considering the generally negative connotation of the word fragment in database-land.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Curt Monash</title>
		<link>http://www.dbms2.com/2009/02/02/one-vendors-trash-is-anothers-treasure/#comment-109120</link>
		<dc:creator>Curt Monash</dc:creator>
		<pubDate>Tue, 03 Feb 2009 20:39:39 +0000</pubDate>
		<guid isPermaLink="false">http://www.dbms2.com/?p=676#comment-109120</guid>
		<description>Ghassan,

Actually, what would help me avert errors would be to post in less haste.  As to WHY I rushed this post and several others up between the time I submitted the first draft of the article and the time it was finally posted ... well, in the interest of peace and comity, I won&#039;t spell that out.  </description>
		<content:encoded><![CDATA[<p>Ghassan,</p>
<p>Actually, what would help me avert errors would be to post in less haste.  As to WHY I rushed this post and several others up between the time I submitted the first draft of the article and the time it was finally posted &#8230; well, in the interest of peace and comity, I won&#8217;t spell that out.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Curt Monash</title>
		<link>http://www.dbms2.com/2009/02/02/one-vendors-trash-is-anothers-treasure/#comment-109119</link>
		<dc:creator>Curt Monash</dc:creator>
		<pubDate>Tue, 03 Feb 2009 20:35:34 +0000</pubDate>
		<guid isPermaLink="false">http://www.dbms2.com/?p=676#comment-109119</guid>
		<description>Greg,

I think you nailed it.  Hash distribution is exactly what I meant.  Silly me!

CAM</description>
		<content:encoded><![CDATA[<p>Greg,</p>
<p>I think you nailed it.  Hash distribution is exactly what I meant.  Silly me!</p>
<p>CAM</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: ghassan salem</title>
		<link>http://www.dbms2.com/2009/02/02/one-vendors-trash-is-anothers-treasure/#comment-109117</link>
		<dc:creator>ghassan salem</dc:creator>
		<pubDate>Tue, 03 Feb 2009 20:29:04 +0000</pubDate>
		<guid isPermaLink="false">http://www.dbms2.com/?p=676#comment-109117</guid>
		<description>Curt,
Oracle has something called partition-wise join, that works when you join 2 equi-hash-partitioned tables (i.e. same key partitioning, as well as same number of partitions) on the partitioning key. And in a parallel query, this is done in parallel. So, you can get the benefits of hash-join as you might get in a shared-nothing system.
Also, bear in mind that in Oracle, you can range-partition a table on some column(s), and hash-partition it on another column. So you get the benefits of range partitioning (e.g. ILM, fast purge of old data, partition pruning, ...) as well as partition-wise joins.

Have a look at the doc; it will make your posts about Oracle less error-prone.

rgds</description>
		<content:encoded><![CDATA[<p>Curt,<br />
Oracle has something called partition-wise join, that works when you join 2 equi-hash-partitioned tables (i.e. same key partitioning, as well as same number of partitions) on the partitioning key. And in a parallel query, this is done in parallel. So, you can get the benefits of hash-join as you might get in a shared-nothing system.<br />
Also, bear in mind that in Oracle, you can range-partition a table on some column(s), and hash-partition it on another column. So you get the benefits of range partitioning (e.g. ILM, fast purge of old data, partition pruning, &#8230;) as well as partition-wise joins.</p>
<p>Have a look at the doc; it will make your posts about Oracle less error-prone.</p>
<p>rgds</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Greg Rahn</title>
		<link>http://www.dbms2.com/2009/02/02/one-vendors-trash-is-anothers-treasure/#comment-109115</link>
		<dc:creator>Greg Rahn</dc:creator>
		<pubDate>Tue, 03 Feb 2009 20:15:32 +0000</pubDate>
		<guid isPermaLink="false">http://www.dbms2.com/?p=676#comment-109115</guid>
		<description>Just an observation from the sideline...I think the confusion stems from your use of the phrase &lt;strong&gt;hash partitioning&lt;/strong&gt;.  I think you should have used the phrase &lt;strong&gt;hash &lt;em&gt;distribution&lt;/em&gt;&lt;/strong&gt; as you are discussing the physical locality of data vs. the logical grouping of data.

For example, in Netezza you distribute data to the SPUs by using the DDL phrase &lt;strong&gt;distribute on random&lt;/strong&gt; or you can use a phrase of &lt;strong&gt;distribute on [hash] (column(s))&lt;/strong&gt;.  Likewise DB2 has DDL clauses to do both data distribution (&lt;strong&gt;DISTRIBUTE BY&lt;/strong&gt;) and logical grouping (&lt;strong&gt;PARTITION BY &amp; ORGANIZE BY&lt;/strong&gt;).</description>
		<content:encoded><![CDATA[<p>Just an observation from the sideline&#8230;I think the confusion stems from your use of the phrase <strong>hash partitioning</strong>.  I think you should have used the phrase <strong>hash <em>distribution</em></strong> as you are discussing the physical locality of data vs. the logical grouping of data.</p>
<p>For example, in Netezza you distribute data to the SPUs by using the DDL phrase <strong>distribute on random</strong> or you can use a phrase of <strong>distribute on [hash] (column(s))</strong>.  Likewise DB2 has DDL clauses to do both data distribution (<strong>DISTRIBUTE BY</strong>) and logical grouping (<strong>PARTITION BY &amp; ORGANIZE BY</strong>).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: David Aldridge</title>
		<link>http://www.dbms2.com/2009/02/02/one-vendors-trash-is-anothers-treasure/#comment-109100</link>
		<dc:creator>David Aldridge</dc:creator>
		<pubDate>Tue, 03 Feb 2009 14:57:55 +0000</pubDate>
		<guid isPermaLink="false">http://www.dbms2.com/?p=676#comment-109100</guid>
		<description>Curt,

Yes, I think that understanding the role of ASM is critical in some ways to understanding the reason why we use hash partitioning in Oracle, or rather it helps to explain what we do not use it for. Because, as you say, ASM spreads the data over all the available devices, hash partitioning is not used to associate data with particular storage devices in the way that it is for other platforms.

Rather, it is a logical method of subdividing the data to allow more efficient processing. As I mentioned above, it is used to reduce intra-slave messaging on hash joins, but Daniel&#039;s comment on indexing reminds me that it also enables parallel index range scans. To take a simple example, one might partition sales transactions according to the day of the sale, and then hash each day of data into 64 subpartitions. A local index on &quot;transaction dollar amount&quot; could then be scanned in parallel to isolate all sales with a transaction dollar amount in a particular range of values, for example &quot;more than $1,000&quot;, _if_ the optimizer estimated that to be more efficient than scanning the table partition for the entire day of sales.</description>
		<content:encoded><![CDATA[<p>Curt,</p>
<p>Yes, I think that understanding the role of ASM is critical in some ways to understanding the reason why we use hash partitioning in Oracle, or rather it helps to explain what we do not use it for. Because, as you say, ASM spreads the data over all the available devices, hash partitioning is not used to associate data with particular storage devices in the way that it is for other platforms.</p>
<p>Rather, it is a logical method of subdividing the data to allow more efficient processing. As I mentioned above, it is used to reduce intra-slave messaging on hash joins, but Daniel&#8217;s comment on indexing reminds me that it also enables parallel index range scans. To take a simple example, one might partition sales transactions according to the day of the sale, and then hash each day of data into 64 subpartitions. A local index on &#8220;transaction dollar amount&#8221; could then be scanned in parallel to isolate all sales with a transaction dollar amount in a particular range of values, for example &#8220;more than $1,000&#8221;, _if_ the optimizer estimated that to be more efficient than scanning the table partition for the entire day of sales.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Curt Monash</title>
		<link>http://www.dbms2.com/2009/02/02/one-vendors-trash-is-anothers-treasure/#comment-109081</link>
		<dc:creator>Curt Monash</dc:creator>
		<pubDate>Tue, 03 Feb 2009 09:33:52 +0000</pubDate>
		<guid isPermaLink="false">http://www.dbms2.com/?p=676#comment-109081</guid>
		<description>Daniel,

Oracle said that even hash partitions are striped across disks, courtesy of ASM. That was in my notes right next to the observation that hash partitions don&#039;t have their usual somewhat-pseudo-random distribution benefit in the case of Oracle, because the data&#039;s already pseudo-randomly distributed via other mechanisms.  (Those are, of course, the notes I should have checked before incorrectly saying Oracle doesn&#039;t do hash partitioning in Exadata at all.)

CAM</description>
		<content:encoded><![CDATA[<p>Daniel,</p>
<p>Oracle said that even hash partitions are striped across disks, courtesy of ASM. That was in my notes right next to the observation that hash partitions don&#8217;t have their usual somewhat-pseudo-random distribution benefit in the case of Oracle, because the data&#8217;s already pseudo-randomly distributed via other mechanisms.  (Those are, of course, the notes I should have checked before incorrectly saying Oracle doesn&#8217;t do hash partitioning in Exadata at all.)</p>
<p>CAM</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Daniel Abadi</title>
		<link>http://www.dbms2.com/2009/02/02/one-vendors-trash-is-anothers-treasure/#comment-109038</link>
		<dc:creator>Daniel Abadi</dc:creator>
		<pubDate>Tue, 03 Feb 2009 00:02:25 +0000</pubDate>
		<guid isPermaLink="false">http://www.dbms2.com/?p=676#comment-109038</guid>
		<description>One thing to be aware of is that &quot;partition pruning&quot; sometimes means &quot;no parallelism&quot;.

SELECT sum(sales)
FROM table
WHERE store_id = 5

If you hash on store_id, only one partition is involved in answering the query, which means the query runs at the speed of the one disk which contains that partition. If you use round-robin partitioning and have an index on store_id on each partition, the query runs at the speed of all the disks reading (in parallel) the (much smaller) number of &#039;store_id = 5&#039; tuples from the index.

Of course, if you don&#039;t have an index on store_id on each node, you&#039;d probably prefer to just get one disk involved in the query, even if all disks can be scanned in parallel.</description>
		<content:encoded><![CDATA[<p>One thing to be aware of is that &#8220;partition pruning&#8221; sometimes means &#8220;no parallelism&#8221;.</p>
<p>SELECT sum(sales)<br />
FROM table<br />
WHERE store_id = 5</p>
<p>If you hash on store_id, only one partition is involved in answering the query, which means the query runs at the speed of the one disk which contains that partition. If you use round-robin partitioning and have an index on store_id on each partition, the query runs at the speed of all the disks reading (in parallel) the (much smaller) number of &#8216;store_id = 5&#8217; tuples from the index.</p>
<p>Of course, if you don&#8217;t have an index on store_id on each node, you&#8217;d probably prefer to just get one disk involved in the query, even if all disks can be scanned in parallel.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Curt Monash</title>
		<link>http://www.dbms2.com/2009/02/02/one-vendors-trash-is-anothers-treasure/#comment-109008</link>
		<dc:creator>Curt Monash</dc:creator>
		<pubDate>Mon, 02 Feb 2009 20:03:30 +0000</pubDate>
		<guid isPermaLink="false">http://www.dbms2.com/?p=676#comment-109008</guid>
		<description>As for the other -- yeah, I was using the term &quot;hash partitioning&quot; too narrowly. Once again, I&#039;m sorry for the excitement.</description>
		<content:encoded><![CDATA[<p>As for the other &#8212; yeah, I was using the term &#8220;hash partitioning&#8221; too narrowly. Once again, I&#8217;m sorry for the excitement.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Curt Monash</title>
		<link>http://www.dbms2.com/2009/02/02/one-vendors-trash-is-anothers-treasure/#comment-109007</link>
		<dc:creator>Curt Monash</dc:creator>
		<pubDate>Mon, 02 Feb 2009 19:56:49 +0000</pubDate>
		<guid isPermaLink="false">http://www.dbms2.com/?p=676#comment-109007</guid>
		<description>Between 0% and 100%.  Wow. Way to not go out on a limb there, Kevin.</description>
		<content:encoded><![CDATA[<p>Between 0% and 100%.  Wow. Way to not go out on a limb there, Kevin.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

