<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS2 -- DataBase Management System Services &#187; Data integration and middleware</title>
	<atom:link href="http://www.dbms2.com/category/data-integration-middleware/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Sat, 13 Mar 2010 22:47:06 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Two cornerstones of Oracle’s database hardware strategy</title>
		<link>http://www.dbms2.com/2010/01/22/oracle-database-hardware-strategy/</link>
		<comments>http://www.dbms2.com/2010/01/22/oracle-database-hardware-strategy/#comments</comments>
		<pubDate>Fri, 22 Jan 2010 08:59:23 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cache]]></category>
		<category><![CDATA[DBMS product categories]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Storage]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1429</guid>
		<description><![CDATA[After several months of careful optimization, Oracle managed to pick the most inconvenient* day possible for me to get an Exadata update from Juan Loaiza. But the call itself was long and fascinating, with the two main takeaways being:

Oracle thinks flash memory is the most important hardware technology of the [...]]]></description>
			<content:encoded><![CDATA[<p>After several months of careful optimization, Oracle managed to pick the most inconvenient* day possible for me to get an Exadata update from Juan Loaiza. But the call itself was long and fascinating, with the two main takeaways being:</p>
<ul>
<li>Oracle      thinks <strong>flash memory is the most important hardware technology of the      decade,</strong> one that could lead to Oracle being “bumped off” if they don’t      get it right.</li>
<li>Juan      believes <strong>the “bulk” of Oracle’s business will move over to Exadata-like      technology over the next 5-10 years. </strong>Numbers-wise, this seems to be based more      on Exadata being a platform for consolidating an enterprise’s many Oracle databases than it is on Exadata running a few Especially Big Honking Database      management tasks.</li>
</ul>
<p>And by the way, Oracle doesn’t make its storage-tier software available to run on anything other than Oracle-designed boxes. At the moment, that means Exadata Versions 1 and 2. Since Exadata is by far Oracle’s best DBMS offering (at least in theory), that means <strong>Oracle’s best database offering only runs on specific Oracle-sold hardware platforms.<span id="more-1429"></span></strong></p>
<p><em>*E.g., I was sitting upstairs in my parents’ apartment in Columbus, OH having the call while their doctor, whom I’ve never met, was visiting downstairs. He offered to make a special trip back Saturday afternoon because he missed me Wednesday, but he’s notorious for not coming when he says he will.</em> <em>Update: He didn&#8217;t come Saturday. On Saturday he said he&#8217;d come Sunday. He didn&#8217;t do that either.</em></p>
<p>Other high- and lowlights of our conversation included:</p>
<ul>
<li>Flash is the main new hardware element in Exadata Version 2. Otherwise, Exadata 2 is just an annual refresh of Exadata Version 1 to include updated components (Nehalem chips, bigger disk drives, etc.).</li>
<li>Juan      thinks it’s suboptimal to use flash memory through the bottleneck of disk      controllers, favoring PCIe cards instead. (I emphatically agree.)</li>
<li>Juan      resolutely ducked questions about <a href="../../../../../2009/09/25/the-hunt-for-oracle-exadata-production-references/">actual      Exadata production deployment</a>. Literally the only fact he shared in      that regard is that there are at least 2 Exadata production systems      running that each have 2 or more racks cabled together.</li>
<li>Juan      stressed that Exadata runs apps written over Oracle DBMS unchanged.</li>
<li>When      making mixed-workload claims for Exadata 2, Juan stressed consolidation of      multiple databases, some OLTP and some analytic. He didn’t really argue      with my skepticism about <a href="../../../../../2009/09/29/integration-oltp-data-warehousing-exadata-2/">integrating      OLTP and analytics in the same database</a>, with one exception:</li>
<li>Juan      pointed out that in major OLTP apps such as ERP systems, there often is      actually more processing going on in reporting and other batch stuff than      there is in true OLTP.</li>
<li>Exadata      2’s flash memory is designed as a disk cache, smarter than LRU (Least      Recently Used). The two examples Juan gave of “smarter than LRU” are that      backups and table scans don’t flush the cache.</li>
<li>I      forget whether this is new in Exadata 2 (I think it is), but anyhow –      Exadata has a “Storage Index” that’s a lot like a <a href="../../../../../2006/09/20/netezza-vs-conventional-data-warehousing-rdbms/">Netezza      zone map</a>. I.e., for each megabyte or so of data it stores the min and      max value of every column; if a query predicate rules out those ranges,      that megabyte is never retrieved.</li>
<li>Oracle      has long offered what sounds like flexible workload management capability,      and this has now been extended to specifically include I/O resources on      the storage tier.</li>
<li>This isn’t Exadata-specific, but Oracle has built a file system on top of its DBMS, optimized for speed, which helps with, e.g., ELT (Extract/Load/Transform). Evidently, it’s not at all the same thing as Marc Benioff’s 1990s Microsoft-annoying IFS (Internet File System) project, which seems to have morphed into a content management SDK.</li>
</ul>
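<p>To make the Storage Index / zone map idea above concrete, here is a minimal Python sketch. It is purely illustrative and is not Oracle&#8217;s or Netezza&#8217;s implementation: the names <code>Block</code>, <code>build_index</code>, and <code>scan_range</code> are hypothetical, tiny 100-row blocks stand in for the megabyte-or-so regions, and only a single range predicate on one column is handled.</p>
<pre>
```python
from dataclasses import dataclass


@dataclass
class Block:
    """One storage region (a stand-in for 'a megabyte or so' of rows)."""
    rows: list   # each row is a dict of column name to value
    mins: dict   # per-column minimum over the rows in this block
    maxs: dict   # per-column maximum over the rows in this block


def build_index(rows, rows_per_block):
    """Split rows into blocks and record min/max of every column per block."""
    blocks = []
    for start in range(0, len(rows), rows_per_block):
        chunk = rows[start:start + rows_per_block]
        columns = chunk[0].keys()
        mins = {c: min(r[c] for r in chunk) for c in columns}
        maxs = {c: max(r[c] for r in chunk) for c in columns}
        blocks.append(Block(chunk, mins, maxs))
    return blocks


def scan_range(blocks, column, low, high):
    """Return (matching rows, number of blocks actually read).

    A block is skipped whenever its [min, max] range for the column
    cannot overlap the predicate range [low, high].
    """
    matches, blocks_read = [], 0
    for block in blocks:
        if block.maxs[column] < low or block.mins[column] > high:
            continue  # predicate rules out this block, so it is never retrieved
        blocks_read += 1
        matches.extend(r for r in block.rows if low <= r[column] <= high)
    return matches, blocks_read


# Hypothetical data: order_id arrives sorted, amount is scattered.
orders = [{"order_id": i, "amount": (i * 37) % 100} for i in range(1000)]
index = build_index(orders, rows_per_block=100)
rows, blocks_read = scan_range(index, "order_id", 250, 260)
print(len(rows), blocks_read)
```
</pre>
<p>In this toy data set, the predicate on <code>order_id</code> touches only one of the ten blocks, because the column is stored in sorted order and each block covers a narrow range. The same predicate style on <code>amount</code> skips nothing, since every block spans the full range of values. That is the general caveat with this kind of index: it pays off only when the data is physically clustered on the column being filtered.</p>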
<p>Highlights specifically in the area of parallelization included:</p>
<ul>
<li>Juan      stressed that all databases consolidated onto an Exadata machine      are/should be striped across all storage units.</li>
<li>On the      other hand, Juan said that different databases should be confined to      specific cores or CPUs on the database tier.</li>
<li>But on      the third hand, Juan also stressed – in what could be called a “private      cloud” pitch – that there’s great elasticity as to which databases are      matched to which server CPUs.</li>
<li>Contrary      to what <a href="../../../../../2008/09/28/exadata-oracle-database-machine-parallelization/">I      thought he and/or his colleagues told me a year ago</a>, Juan said RAC      (Real Application Clusters) is a big part of Oracle’s data warehouse      processing.</li>
<li>However,      Juan says that what I regard(ed) as a major objection to Oracle’s      database-tier parallelization &#8212; the need to manually specify “degrees of      parallelism” &#8212; has now been obviated by automation. Juan thinks that few      data warehouse DBAs will now need to manually tune parallelism, with minor      exceptions. One exception he cites is that if a nightly report really is      non-urgent, it can just be forced to run on a single core with no chance      to grab more resources. (However, Juan thinks manual tuning of parallelism      will continue to play a greater role in OLTP.)</li>
</ul>
<p>OK. That’s all I can get done tonight (see above re: inconvenience of timing). Follow-on subjects I’d like to and indeed plan to post about include:</p>
<ul>
<li>What      Juan said about hybrid columnar compression</li>
<li>Oracle’s      delightfully non-confidential slide deck, and a few comments about same</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/01/22/oracle-database-hardware-strategy/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Webinar on MapReduce for complex analytics (Thursday, December 3, 10 am and 2 pm Eastern)</title>
		<link>http://www.dbms2.com/2009/12/02/mapreduce-for-complex-analytics-webina/</link>
		<comments>http://www.dbms2.com/2009/12/02/mapreduce-for-complex-analytics-webina/#comments</comments>
		<pubDate>Wed, 02 Dec 2009 20:57:50 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Data integration and middleware]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1267</guid>
		<description><![CDATA[The second in my two-webinar series for Aster Data will occur tomorrow, twice (both live), at 10 am and 2 pm Eastern time. The other presenters will be Jonathan Goldman, who was Principal Scientist at LinkedIn but now has joined Aster himself, and Steve Wooledge of Aster (playing host). Key links are:

Registration for tomorrow&#8217;s webinars
Replay [...]]]></description>
			<content:encoded><![CDATA[<p>The second in my two-webinar series for Aster Data will occur tomorrow, twice (both live), at 10 am and 2 pm Eastern time. The other presenters will be Jonathan Goldman, who was Principal Scientist at LinkedIn but now has joined Aster himself, and Steve Wooledge of Aster (playing host). Key links are:</p>
<ul>
<li>Registration for <a href="http://www.asterdata.com/wc_091203_masteringmapreduce/" onclick="javascript:pageTracker._trackPageview('/www.asterdata.com');">tomorrow&#8217;s webinars</a></li>
<li>Replay of the <a href="http://www.asterdata.com/masteringmapreduce2/" onclick="javascript:pageTracker._trackPageview('/www.asterdata.com');"> first webinar</a></li>
<li>My slides from the <a href="http://www.dbms2.com/2009/10/15/mapreduce-webinar-slides/" >first webinar</a></li>
</ul>
<p>The main subjects of the webinar will be:</p>
<ul>
<li>Some review of material from the first webinar (all three presenters)</li>
<li>Discussion of how MapReduce can help with three kinds of analytics:
<ul>
<li>Pattern matching (Jonathan will give detail)</li>
<li>Number-crunching (I&#8217;ll cover that, and it will be short)</li>
<li>Graph analytics (I haven&#8217;t written the slides yet, but my starting point will be some of the <a href="http://www.dbms2.com/2009/08/21/social-network-analysis-aka-relationship-analytics/" >relationship analytics</a> ideas we discussed in August)</li>
</ul>
</li>
</ul>
<p>Arguably, aspects of data transformation fit into each of those three categories, which may help explain why data transformation has been so prominent among the early applications of MapReduce.</p>
<p>As you can see from Aster&#8217;s title for the webinar (which they picked while I was on vacation), at least their portion will be focused on customer analytics, e.g. web analytics.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/12/02/mapreduce-for-complex-analytics-webina/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Aster Data 4.0 and the evolution of &#8220;advanced analytic(s) servers&#8221;</title>
		<link>http://www.dbms2.com/2009/10/30/aster-data-application-server-ncluster/</link>
		<comments>http://www.dbms2.com/2009/10/30/aster-data-application-server-ncluster/#comments</comments>
		<pubDate>Sat, 31 Oct 2009 01:56:55 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Market share]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1198</guid>
		<description><![CDATA[Since Linda and I are leaving on vacation in a few hours, Aster Data graciously gave me permission to morph its “12:01 am Monday, November 2” embargo into “late Friday night.”
Aster Data is officially announcing the 4.0 release of nCluster. There are two big pieces to this announcement:

Aster is 	offering a slick vision for integrating [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;"><em>Since Linda and I are leaving on vacation in a few hours, Aster Data graciously gave me permission to morph its “12:01 am Monday, November 2” embargo into “late Friday night.”</em></p>
<p style="margin-bottom: 0in; font-style: normal;">Aster Data is officially announcing the 4.0 release of nCluster. There are two big pieces to this announcement:</p>
<ul>
<li>Aster is 	offering a slick vision for integrating big-database management and 	general analytic processing on the same MPP cluster, under the 	not-so-slick name “Data-Application Server.”</li>
<li>Aster is also 	offering a sophisticated vision for workload management.</li>
</ul>
<p style="margin-bottom: 0in; font-style: normal;">In addition, Aster has matured nCluster in various ways, for example cleaning up a performance problem with single-row updates.</p>
<p style="margin-bottom: 0in; font-style: normal;">Highlights of the Aster “Data-Application Server” story include:<span id="more-1198"></span></p>
<ul>
<li>At its core, 	the Aster “Data-Application Server” is the Aster nCluster MPP 	analytic DBMS, enhanced with basic application server functionality 	(I didn&#8217;t ask for details of that part), running on the same 	nCluster worker nodes that answer SQL queries.</li>
<li>Thus, Aster is 	eliminating a lot of the data movement that plagues three-tier 	architectures and other less-integrated approaches.</li>
<li>The Aster 	“Data-Application Server” further offers integrated workload 	management for applications and queries; more on that below.</li>
<li>The Aster 	“Data-Application Server” requires applications to be 	parallelized and invoked via Aster&#8217;s <a href="../2009/10/15/mapreduce-webinar-slides/">SQL/MapReduce.</a></li>
<li>As befits a 	MapReduce-based system, the Aster “Data-Application Server” lets 	you write your applications in lots of different languages (the 	usual suspects, and it also does .NET).</li>
<li>The Aster 	“Data-Application Server” runs applications in their own process 	spaces, protecting the DBMS server from crashes and other damaging 	behavior.</li>
<li>The Aster 	“Data-Application Server” allows applications to manage memory 	themselves, persistently, and not just via relational constructs. 	Thus, if you want your application to maintain a graph, mini rules 	engine, and/or finite state machine, you can, without doing SQL 	contortions.</li>
</ul>
<p style="margin-bottom: 0in; font-style: normal;">In a compelling proof point for the Aster Data-Application Server&#8217;s slickness, Aster has leapfrogged Teradata and Netezza in the extent to which SAS functionality is integrated into Aster&#8217;s DBMS. (Aster and SAS both say that you can do full SAS modeling in parallel on Aster, but even so I wouldn&#8217;t be surprised to discover there were some parts of SAS&#8217; system that turned out to be exceptions.) Of course, Aster is hardly the only analytic DBMS vendor to have the idea of explicitly enhancing general analytic processing; that&#8217;s why we see lots of MapReduce announcements, and it&#8217;s also why Teradata enhanced its UDFs (User-Defined Functions) to have some kind of persistent memory.* But I don&#8217;t know of anybody else whose approach is quite so elegant and general at this time.</p>
<p style="margin-bottom: 0in;"><em>*Unfortunately, I don&#8217;t yet know much about Teradata&#8217;s UDF enhancements. I neglected to drill down on Global Persistent Memory when it was mentioned a couple of times at Teradata Partners last week, and Teradata was unable to accommodate my request this week for a rapid follow-up briefing on the subject.</em></p>
<p style="margin-bottom: 0in; font-style: normal;">Aster&#8217;s approach to workload management is similarly stylish. The idea is:</p>
<ul>
<li>Lots of 	variables are available to be taken into account (e.g., user role, 	expected query duration, actual duration of a running query, etc.)</li>
<li>SQL statements 	can be written against any of these variables.</li>
<li>The SQL 	statements serve as rules to set query/task priorities.</li>
<li>There seem to 	be a few different ways to measure priority, including explicit 	allocation of CPU or I/O resources, as well as more conventional 	“This group of queries gets higher priority than that one” 	kinds of metrics.</li>
<li>The whole 	thing provides integrated workload management for queries, 	applications, load jobs, data redistribution, and so on.</li>
</ul>
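<p>As a rough illustration of the rules-as-SQL idea above, here is a small Python sketch using the standard <code>sqlite3</code> module. Everything in it is an assumption made for illustration: the tables <code>running_tasks</code> and <code>priority_rules</code>, their columns, and the first-matching-rule-wins policy are hypothetical, and none of it reflects Aster&#8217;s actual schema or syntax.</p>
<pre>
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE running_tasks (
        task_id INTEGER PRIMARY KEY,
        user_role TEXT,
        expected_seconds REAL,
        elapsed_seconds REAL
    );
    CREATE TABLE priority_rules (
        rule_order INTEGER,
        predicate TEXT,   -- SQL condition over running_tasks columns
        priority TEXT
    );
""")
conn.executemany("INSERT INTO running_tasks VALUES (?, ?, ?, ?)", [
    (1, "executive", 5.0, 1.0),
    (2, "analyst", 30.0, 400.0),   # running far longer than expected
    (3, "etl", 3600.0, 120.0),
])
conn.executemany("INSERT INTO priority_rules VALUES (?, ?, ?)", [
    (1, "elapsed_seconds > 10 * expected_seconds", "low"),
    (2, "user_role = 'executive'", "high"),
    (3, "1 = 1", "medium"),        # catch-all default
])


def assign_priorities(conn):
    """Apply rules in order; the first rule matching a task sets its priority."""
    rules = conn.execute(
        "SELECT predicate, priority FROM priority_rules ORDER BY rule_order"
    ).fetchall()
    result = {}
    for predicate, priority in rules:
        # The predicate is spliced into SQL, so this sketch assumes the
        # rules table is writable only by trusted administrators.
        query = f"SELECT task_id FROM running_tasks WHERE {predicate}"
        for (task_id,) in conn.execute(query):
            result.setdefault(task_id, priority)
    return result


priorities = assign_priorities(conn)
print(priorities)
```
</pre>
<p>Here the runaway analyst query is demoted, the executive&#8217;s query is promoted, and the load job falls through to the default. Changing the policy means editing rows in a table, with no code changes.</p>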
<p style="margin-bottom: 0in; font-style: normal;">Right now the interface is – well, you&#8217;re manipulating a SQL table. A more conventional workload management GUI is slated for the second quarter of 2010.</p>
<p style="margin-bottom: 0in; font-style: normal;">Discussing subjects such as mirroring and ILM (Information Lifecycle Management) with Aster can be tricky, as Aster uses the word “partition” in confusing ways. Anyhow, Aster has a few different levels of compression, and the ability to apply different levels of compression to different partitions, to change compression levels via ALTER TABLE, and to alter (presumably increase) compression on the fly when doing online backup. Aster is also part of a growing trend to eschew RAID, instead doing mirroring in its own software.  (Other examples of this strategy would be <span><a href="http://www.dbms2.com/2009/10/06/oracle-and-vertica-on-compression-and-other-physical-data-layout-features/" >Vertica</a>, <a href="http://www.dbms2.com/2008/09/28/oracle-database-machine-performance-and-compression/" >Oracle Exadata/ASM</a>, and <a href="http://www.dbms2.com/2009/10/25/teradata-hardware-strategy-and-tactics/" >Teradata Fallback</a>.) </span><span>Prior to nCluster 4.0, this caused a problem, in that the block sizes for mirroring were so large as to create a lag in transactional updating. But Aster says this problem is now solved, and indeed claims that nCluster 4.0 is superior to most rivals in transactional efficiency.</span></p>
<p style="margin-bottom: 0in;">And finally, while I was talking with Aster Data anyway, I checked up on cloud and MapReduce customer penetration. The answers were:</p>
<ul>
<li>Aster has two serious production 	cloud users, both of which have been disclosed for a while, namely:
<ul>
<li>ShareThis, which runs Aster 		nCluster on Amazon EC2</li>
<li>Didit, which runs Aster nCluster 		on AppNexus</li>
</ul>
</li>
<li>Outside of those two, Aster sees 	some cloud use for test, development, prototyping, etc.</li>
<li>Every single Aster customer uses 	<a href="../2009/10/15/mapreduce-webinar-slides/">SQL/MapReduce</a> &#8212; i.e., they invoke MapReduce via Aster nCluster SQL queries.</li>
<li>Some of those customers use MapReduce for ETL, some use it 	for actual analytics.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/30/aster-data-application-server-ncluster/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Teradata&#8217;s nebulous cloud strategy</title>
		<link>http://www.dbms2.com/2009/10/27/teradatas-nebulous-cloud-strategy/</link>
		<comments>http://www.dbms2.com/2009/10/27/teradatas-nebulous-cloud-strategy/#comments</comments>
		<pubDate>Tue, 27 Oct 2009 19:41:47 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Data integration and middleware]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[Theory and architecture]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1180</guid>
		<description><![CDATA[As the pun goes, Teradata&#8217;s cloud strategy is – well, it&#8217;s somewhat nebulous. More precisely, for the foreseeable future, Teradata&#8217;s cloud strategy is a collection of rather disjointed parts, including:

What Teradata calls the Teradata Agile Analytics Cloud, which is a combination of previously existing technology plus one new portlet called the Teradata Elastic Mart(s) [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">As the pun goes, Teradata&#8217;s cloud strategy is – well, it&#8217;s somewhat nebulous. More precisely, for the foreseeable future, Teradata&#8217;s cloud strategy is a collection of rather disjointed parts, including:</p>
<ul>
<li>What Teradata calls the <em>Teradata Agile Analytics Cloud, </em>which is a combination of previously existing technology plus one new portlet called the <em>Teradata Elastic Mart(s) Builder.</em> (Teradata&#8217;s <em>Elastic Mart(s) Builder Viewpoint</em> portlet is available for download from <a href="../2009/05/26/teradata-developer-exchange-devx-begins-to-emerge/">Teradata&#8217;s Developer Exchange</a>.)</li>
<li><em>Teradata Data Mover 2.0,</em> coming “Soon”, which will ease copying (ETL without any 	significant “T”) from one Teradata system to another.</li>
<li><em>Teradata Express</em> DBMS 	crippleware (1 terabyte only, no production use), now available on 	Amazon EC2 and VMware. (I don&#8217;t see where this has much connection to the rest of Teradata&#8217;s cloud strategy, except insofar as it serves to fill out a slide.)</li>
<li>Unannounced (and so far as I can 	tell largely undesigned) future products.</li>
</ul>
<p style="margin-bottom: 0in;">Teradata openly admits that its direction is heavily influenced by Oliver Ratzesberger at <a href="../2009/04/30/ebays-two-enormous-data-warehouses/">eBay</a>. Like Teradata, Oliver and eBay favor virtual data marts over physical ones. That is, Oliver and eBay believe that the ideal scenario is that every piece of data is only stored once, in an integrated Teradata warehouse. But eBay believes and Teradata increasingly agrees that users need a great deal of control over their use of this data, including the ability to import additional data into private sandboxes, and join it to the warehouse data already there.<span id="more-1180"></span></p>
<p style="margin-bottom: 0in;">The <em>Teradata Elastic Mart(s) Builder Viewpoint</em> portlet automates the inclusion of outside data. If you&#8217;re already an authorized Teradata data warehouse user, you can fill in a very short form (three or so fields) and add authorization to import outside data, e.g. from a .CSV file. No fuss, little bother. Trivial as that sounds, when you combine it with Teradata&#8217;s pre-existing robust workload management tools, it creates a pretty good <em>virtual data mart</em> story.</p>
<p style="margin-bottom: 0in;">Spinning out and maintaining consistency with physical data marts is a different matter. Teradata doesn&#8217;t seem too sure it believes in those. And while Teradata is obviously planning to increase its capability in that regard anyway, I didn&#8217;t get a lot of detail beyond the reference to Data Mover 2.0.</p>
<p style="margin-bottom: 0in;"><em><strong>Related links</strong></em></p>
<ul>
<li>My Greenplum-inspired post on <a href="../2009/06/08/the-future-of-data-marts/">the 	future of data marts</a>, outlining issues in “private cloud” 	data warehousing.</li>
<li>eBay&#8217;s “<a href="http://www.xlmpp.com/articles/16-articles/39-analytics-as-a-service" onclick="javascript:pageTracker._trackPageview('/www.xlmpp.com');">Analytics 	as a Service</a>” pitch (about 1 ½ years old)</li>
<li><a href="http://developer.teradata.com/database/articles/what-is-the-teradata-agile-analytics-cloud" onclick="javascript:pageTracker._trackPageview('/developer.teradata.com');">A post by Teradata&#8217;s Dan Graham</a> explaining the <em>Teradata Agile Analytics Cloud</em> and <em>Elastic Mart(s) Builder Viewpoint</em> portlet</li>
<li>Home page and complete screen shot 	for the <a href="http://developer.teradata.com/download/viewpoint/elastic-marts-builder" onclick="javascript:pageTracker._trackPageview('/developer.teradata.com');"><em>Teradata 	Elastic Mart(s) Builder Viewpoint</em> portlet</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/27/teradatas-nebulous-cloud-strategy/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>This week at the Teradata Partners user conference</title>
		<link>http://www.dbms2.com/2009/10/19/teradata-partners-2009/</link>
		<comments>http://www.dbms2.com/2009/10/19/teradata-partners-2009/#comments</comments>
		<pubDate>Mon, 19 Oct 2009 13:07:31 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Data integration and middleware]]></category>
		<category><![CDATA[Data types]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[GIS and geospatial]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Storage]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1150</guid>
		<description><![CDATA[Teradata tells me that its press embargoes are ending at 9:00 this morning. Here are some highlights of what&#8217;s going on, although names, dates, and details will have to await conversations and press releases this week.

Teradata is productizing 	“private cloud,” under names including “Teradata 	Enterprise Analytics Cloud,” “Teradata Agile Analytics Cloud,” 	and “Teradata Elastic Mart [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Teradata tells me that its press embargoes are ending at 9:00 this morning. Here are some highlights of what&#8217;s going on, although names, dates, and details will have to await conversations and press releases this week.</p>
<ul>
<li><strong>Teradata is productizing 	“private cloud,”</strong> under names including “Teradata 	Enterprise Analytics Cloud,” “Teradata Agile Analytics Cloud,” 	and “Teradata Elastic Mart Builder.” I.e., Teradata hopes to 	leapfrog Greenplum in its “<a href="../2009/06/08/the-future-of-data-marts/">Enterprise 	Data Cloud</a>” strategy. This is only fair, in that Greenplum 	lifted the idea from Teradata and eBay in the first place. It also 	provides major support for what I think is an extremely sensible 	trend. Give or take issues of who announces and ships what a couple 	months before or after a competitor, my early thinking is that the 	main differences between Greenplum and Teradata in this regard will 	be:
<ul>
<li>Virtual as opposed to just 	physical data marts, based on robust workload management software. 	(Advantage: Teradata)</li>
<li>Pricing, deployment options. 	(Advantage: Greenplum)</li>
<li>Features that don&#8217;t directly 	relate to enterprise/private cloud. (Advantage: Either, often 	Teradata.)</li>
</ul>
</li>
<li><strong>Teradata is generally 	strengthening its data movement technology</strong>, e.g. for making 	various appliances work in sync. I&#8217;m not too clear yet on the 	details of that. I think this is what Teradata&#8217;s phrase “ecosystem 	management” refers to.</li>
<li><strong>Teradata is (pre-)announcing – 	at least as a statement of direction &#8212; an appliance based on 	solid-state drives (SSDs). </strong>I&#8217;ve thought for a while that 	Teradata was a leader in thinking through <a href="../2008/10/23/teradata-solid-state-drives-ssd/">the 	issues around solid-state memory in data warehousing</a>, so it 	makes sense that they&#8217;re among the leaders in actually coming to 	market as well. I plan to say more after meeting with, e.g., Carson 	Schmidt.</li>
<li><strong>Teradata has achieved a 300%ish 	speed-up in geospatial processing</strong>. I gather this is largely a 	byproduct of the parallel analytics work Teradata did around 	strengthening its SAS integration. However, there don&#8217;t seem to be a 	lot of Teradata geospatial users yet.</li>
<li><span>Teradata 	Express, </span><strong>Teradata&#8217;s free Windows-based crippleware, is being 	ported to Amazon EC2 and VMware</strong> as well. Presumably to avoid 	cannibalizing Teradata product sales, there are quite a few 	limitations on Teradata Express, including system capacity, database 	size, and “no production use.”</li>
<li><strong>Teradata continues to extend its optimizations to handle queries issued by business intelligence tools. </strong><span>Previously, the focus of what Teradata discussed in this regard was <a href="../2009/08/02/teradata-13-focuses-on-advanced-analytic-performance/">query rewrite</a>. But soon automatic recommendation and creation of Aggregate Join Indexes – i.e., materialized views – will be included as well.</span></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/19/teradata-partners-2009/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>How 30+ enterprises are using Hadoop</title>
		<link>http://www.dbms2.com/2009/10/10/enterprises-using-hadoo/</link>
		<comments>http://www.dbms2.com/2009/10/10/enterprises-using-hadoo/#comments</comments>
		<pubDate>Sat, 10 Oct 2009 10:19:29 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Application areas]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Data types]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Native XML]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Text]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1073</guid>
		<description><![CDATA[MapReduce is definitely gaining traction, especially but by no means only in the form of Hadoop. In the aftermath of Hadoop World, Jeff Hammerbacher of Cloudera walked me quickly through 25 customers he pulled from Cloudera&#8217;s files. Facts and metrics ranged widely, of course:

Some are in heavy production with 	Hadoop, and closely engaged with Cloudera. [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">MapReduce is definitely gaining traction, especially but by no means only in the form of Hadoop. In the aftermath of <a href="http://www.dbms2.com/2009/10/01/mapreduce-tidbits/" >Hadoop World</a>, Jeff Hammerbacher of Cloudera walked me quickly through 25 customers he pulled from Cloudera&#8217;s files. Facts and metrics ranged widely, of course:</p>
<ul>
<li>Some are in heavy production with 	Hadoop, and closely engaged with Cloudera. Others are active Hadoop 	users but are very secretive. Yet others signed up for initial 	Hadoop training last week.</li>
<li>Some have Hadoop clusters in the 	thousands of nodes. Many have Hadoop clusters in the 50-100 node 	range. Others are just prototyping Hadoop use. And one seems to be 	&#8220;OEMing&#8221; a small Hadoop cluster in each piece of equipment 	sold.</li>
<li>Many export data from Hadoop to a 	relational DBMS; many others just leave it in HDFS (Hadoop 	Distributed File System), e.g. with <a href="http://www.dbms2.com/2009/05/11/facebook-hadoop-and-hive/" >Hive</a> as the query 	language, or in exactly one case Jaql.</li>
<li>Some are household names, in web 	businesses or otherwise. Others seem to be pretty obscure.</li>
<li>Industries include financial 	services, telecom (Asia only, and quite new), bioinformatics (and 	other research), intelligence, and lots of web and/or 	advertising/media.</li>
<li>Application areas mentioned &#8212; and 	these overlap in some cases &#8212; include:
<ul>
<li>Log and/or clickstream analysis of 	various kinds</li>
<li>Marketing analytics</li>
<li>Machine learning and/or 	sophisticated data mining</li>
<li>Image processing</li>
<li>Processing of XML messages</li>
<li>Web crawling and/or text 	processing</li>
<li>General archiving, including of 	relational/tabular data, e.g. for compliance</li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;"><span id="more-1073"></span>We went over this list so quickly that we didn&#8217;t go into much detail on any one user. But one example that stood out was of an ad serving firm that had an &#8220;aggregation pipeline&#8221; consisting of 70-80 MapReduce jobs.</p>
<p style="margin-bottom: 0in;">I also talked again yesterday with Omer Trajman of Vertica, who surprised me by indicating a high single-digit number of Vertica&#8217;s customers were in production with Hadoop &#8212; i.e., over 10% of Vertica&#8217;s production customers.  (Vertica recently made its 100th sale, and of course not all those buyers are in production yet.) <a href="http://www.dbms2.com/2009/08/04/verticas-version-of-mapreduce-integration/" >Vertica/Hadoop</a> usage seems to have started in Vertica&#8217;s financial services stronghold &#8212; specifically in financial trading &#8212; with web analytics and the like coming on afterwards. Based on current prototyping efforts, Omer expects bioinformatics to be the third production market for Vertica/Hadoop, with telecommunications coming in fourth.</p>
<p style="margin-bottom: 0in;">Unsurprisingly, the general Vertica/Hadoop usage model seems to be:</p>
<ul>
<li>Do something to the data in Hadoop</li>
<li>Dump it into Vertica to be queried</li>
</ul>
<p style="margin-bottom: 0in;">What I did find surprising is that the data often isn&#8217;t reduced by this analysis, but rather exploded in size.  E.g., a complete store of mortgage trading data might be a few terabytes in size, but Hadoop-based post processing can increase that by 1 or 2 orders of magnitude. (Analogies to the importance and magnitude of <em>&#8220;cooked&#8221; data</em> in scientific data processing come to mind.)</p>
<p style="margin-bottom: 0in;">And finally, I talked to Aster a few days ago about the usage of its nCluster/Hadoop connector. Aster characterized Aster/Hadoop users&#8217; Hadoop usage as being of the batch/ETL variety, which is the classic use case one concedes to Hadoop even if one believes that MapReduce should commonly be done right in the DBMS.</p>
<p style="margin-bottom: 0in;"><em><strong>Related link</strong></em></p>
<ul>
<li><a href="../2008/08/26/known-applications-of-mapreduce/">An 	August, 2008 round-up of MapReduce applications</a>.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/10/enterprises-using-hadoo/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Issues in scientific data management</title>
		<link>http://www.dbms2.com/2009/10/03/issues-in-scientific-data-management/</link>
		<comments>http://www.dbms2.com/2009/10/03/issues-in-scientific-data-management/#comments</comments>
		<pubDate>Sat, 03 Oct 2009 05:51:50 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Data integration and middleware]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[SciDB]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Specific users]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=993</guid>
		<description><![CDATA[In the opinion of the leaders of the XLDB and SciDB efforts, key requirements for scientific data management include:

A data model based on multidimensional arrays, not 	sets of tuples
A storage model based on versions and not update in 	place
Built-in 	support for provenance (lineage), workflows, and 	uncertainty
Scalability to 100s of 	petabytes and 1,000s of nodes with [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">In the opinion of the leaders of <a href="../2009/09/12/xldb-scid/">the XLDB and SciDB efforts</a>, key <a href="http://scidb.org/about/history.php" onclick="javascript:pageTracker._trackPageview('/scidb.org');">requirements for scientific data management</a> include:</p>
<ul>
<li>A data model based on <strong>multidimensional arrays,</strong> not 	sets of tuples</li>
<li>A storage model based on <strong>versions</strong> and not update in 	place</li>
<li><span>Built-in 	support for </span><strong>provenance (lineage), workflows, and 	uncertainty</strong></li>
<li>Scalability to <strong>100s of 	petabytes and 1,000s of nodes</strong> with high degrees of <strong>tolerance 	to failures</strong></li>
<li>Support for <strong>&#8220;external&#8221; 	data objects</strong> so that data sets can be queried and manipulated 	without ever having to be loaded into the database</li>
<li><strong>Open source</strong> in order to foster a community of contributors and to ensure that data is never &#8220;locked up&#8221; — a critical requirement for scientists</li>
</ul>
<p style="margin-bottom: 0in;">However:<span id="more-993"></span></p>
<ul>
<li><strong>I think that&#8217;s a dream/wish 	list.</strong> A lot of good could be done without meeting each of those 	six requirements in full.</li>
<li>I think at least some of the 	XLDB/SciDB leaders realize this.</li>
<li>In my opinion, <strong>a highly useful 	subset of the dream/wish list is achievable in the 	reasonably-intermediate future,</strong> in either of two ways:
<ul>
<li><strong>Through a Hadoop-centric open 	source effort,</strong> especially since <a href="../2009/09/13/hadoopdb/">HadoopDB 	opens up the possibility of letting DBMS creators offload MPP 	scaling challenges to somebody else</a>.</li>
<li><strong>From commercial MPP 	software-only</strong><span> (as opposed to 	appliance) </span><strong>DBMS vendors.</strong> I think they can develop the 	needed technology. I also think it could be in their business 	interest to make licensing arrangements of the sort that the 	scientific and research communities would need.</li>
</ul>
</li>
<li>Talking about &#8220;scientific&#8221; 	big data is unhelpfully vague. Let&#8217;s just focus on <strong>multi-dimensional 	measurement- or model-centric data,</strong> from disciplines such as 	seismology (under the Earth&#8217;s surface), climatology (over the 	surface), and astronomy (outer space). That would also include 	disciplines whose three-spatial-dimensions-plus-time data comes from 	inside a laboratory or other man-made environment, such as 	high-energy physics, fluid dynamics, and so on.</li>
<li>One place in all that where there 	should be a commercial-company market is in <strong>oil/gas extraction.</strong><span> And by the way, the energy industry is increasing its uptake of data 	warehousing technology faster these days than any other sector I can 	think of, except perhaps for &#8230;</span></li>
<li>&#8230; web companies that do <strong>log 	file analysis.</strong> Facebook&#8217;s log data has arrays-within-arrays 	reminiscent of the scientists&#8217;. eBay has been a major backer of 	XLDB/SciDB. It&#8217;s far from fully known yet just how much overlap 	there is between log-file-analyzers&#8217; data management needs and those 	of big-data scientists. But there clearly are at least some 	commonalities.</li>
<li>I don&#8217;t get the impression that 	scientists focused on modeling &#8212; e.g. climate-predictors &#8212; have 	been big participants in XLDB. That&#8217;s a pity for at least two 	reasons. First, modeling is at the heart of some of the most 	important global issues scientists address (e.g., climate change). 	Second, it might be an area of particularly rich overlap with 	commercial data management needs.</li>
</ul>
<p style="margin-bottom: 0in;">Now let&#8217;s step back and consider approximately what is meant by the requirements listed above.</p>
<ul>
<li>The requirements for an <strong>array</strong> structure are evidently pretty deep. You can glean some of the 	reasons from the <a href="http://scidb.org/use/" onclick="javascript:pageTracker._trackPageview('/scidb.org');">scientific database 	use cases</a> posted on the SciDB website. In particular:
<ul>
<li>Coordinate data naturally fits 	into arrays.</li>
<li>Coordinate data also naturally 	fits into geospatial ranges and the like.</li>
<li>The &#8220;grid&#8221; for the array 	can be imprecise &#8212; or calculated via transformation &#8212; for a whole 	lot of different reasons.</li>
<li>Different measurements may be 	available for different points in the array. (I think this may be 	the essence of the array-valued-arrays requirement.)</li>
</ul>
</li>
<li>Some reasons scientists want 	<strong>versioning</strong> and support for <strong>data provenance</strong> are pretty 	obvious &#8212; you never want to lose the record of what the instrument 	readings said, or ever were believed to say. But it goes further. 	Data is &#8220;cooked&#8221; &#8212; i.e., transformed/reduced &#8212; and 	stored in huge volumes. So you&#8217;d like to later on be able to go back 	to the raw data and re-cook it.</li>
<li>The <strong>workflow</strong> requirement seems to stem in many cases from data movement needs, which in turn sometimes stem from political issues. I haven&#8217;t yet understood why workflow would actually need to be baked into a scientific DBMS.</li>
<li>By the time the database 	management systems we&#8217;re talking about could conceivably be ready, 	the need will be at least in the 10s of petabytes. <strong>100s of 	petabytes</strong> is a reasonable design goal.</li>
<li>Not that I&#8217;ve run any numbers on 	the matter, but it seems plausible that <a href="../2009/09/13/fault-tolerant-queries/"><strong>query 	fault-tolerance</strong></a> will be needed, at least in some cases.</li>
<li>In many sciences (astronomy seems 	to be an exception), the default choice is to keep data in files 	rather than a DBMS. For example, CERN has a 10 terabyte or so Oracle 	database holding just the metadata for a vastly larger collection of 	data files. Even if the pendulum swings toward greater use of DBMS, 	the ongoing need for <strong>external file access</strong> is pretty obvious.</li>
<li>I suspect that the insistence on 	<strong>open source</strong> is part legitimate, part knee-jerk excessive.
<ul>
<li>&#8220;Free&#8221; is the best 	possible price, of course.</li>
<li>Beyond cash cost, scientists want 	data access to be free of licensing encumbrance. There are two main 	reasons. First, people might want to manage subsets or copies of 	data remotely from its central repository, for a variety of reasons. 	Not all of those reasons are easy to overcome, so any closed-source 	licensing would have to be very comprehensive (e.g., global or at 	least continent-wide &#8220;site&#8221; licensing).</li>
<li>Second, they want assurance that 	data will always be accessible, even if licenses expire. That seems 	a little overwrought. Yes, moving data from one multi-petabyte 	repository to another could be a bit slow. But it&#8217;s not an 	eventuality to panic about.</li>
<li>As for actual community 	development &#8212; scientists sure have a variety of exotic data 	management needs. But I&#8217;m not sure how much talent or resource there 	is among scientists to do true DBMS development (as opposed to, say, 	refining some UDFs). Yes, one XLDB attendee was both an astronomer 	and a PostgreSQL Major Contributor, but he seemed like an exception. 	On the other hand, it&#8217;s not entirely implausible that, in the right 	framework, some people with database talent could be recruited to 	donate some time to the general advancement of science.</li>
</ul>
</li>
<li>I don&#8217;t know much about management 	of <strong>uncertain data,</strong> and will duck that subject for now.</li>
</ul>
<p><em><strong>Related links</strong></em></p>
<ul>
<li><a href="http://www.dbms2.com/2009/10/03/martin-kersten-on-issues-in-scientific-data-management/" >Martin Kersten&#8217;s response</a></li>
<li><a href="http://www.dbms2.com/2009/10/04/jacek-becla-on-issues-in-scientific-data-management/" >Jacek Becla&#8217;s response</a></li>
<li><a href="http://www.dbms2.com/2009/10/10/scientific-data-sharing/" >Scientific data sharing</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/10/03/issues-in-scientific-data-management/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Sybase IQ technical highlights</title>
		<link>http://www.dbms2.com/2009/08/25/sybase-iq-technical-highlights/</link>
		<comments>http://www.dbms2.com/2009/08/25/sybase-iq-technical-highlights/#comments</comments>
		<pubDate>Tue, 25 Aug 2009 09:16:07 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Sybase]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=871</guid>
		<description><![CDATA[General highlights of the Sybase IQ technical story include:

Sybase IQ is an analytic DBMS with 	a columnar/column-store architecture
Unlike most analytic DBMS, Sybase 	IQ has a shared-disk architecture.
The Sybase IQ indexing story is a 	bit complicated, with a bunch of different index kinds. Most are 	focused on columns with low cardinality, and it least in some [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">General highlights of the Sybase IQ technical story include:</p>
<ul>
<li>Sybase IQ is an analytic DBMS with 	a columnar/column-store architecture</li>
<li>Unlike most analytic DBMS, Sybase 	IQ has a shared-disk architecture.</li>
<li>The Sybase IQ indexing story is a bit complicated, with a bunch of different index kinds. Most are focused on columns with low cardinality, and at least in some cases are a lot like bitmaps. (Sybase IQ when first introduced was a pure bitmap index product, with a single index type “Fast Project”.) But one index kind, “High Group” &#8212; designed for columns with high cardinality – is an exception to most generalities about other Sybase IQ index kinds, and instead is more akin to a b-tree.</li>
<li>Unlike Vertica, Sybase stores each 	column of data only once.  I don&#8217;t see how it would make sense to 	have multiple indexes on the same column, but I didn&#8217;t actually ask 	whether doing so is possible or common.</li>
<li>Sybase estimates that Sybase IQ 	requires ¼ the DBA effort of, say, Oracle. (Frankly, that&#8217;s 	not a particularly good figure.) Obviously, this is just a 	broad-brush average.</li>
<li>Sybase recently repurposed an 	acquired ETL tool to be focused on Sybase IQ. IQ of course also 	works with various third-party tools, certified or otherwise.</li>
<li>Sybase&#8217;s PowerDesigner CASE (Computer-Aided Software Engineering)/database design tool works with Sybase IQ.</li>
<li><a href="http://blogs.sybase.com/sybaseiq/2009/07/sybase-iq-151-more-than-meets-the-eye%E2%80%A6/" onclick="javascript:pageTracker._trackPageview('/blogs.sybase.com');">Sybase 	is proud of Sybase IQ&#8217;s new in-database analytics capabilities</a>, 	but I haven&#8217;t yet grasped what, if anything, is differentiated about 	them.</li>
<li>Sybase has an ILM (Information 	Lifecycle Management) story built around the point that different 	columns can be stored on different kinds of media.</li>
</ul>
<p style="margin-bottom: 0in;">Highlights of the Sybase IQ compression story include:<span id="more-871"></span></p>
<ul>
<li>Sybase IQ applies compression to 	both columns and pages</li>
<li>A (the?) major kind of column compression is called “projection” &#8212; why? &#8212; but boils down to token/dictionary compression. Tokens can be 1, 2, or 3 bytes in length – whichever is the best fit for the column&#8217;s cardinality.</li>
<li>I don&#8217;t have details about the 	other kinds of compression.</li>
<li>Data is kept compressed in memory 	“until the latest point possible.”</li>
</ul>
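<p style="margin-bottom: 0in;">To make the token/dictionary idea concrete, here is a minimal Python sketch. It is not Sybase IQ code; the function names and the choice of big-endian fixed-width tokens are assumptions made purely for illustration. The only thing taken from the points above is that the token width (1, 2, or 3 bytes) is picked to fit the column's cardinality.</p>

```python
def token_width(cardinality):
    """Smallest token width in bytes (1, 2, or 3) that can number every distinct value."""
    for width in (1, 2, 3):
        if cardinality <= 256 ** width:
            return width
    raise ValueError("cardinality too high for a 3-byte token")


def dictionary_encode(column):
    """Replace each value with a fixed-width token.

    Returns (dictionary, width, encoded_bytes). The dictionary is the sorted
    list of distinct values; a token is a value's position in that list.
    """
    dictionary = sorted(set(column))
    token_of = {value: i for i, value in enumerate(dictionary)}
    width = token_width(len(dictionary))
    encoded = b"".join(token_of[v].to_bytes(width, "big") for v in column)
    return dictionary, width, encoded


def dictionary_decode(dictionary, width, encoded):
    """Turn the fixed-width tokens back into the original column values."""
    return [
        dictionary[int.from_bytes(encoded[i:i + width], "big")]
        for i in range(0, len(encoded), width)
    ]


# Hypothetical usage: a low-cardinality column needs only 1 byte per row.
states = ["NY", "CA", "NY", "TX", "CA"]
dictionary, width, encoded = dictionary_encode(states)
restored = dictionary_decode(dictionary, width, encoded)
```

<p style="margin-bottom: 0in;">In this sketch a column with up to 256 distinct values costs one byte per row regardless of how long the original values are, which is where the savings would come from.</p>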
<p style="margin-bottom: 0in;">Highlights of the Sybase IQ update and load story include:</p>
<ul>
<li>Sybase claims that only the “High 	Group” index is costly to update.  Specifically, “High Group” 	costs about as much to update as the database itself. Other indexes 	are fairly trivial to update. (Upon reflection, I don&#8217;t immediately 	see why that makes sense.)</li>
<li>There&#8217;s pipelining of some sort 	when a High Group index is updated.</li>
<li>Sybase claims that bulk loads of 	Sybase IQ are very fast.</li>
<li>Loading Sybase IQ doesn&#8217;t block 	queries. Rather, Sybase IQ has some kind of versioning system in 	which a query just executes against older data.</li>
<li>Sybase IQ updating is done in 	parallel. (That would be parallel among servers, of course, since 	Sybase IQ is shared-disk.)</li>
<li>Trickle feed loading of Sybase IQ 	is slow. When you need to do microbatch loading with latency in the 	2-15 minute range, Sybase recommends staging via an OLTP DBMS, 	whether from Sybase or otherwise. Sybase PowerDesigner generates 	scripts for this, and Sybase Replication Server helps with the 	execution.</li>
</ul>
<p style="margin-bottom: 0in;">Highlights of the Sybase IQ concurrency, scalability, and workload management story include:</p>
<ul>
<li>Sybase points out that, because of 	Sybase IQ&#8217;s shared-disk architecture, queries can execute on a 	single server in the “grid.” Thus, if you have enough cores, it 	can be possible to isolate long-running queries from shorter ones.</li>
<li>Similarly, Sybase notes that you 	can meet different SLAs by putting different users&#8217; queries on more- 	or less-crowded Sybase IQ servers.</li>
<li>Sybase further observes that not 	having to move data among nodes saves Sybase IQ from a lot of 	overhead true MPP systems endure.</li>
<li>Sybase makes the usual claim that, 	because Sybase IQ is so efficient, queries finish quickly, and hence 	there&#8217;s less stress on concurrency than one might otherwise think.</li>
<li>I don&#8217;t get the sense that Sybase 	IQ actually boasts a lot of direct workload management features. 	However, there are such features in Sybase&#8217;s flagship ASE product, 	so hopefully adding something similar to Sybase IQ is a product 	future.</li>
</ul>
<p style="margin-bottom: 0in;"><em><strong>Related links</strong></em></p>
<ul>
<li><a href="http://www.dbms2.com/2009/08/25/sybase-iq-business-notes/" >Sybase IQ business notes</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/08/25/sybase-iq-technical-highlights/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Teradata 13 focuses on advanced analytic performance</title>
		<link>http://www.dbms2.com/2009/08/02/teradata-13-focuses-on-advanced-analytic-performance/</link>
		<comments>http://www.dbms2.com/2009/08/02/teradata-13-focuses-on-advanced-analytic-performance/#comments</comments>
		<pubDate>Sun, 02 Aug 2009 23:08:01 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Data types]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[GIS and geospatial]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[SAS Institute]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=855</guid>
		<description><![CDATA[Last October I wrote about the Teradata 13 release of Teradata&#8217;s database management software.  Teradata 13, which will be used across the various Teradata product lines, has now been announced for GCA (General Customer Availability)*. So far as I can tell, there were two main points of emphasis for Teradata 13:

Performance (of course, 	performance [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;"><a href="../2008/10/14/teradata-announcements/">Last</a> <a href="../2008/10/14/teradata-geospatial-datatype-extensibility/">October</a> I wrote about the Teradata 13 release of Teradata&#8217;s database management software.  Teradata 13, which will be used across <a href="../2008/10/23/teradata-appliance-product-lines/">the various Teradata product lines</a>, has now been announced for GCA (General Customer Availability)*. So far as I can tell, there were two main points of emphasis for Teradata 13:</p>
<ul>
<li><strong>Performance</strong> (of course, 	performance is a point of emphasis for almost any release of any 	analytic DBMS product), especially but not only in the areas of 	aggregates, ETL (Extract/Transform/Load), and UDFs.</li>
<li><strong>UDFs (User Defined Functions),</strong> especially but not only in the areas of data mining and geospatial 	analysis.</li>
</ul>
<p style="margin-bottom: 0in;">To put it even more concisely, the focus of Teradata 13 is on <strong>advanced analytic performance,</strong> although there of course are some enhancements in simple query performance and in analytic functionality as well.<span id="more-855"></span></p>
<p style="margin-bottom: 0in;"><em>*Teradata development chief Scott Gnau</em><em> said a couple of customers have already received Teradata 13, although this was recent enough that presumably nobody has it in production.  But let&#8217;s not take all that too literally, since &#8212; for example &#8212; I heard nothing about the length or breadth of the beta cycle.</em></p>
<p style="margin-bottom: 0in;">As just one example, when I asked Scott what was different between Teradata 13 as it is shipping now vs. Teradata 13 as it was foreshadowed back in October, he cited:</p>
<ul>
<li>Improved performance</li>
<li>Additional &#8220;content,&#8221; 	including:
<ul>
<li>Faster loading (sounds like an 	aspect of performance to me)</li>
<li>In-database data mining 	initiatives (these fit in both the &#8220;UDF&#8221; and &#8220;performance&#8221; 	buckets).</li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;">But the parts of Teradata 13 that Scott already discussed back in October, 2008 largely boil down to performance and/or UDFs as well.</p>
<p style="margin-bottom: 0in;">Scott also foreshadowed an area of emphasis for future Teradata releases &#8212; <strong>temporal data analysis.</strong> Teradata 13 offers a new PERIOD datatype, which Scott thinks is a &#8220;sleeper&#8221; on its own for the value customers will find in it. And Scott made it clear that Teradata plans much more functionality for temporal data analysis in the future.</p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">As I understand it, PERIOD works like this: S</span>uppose you have a table that maintains, say, address or employment status. When you update it, you naturally create a Start_Date and End_Date for the validity of certain information. Teradata&#8217;s PERIOD datatype automagically uses this to maintain a Period where information was true, even when that period is wholly in the past. Thus when you update a row with new information, you wind up with two rows &#8212; the newly changed row, and also a second row with the old information and an effectiveness period for same.</p>
<p style="margin-bottom: 0in;"><em>Note: I have no further detail about Teradata&#8217;s PERIOD datatype at this time. Even what I said includes enough guesswork that there are probably at least small errors in it.</em></p>
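<p style="margin-bottom: 0in;">The two-rows-per-update behavior described above can be sketched in a few lines of Python. This is only an illustration of the general idea, under the same guesswork as the description itself; it is not Teradata syntax or Teradata's implementation, and every name in it (<code>update_row</code>, <code>value_as_of</code>, <code>OPEN_END</code>, the row fields) is hypothetical.</p>

```python
from datetime import date

# Stand-in for "still valid"; an assumption of this sketch, not a Teradata constant.
OPEN_END = date.max


def update_row(history, key, new_value, effective):
    """Close the currently valid row for `key` and append a new one.

    After an update there are two rows for the key: the old value with a
    closed validity period, and the new value with an open-ended one.
    """
    for row in history:
        if row["key"] == key and row["valid_to"] == OPEN_END:
            row["valid_to"] = effective
    history.append(
        {"key": key, "value": new_value, "valid_from": effective, "valid_to": OPEN_END}
    )


def value_as_of(history, key, when):
    """Return the value that was valid for `key` on `when`, or None.

    Periods are treated as half-open: valid_from <= when < valid_to.
    """
    for row in history:
        if row["key"] == key and row["valid_from"] <= when < row["valid_to"]:
            return row["value"]
    return None


# Hypothetical usage: an employee's address changes once.
history = []
update_row(history, "emp1", "Boston", date(2008, 1, 1))
update_row(history, "emp1", "Chicago", date(2009, 6, 1))
old_address = value_as_of(history, "emp1", date(2008, 7, 1))
new_address = value_as_of(history, "emp1", date(2009, 7, 1))
```

<p style="margin-bottom: 0in;">The point of the sketch is that the old information is never overwritten, so a question about a period wholly in the past can still be answered.</p>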
<p style="margin-bottom: 0in;">The<strong> Teradata 13 UDF, in-database data mining, and SAS integration stories</strong> seem to go something like this:</p>
<ul>
<li>Teradata offered UDF support in C 	before Teradata 13. With Teradata 13 it supports Java UDFs as well.</li>
<li>Any Teradata UDF is automatically 	parallel, running across all nodes, etc.</li>
<li>Teradata 13 cleans up a variety of 	UDF issues, including:
<ul>
<li>Allowing the use of UDFs in certain 	aggregates that didn&#8217;t support them before.</li>
<li>Recursion, whatever that means in 	this context. (Perhaps the prior point is a hint.)</li>
<li>Extended memory management/making 	more memory available to UDFs.</li>
</ul>
</li>
<li>Teradata&#8217;s work to enhance <a href="../2007/10/10/sas-goes-mpp-on-teradata-first/">SAS integration</a> has been focused on its general UDF framework. The memory management extensions seem to have been particularly important to running SAS. <em>(Note: That link refers to putting SAS on a &#8220;single node&#8221; in a Teradata grid. Scott gave me the impression that no such thing was possible. So I&#8217;m a bit confused. I&#8217;m also not sure it matters much.)</em></li>
<li>Teradata 	expects this same general UDF framework to support integration with 	a variety of analytic technologies. But  the only examples actually 	discussed were SAS and geospatial.</li>
<li>Actually, we 	didn&#8217;t really discuss geospatial much either, so I&#8217;ll just refer you 	back to my October, 2008 post (already linked above) about 	<a href="../2008/10/14/teradata-geospatial-datatype-extensibility/">Teradata&#8217;s 	geospatial datatype</a>.</li>
</ul>
<p style="margin-bottom: 0in;">Besides UDFs, the other <strong>performance</strong> focus in Teradata 13 seems to be <strong>aggregations and OLAP. </strong>One Teradata 13 performance boost lies in aggressive <strong>query rewriting.</strong> Business intelligence tools, written to support multiple analytic DBMS (including non-current versions), can produce very messy SQL queries. Teradata 13 takes an optimizing compiler mindset to those, and in some cases can get significant speedup as a result. I get the impression there was work on <strong>other OLAP and aggregation speed-ups </strong>as well.</p>
<p style="margin-bottom: 0in;">Also, Teradata 13 added a feature for load performance that Scott cites as being useful in the cases of heavy ETL (actually, it sounded more like ELT &#8212; Extract/Load/Transform) and OLAP aggregate-building. Namely, for the first time Teradata lets you <strong>turn off hash distribution.</strong> Teradata still wants you to hash-distribute whatever you&#8217;re going to persist to disk. But if you&#8217;re just creating a temporary table that will be dropped as soon as the load process completes, you&#8217;re now allowed to skip the hash distribution step.  Scott says this can lead to &gt;30% improvements in load performance.</p>
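<p style="margin-bottom: 0in;">For readers who want a picture of what skipping hash distribution saves, here is a rough Python sketch. It is not how Teradata works internally. The CRC32 hash, the round-robin fallback, and both function names are assumptions chosen only to contrast "place each row by hashing its key" with "place each row wherever is cheapest."</p>

```python
import zlib


def hash_distribute(rows, key, node_count):
    """Place each row on the node chosen by hashing its key column.

    Rows with the same key always land on the same node, which is what makes
    later joins and lookups on that key cheap, at the cost of hashing and
    routing every row during the load.
    """
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        h = zlib.crc32(str(row[key]).encode("utf-8"))
        nodes[h % node_count].append(row)
    return nodes


def distribute_without_hashing(rows, node_count):
    """Place rows round-robin, ignoring their contents.

    Nothing is hashed, but rows with the same key may be scattered across
    nodes, which is acceptable for a temporary table that will be dropped.
    """
    nodes = [[] for _ in range(node_count)]
    for i, row in enumerate(rows):
        nodes[i % node_count].append(row)
    return nodes


# Hypothetical usage: 12 rows, 3 distinct keys, 4 nodes.
rows = [{"id": i % 3, "v": i} for i in range(12)]
hashed = hash_distribute(rows, "id", 4)
unhashed = distribute_without_hashing(rows, 4)
```

<p style="margin-bottom: 0in;">In the hashed layout every key lives on exactly one node; in the unhashed layout the rows are spread evenly but keys are scattered, which is the trade-off being accepted for short-lived load tables.</p>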
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/08/02/teradata-13-focuses-on-advanced-analytic-performance/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>While I&#8217;m venting about benchmarks</title>
		<link>http://www.dbms2.com/2009/07/08/while-im-venting-about-benchmarks/</link>
		<comments>http://www.dbms2.com/2009/07/08/while-im-venting-about-benchmarks/#comments</comments>
		<pubDate>Wed, 08 Jul 2009 23:58:48 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Benchmarks and POCs]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data integration and middleware]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Vertica Systems]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=837</guid>
		<description><![CDATA[Late last year, Vertica made hoo-hah about what it called a world-record data warehouse load speed benchmark.  I wrote at the time that this showed Vertica wasn&#8217;t painfully slow at loading, always a concern with column stores. But otherwise I mocked the idea that there was something useful to be learned from the whole exercise.
Well, [...]]]></description>
			<content:encoded><![CDATA[<p>Late last year, Vertica made hoo-hah about what it called a <a href="http://www.dbms2.com/2008/12/02/data-warehouse-load-speeds-in-the-spotlight/" >world-record data warehouse load speed benchmark</a>.  I wrote at the time that this showed Vertica wasn&#8217;t painfully slow at loading, always a concern with column stores. But otherwise I mocked the idea that there was something useful to be learned from the whole exercise.</p>
<p>Well, guess what?  In a throwaway line in a comment on <a href="http://dbmsmusings.blogspot.com/2009/07/paraccel-and-their-puzzling-tpc-h.html" onclick="javascript:pageTracker._trackPageview('/dbmsmusings.blogspot.com');">Daniel Abadi&#8217;s blog</a>, Barry Zane of ParAccel pointed out</p>
<blockquote><p>we posted a load rate of almost 9TB/hour, which is, of course record breaking on its own</p></blockquote>
<p>Quite right.</p>
<p>I hope the nonsense stops there, but I&#8217;m not optimistic &#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2009/07/08/while-im-venting-about-benchmarks/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
