<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; Data integration and middleware</title>
	<atom:link href="http://www.dbms2.com/category/data-integration-middleware/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Thu, 02 Sep 2010 09:06:44 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>The Workday architecture &#8212; a new kind of OLTP software stack</title>
		<link>http://www.dbms2.com/2010/08/22/workday-technology-stack/</link>
		<comments>http://www.dbms2.com/2010/08/22/workday-technology-stack/#comments</comments>
		<pubDate>Sun, 22 Aug 2010 10:20:08 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Data integration and middleware]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Object]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Theory and architecture]]></category>
		<category><![CDATA[Workday]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2865</guid>
		<description><![CDATA[One of my coolest company visits in some time was to SaaS (Software as a Service) vendor Workday, Inc., earlier this month. Reasons included:

Workday has forward-thinking ideas about SaaS enterprise applications and the integration of business intelligence into same.
Workday has highly innovative ideas in how it manages data.
Companies founded by Dave Duffield tend [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;"><span style="font-size: small;">One of my coolest company visits in some time was to </span><span style="font-size: small;"> SaaS  (Software as a Service) vendor</span><span style="font-size: small;"> Workday, Inc., earlier this month. Reasons included:</span></p>
<ul>
<li><span style="font-size: small;">Workday has 	forward-thinking ideas about SaaS enterprise 	applications and the integration of business intelligence into same.</span></li>
<li><span style="font-size: small;">Workday has highly 	innovative ideas in how it manages data.</span></li>
<li><span style="font-size: small;">Companies founded by 	Dave Duffield tend to feature smart, likeable people who talk to one</span><span style="font-size: small;"><span style="font-style: normal;"> pleasantly and forthrightly. Workday is no exception; CTO Stan Swete 	and the other Workday folks present were a delight to talk with.</span></span></li>
<li><span style="font-size: small;"><span style="font-style: normal;">I&#8217;d 	invited Merv Adrian to come along with me. He asked great questions, 	and I could gather myself a bit despite how sleep-deprived I was for 	the first part of that trip.</span></span></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-size: small;"><span style="font-style: normal;">Workday kindly allowed me to post this </span></span><span style="font-size: small;"><a href="http://www.monash.com/uploads/Workday-August-2010.ppt" onclick="javascript:pageTracker._trackPageview('/www.monash.com');">Workday slide deck</a>.</span><span style="font-size: small;"><span style="font-style: normal;"> Otherwise, I&#8217;ve split out a quick </span></span><a href="http://www.dbms2.com/2010/08/22/workday-inc-company-overview/" ><span style="font-size: small;">Workday, Inc. company overview</span></a><span style="font-size: small;"><span style="font-style: normal;"> into a separate post.</span></span></p>
<p style="margin-bottom: 0in;"><span style="font-size: small;"><span style="font-style: normal;">The biggie for me was the data and object management part. Specifically:  <span id="more-2865"></span><br />
</span></span></p>
<ul>
<li><span style="font-size: small;"><span style="font-style: normal;"><strong>Workday&#8217;s 	applications run entirely in-memory,</strong></span></span><span style="font-size: small;"><span style="font-style: normal;"> in a highly object-oriented structure. Persistence is mainly for the 	sake of data safety …</span></span></li>
<li>… <span style="font-size: small;"><span style="font-style: normal;">but not entirely. In earlier releases, Workday kept absolutely everything in RAM. Now, however, certain things are kept only on disk, such as:</span></span>
<ul>
<li><span style="font-size: small;"><span style="font-style: normal;">Audit 	files.</span></span></li>
<li><span style="font-size: small;"><span style="font-style: normal;">Certain 	documents (notably resumes).</span></span></li>
</ul>
</li>
<li><strong><span style="font-size: small;"><span style="font-style: normal;">Workday&#8217;s 	whole database</span></span></strong><span style="font-size: small;"><span style="font-style: normal;"><span style="font-weight: normal;"> – data and metadata alike – is persisted to disk in </span></span></span><strong><span style="font-size: small;"><span style="font-style: normal;">&lt;10 	MySQL/InnoDB tables. </span></span></strong><span style="font-size: small;"><span style="font-style: normal;"><span style="font-weight: normal;">MySQL 	is basically just being used as a </span></span></span><strong><span style="font-size: small;"><span style="font-style: normal;">key-value 	store, </span></span></strong><span style="font-size: small;"><span style="font-style: normal;"><span style="font-weight: normal;">albeit 	one with </span></span></span><strong><span style="font-size: small;"><span style="font-style: normal;">ACID 	transactional support. </span></span></strong>
<ul>
<li><span style="font-size: small;">There <span style="font-weight: normal;">are </span><strong>3 main tables: attributes, relationships, instances.</strong></span></li>
<li><span style="font-size: small;"><span style="font-style: normal;">When 	I suggested this might be like an entity-attribute-value model, 	Workday said it would be even better to think in terms of</span><span style="font-style: normal;"><strong> instanceID-attribute-value.</strong></span></span></li>
<li><span style="font-size: small;"><span style="font-style: normal;">As 	you might expect for a database that simple, its schema doesn&#8217;t 	change much.</span></span></li>
<li><span style="font-size: small;"><span style="font-style: normal;"><span style="font-weight: normal;">By 	way of comparison, Workday estimates that if its software were 	written relationally, </span></span></span><span style="font-size: small;"><span style="font-style: normal;">there 	would b</span></span><span style="font-size: small;"><span style="font-style: normal;"><span style="font-weight: normal;">e </span></span></span><span style="font-size: small;"><span style="font-weight: normal;"><a href="http://www.dbms2.com/2010/08/22/workday-stan-swete-database-architecture/" >1000s 	of tables</a>,</span></span><span style="font-size: small;"><span style="font-style: normal;"><span style="font-weight: normal;"> which</span></span></span><span style="font-size: small;"><span style="font-style: normal;"> would take up 10-100X as much disk space. </span></span></li>
</ul>
</li>
<li><span style="font-size: small;"><span style="font-style: normal;">All 	write transactions are banged immediately into the MySQL database. 	I.e., RAM and disk are never allowed to get out of sync.</span></span></li>
<li><span style="font-size: small;"><span style="font-style: normal;">Workday&#8217;s 	database is append-only. This is exploited for effective dating 	(pretty heavily, it seems, perhaps because that&#8217;s a useful concept 	in human resources) and snapshotted reporting.</span></span></li>
<li><span style="font-size: small;"><span style="font-style: normal;">Workday&#8217;s 	built-in BI doesn&#8217;t have a lot of choice but to do scans, traversing 	the object model. This turns out to be fast enough.</span></span></li>
</ul>
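<p style="margin-bottom: 0in;">To make the instanceID-attribute-value idea concrete, here is a minimal sketch of an append-only store with effective dating. It is an illustration under loud assumptions, not Workday&#8217;s actual design: Python&#8217;s built-in sqlite3 stands in for MySQL/InnoDB, the relationships table is omitted, and every table, column and function name below is hypothetical.</p>
<pre>
```python
import sqlite3

# Hypothetical schema. sqlite3 stands in for MySQL/InnoDB, and all table,
# column and function names are illustrative assumptions.
SCHEMA = """
CREATE TABLE instances (
    instance_id INTEGER PRIMARY KEY,
    class_name  TEXT NOT NULL
);
CREATE TABLE attributes (
    instance_id    INTEGER NOT NULL REFERENCES instances(instance_id),
    attribute      TEXT NOT NULL,
    value          TEXT,
    effective_date TEXT NOT NULL  -- ISO date, so string order is date order
);
"""


def open_store():
    """Create an empty in-memory store with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def create_instance(conn, class_name):
    """Register a new object instance and return its instance ID."""
    with conn:  # commit at once, so the persisted copy never lags
        cur = conn.execute(
            "INSERT INTO instances (class_name) VALUES (?)", (class_name,)
        )
    return cur.lastrowid


def set_attribute(conn, instance_id, attribute, value, effective_date):
    """Append a new version; existing rows are never updated or deleted."""
    with conn:
        conn.execute(
            "INSERT INTO attributes VALUES (?, ?, ?, ?)",
            (instance_id, attribute, value, effective_date),
        )


def get_attribute(conn, instance_id, attribute, as_of):
    """Return the value in effect on the as_of date, or None if none was."""
    row = conn.execute(
        "SELECT value FROM attributes "
        "WHERE instance_id = ? AND attribute = ? AND ? >= effective_date "
        "ORDER BY effective_date DESC, rowid DESC LIMIT 1",
        (instance_id, attribute, as_of),
    ).fetchone()
    return row[0] if row else None
```
</pre>
<p style="margin-bottom: 0in;">In this sketch, a job-title change is just one more appended row, so asking for the title as of an earlier date still returns the earlier value. That is the property that makes effective dating and snapshotted reporting cheap in an append-only design.</p>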
<p style="margin-bottom: 0in;"><span style="font-size: small;">Other notes on Workday&#8217;s data and object management strategy include:</span></p>
<ul>
<li><span style="font-size: small;">Workday is 	object-oriented through and through – no object-relational mapping 	&#8211; <a href="http://en.wikipedia.org/wiki/Turtles_all_the_way_down" onclick="javascript:pageTracker._trackPageview('/en.wikipedia.org');">turtles 	all the way down</a>. On average, a class has about 2 attributes.</span></li>
<li><span style="font-size: small;">94% of requests are 	reads, traversing the object hierarchy.</span></li>
<li><span style="font-size: small;"><span style="font-style: normal;">Workday 	databases are pretty small.</span></span>
<ul>
<li><span style="font-size: small;"><span style="font-style: normal;">The 	biggest database Workday supports uses 17 gigabytes of RAM. </span></span></li>
<li><span style="font-size: small;"><span style="font-style: normal;">Workday 	databases are much smaller on disk than in RAM.</span></span></li>
</ul>
</li>
<li><span style="font-size: small;">Workday&#8217;s “dream” 	is to move from disk to solid-state memory. </span></li>
<li><span style="font-size: small;">Workday uses GPLed 	MySQL/InnoDB. So there&#8217;s no software license reason to ever move 	away (e.g., to a pure key-value store).</span></li>
<li><span style="font-size: small;">Disaster recove</span><span style="font-size: small;"><span style="font-style: normal;">ry 	is based on local and remote MySQL slaves. </span></span></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-size: small;"><span style="font-style: normal;">Obviously, serious apps have been built before in object-oriented and/or key-value ways, with the resulting objects then being banged to disk (or in some cases kept in memory). Examples include:</span></span></p>
<ul>
<li><span style="font-size: small;"><span style="font-style: normal;">Numerous 	applications are built on <a href="../2010/01/15/intersystems-cache-highlights/">object-oriented 	DBMS</a>. Generally they go against disk, although <a href="../2005/11/14/defining-and-surveying-memory-centric-data-management/">memory-centric 	implementations can save a lot of pointer-chasing</a>. Often they&#8217;re 	queried via SQL.</span></span></li>
<li><span style="font-size: small;"><span style="font-style: normal;">Basho&#8217;s 	website says that its key-value store Riak was originally conceived 	in connection with a planned salesforce automation product, but I 	don&#8217;t think that the application part of that plan ever got built. </span></span></li>
<li><span style="font-size: small;"><span style="font-style: normal;">SAP 	has <a href="../2005/12/09/36/">longstanding</a> doubts about relational dogma, although not nearly to Workday&#8217;s 	extreme.</span></span></li>
<li><span style="font-size: small;">Obviously, 	some major internet applications just bang data into key-value 	stores.</span></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-size: small;"><span style="font-style: normal;">Still, perhaps because it is wholly object-oriented yet doesn&#8217;t even bother with anything like a real object-oriented DBMS, Workday&#8217;s approach seems particularly cool. </span></span></p>
<p style="margin-bottom: 0in;"><span style="font-size: small;"><span style="font-style: normal;">Other highlights of Workday, Inc.&#8217;s technical story include:</span></span></p>
<ul>
<li><span style="font-size: small;"><span style="font-style: normal;">Workday 	has settled into a schedule of three releases per year, and has 	pretty much lived up to that for &gt;2 years.</span></span>
<ul>
<li><span style="font-size: small;"><span style="font-style: normal;">Every 	user is always on the latest Workday release.</span></span></li>
<li><span style="font-size: small;"><span style="font-style: normal;">You 	can delay turning on significant new Workday software functionality 	if you want to.</span></span></li>
<li><span style="font-size: small;"><span style="font-style: normal;">Pure 	UI changes to the Workday software are handled much as they are on 	various websites today. Sometimes you have no choice but to live 	with them; sometimes the prior version of the UI remains available 	to you for a while.</span></span></li>
</ul>
</li>
<li><span style="font-size: small;"><span style="font-style: normal;">Workday&#8217;s 	navigational approaches look pretty cool.</span></span>
<ul>
<li><span style="font-size: small;"><span style="font-style: normal;">The 	core concept is a list of actions you can perform now, rather than 	more standard menus.</span></span></li>
<li><span style="font-size: small;"><span style="font-style: normal;">Roles/permissions 	are of course central to this.</span></span></li>
<li><span style="font-size: small;"><span style="font-style: normal;">Reports 	have lots of actionable links in them. (More than just drilldown, 	although specific examples have slipped my memory.)</span></span></li>
<li><span style="font-size: small;"><span style="font-style: normal;">Alternatively, you can navigate via a search box, searching on names of objects (e.g. users, divisions) or on names of tasks. This is somewhat reminiscent of <a href="http://www.texttechnologies.com/2007/02/28/sap%E2%80%99s-%E2%80%9Csearch%E2%80%9D-strategy-isn%E2%80%99t-about-search/" onclick="javascript:pageTracker._trackPageview('/www.texttechnologies.com');">an approach SAP was considering a few years ago</a>.</span></span></li>
</ul>
</li>
<li><span style="font-size: small;">Workday says it has 	four key design premises:</span>
<ul>
<li><span style="font-size: small;"><em>Web-Familiar Experience.</em> I&#8217;d say that&#8217;s true to the extent it makes sense. In many ways, the web needs to catch up to Workday.</span></li>
<li><span style="font-size: small;"><em>Enterprise 	Reporting.</em> The idea is that you get a report, then take actions 	based on it. Hence the report-centric options for navigation.</span></li>
<li><span style="font-size: small;"><em>Integration 	On-Demand.</em> That&#8217;s a fancy way of saying “Plays nicely with 	others.”</span></li>
<li><span style="font-size: small;"><em>Configurable 	Business Processes.</em><span style="font-style: normal;"> Duh. That&#8217;s 	pretty essential if you want to do serious SaaS applications.</span></span></li>
</ul>
</li>
<li><span style="font-size: small;"><span style="font-style: normal;">Workday maintains a strong separation between application logic and UI development. Developers do no screen layouts. Instead, UIs are automatically generated for:</span></span>
<ul>
<li><span style="font-size: small;">Flash/FLEX</span></li>
<li><span style="font-size: small;">iPhone</span></li>
<li><span style="font-size: small;">Mobile HTML</span></li>
<li><span style="font-size: small;">PDF export</span></li>
<li><span style="font-size: small;">Excel export</span></li>
</ul>
</li>
<li><span style="font-size: small;"><span style="font-style: normal;">Workday 	only talks to the outside world via web services.</span></span>
<ul>
<li><span style="font-size: small;"><span style="font-style: normal;">Workday 	is heavily </span></span><span style="font-size: small;"><span style="font-style: normal;"><span style="font-weight: normal;">into 	SOAP (Simple Object Access Protocol). </span></span></span></li>
<li><span style="font-size: small;"><span style="font-style: normal;"><span style="font-weight: normal;">The 	acquisition of OEM partner CapeClear gave Workday an Integration 	Service (i.e., enterprise service bus) that translates SOAP into 	whatever else might be needed for integration, and also does 	reliable delivery. </span></span></span></li>
<li><span style="font-size: small;"><span style="font-style: normal;"><span style="font-weight: normal;">All 	that said, Stan Swete sees integration among various SaaS offerings 	as an area needing significant future attention.</span></span></span></li>
</ul>
</li>
<li><span style="font-size: small;"><span style="font-style: normal;"><span style="font-weight: normal;">Workday&#8217;s 	business intelligence ideas are interesting, but I think there&#8217;s a 	long way for that technology still to go.</span></span></span>
<ul>
<li><span style="font-size: small;"><span style="font-style: normal;"><span style="font-weight: normal;">Workday&#8217;s 	BI seems to be focused on report/drilldown kinds of functionality.</span></span></span>
<ul>
<li><span style="font-size: small;"><span style="font-style: normal;"><span style="font-weight: normal;">You 	can slice by up to 2 dimensions at once.</span></span></span></li>
<li><span style="font-size: small;"><span style="font-style: normal;"><span style="font-weight: normal;">Then 	you can keep slicing, however, by more dimensions, as many times as 	you like.</span></span></span></li>
</ul>
</li>
<li><span style="font-size: small;"><span style="font-style: normal;"><span style="font-weight: normal;">While 	you can take actions straight from reports, some of the specific 	BI/app integration ideas we discussed are still futures. (E.g., 	analyzing spend at the time of expense report data entry or 	approval.)</span></span></span></li>
<li><span style="font-size: small;"><span style="font-style: normal;"><span style="font-weight: normal;">Of 	course, Workday&#8217;s web services interface lets you export Workday 	data into 3rd-party tools. Indeed, if you want to integrate data 	from Workday and some other source(s), that&#8217;s your only choice.</span></span></span></li>
</ul>
</li>
<li><span style="font-size: small;"><span style="font-style: normal;"><span style="font-weight: normal;">Workday offers a clever metaphor to illustrate that your data may be more secure offsite than onsite – the bank vault. (I have no idea whether that&#8217;s a SaaS industry standard, but I hadn&#8217;t heard it before.) Of course, that metaphor does gloss over some issues specific to the remote data case, such as:</span></span></span>
<ul>
<li><span style="font-size: small;"><span style="font-style: normal;"><span style="font-weight: normal;">When 	your data is on premises, you know whether the government has 	insisted on looking at it.</span></span></span></li>
<li><span style="font-size: small;">More than cash, data keeps traveling back and forth to 	the remote location, which creates at least a theoretical risk of 	interception.</span></li>
</ul>
</li>
<li><span style="font-size: small;"><span style="font-style: normal;"><span style="font-weight: normal;">Workday says the toughest part of globalization is the issue of which personal data is or is not maintained. For example, in the US you&#8217;re not allowed to ask a job applicant&#8217;s religion, but in the UK you&#8217;re not only permitted but indeed required to.</span></span></span></li>
</ul>
<p><em><strong>This post is part of a three-post series</strong></em></p>
<ul>
<li><a href="http://www.dbms2.com/2010/08/22/workday-inc-company-overview/" >Workday Inc. company overview</a> (brief)</li>
<li><a href="http://www.dbms2.com/2010/08/22/workday-technology-stack/" >Workday Inc. technology overview</a> (detailed)</li>
<li>Workday Inc. CTO Stan Swete&#8217;s <a href="http://www.dbms2.com/2010/08/22/workday-stan-swete-database-architecture/" >comments on database strategy</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/08/22/workday-technology-stack/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Cloudera Enterprise and Hadoop evolution</title>
		<link>http://www.dbms2.com/2010/06/30/cloudera-enterprise-hadoop-evolution/</link>
		<comments>http://www.dbms2.com/2010/06/30/cloudera-enterprise-hadoop-evolution/#comments</comments>
		<pubDate>Wed, 30 Jun 2010 17:22:27 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Data integration and middleware]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Market share]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Pricing]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2440</guid>
		<description><![CDATA[I talked with Cloudera a couple of weeks ago in connection with the impending release of Cloudera Enterprise. I&#8217;d say:  

If you are or want to be a serious 	MapReduce user – and you&#8217;re past the “play around over the 	weekend” stage &#8212; you probably should have either:

A serious non-DBMS MapReduce 	distribution.
MapReduce integrated into your [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">I talked with Cloudera a couple of weeks ago in connection with the impending release of Cloudera Enterprise. I&#8217;d say:  <span id="more-2440"></span></p>
<ul>
<li>If you are or want to be a serious 	MapReduce user – and you&#8217;re past the “play around over the 	weekend” stage &#8212; you probably should have either:
<ul>
<li>A serious non-DBMS MapReduce 	distribution.</li>
<li>MapReduce integrated into your 	analytic DBMS.</li>
<li>Both.</li>
</ul>
</li>
<li>The obvious choice for non-DBMS 	MapReduce is Hadoop.</li>
<li>The obvious choice for a Hadoop 	distribution is <strong>Cloudera Enterprise.</strong></li>
<li>Cloudera Enterprise has three main 	aspects, in an inseparable bundle:
<ul>
<li>Distributions for a double-digit 	number of open source projects. It&#8217;s nice having all that in one 	package – unless, of course, you like playing with Tinkertoys.</li>
<li>Proprietary Cloudera code.</li>
<li>Cloudera support.</li>
</ul>
</li>
<li>Cloudera says its proprietary code is, and is planned to remain, concentrated (at least in large part) on integrating open source technology with closed source products. This has the virtue of being targeted directly at that segment of the market which has proven it&#8217;s actually willing to pay money for software.</li>
<li>Cloudera Enterprise areas of 	focus, now and in the presumed future, include:
<ul>
<li><strong>Core Hadoop engine,</strong> which Cloudera says is quite predictably and appropriately evolving more slowly than the tools around it.</li>
<li><strong>Development, management and 	administrative tools,</strong> including:
<ul>
<li><strong>Pig</strong> and <strong>Hive</strong>. Cloudera says &gt;70% 	of Facebook Hadoop jobs are initiated through Hive, and the same is 	true of Yahoo and Pig.</li>
<li>Connectivity to commercial tools.</li>
<li>The product formerly known as 	“Cloudera Desktop.”</li>
</ul>
</li>
<li><strong>Workflow</strong>, which in this context 	refers to letting you create a Hadoop application as a sequence of 	small steps, rather than forcing you to kluge it into being one 	unwieldy thing. At the moment, this is much less widely adopted than 	Pig and Hive, but Cloudera has high hopes for it, because of its 	obvious benefits in modularity and manageability.</li>
<li><strong>Quasi-DBMS technology.</strong> Besides Hive and Pig, this includes <strong>HBase.</strong> Cloudera says there has been considerable demand for HBase, and it is pleased that the project is now mature enough to ship. Cloudera stresses that it intends HBase not for OLTP, but as an adjunct to analytic processing. E.g., Cloudera suggests HBase would be a fine vehicle for replicating dimension tables across each node of a cluster.</li>
<li><strong>Data connectivity, </strong><span style="font-weight: normal;">e.g. 	to MySQL or to sensor log files.</span></li>
</ul>
</li>
<li>Cloudera Enterprise pricing is 	well below DBMS prices – not by a full order of magnitude, if I&#8217;m 	right about everybody&#8217;s quantity discount policies, but even so by a 	lot. Details are NDA.</li>
</ul>
<p style="margin-bottom: 0in;">Cloudera sometimes sends confusing signals about its beliefs and strategies. For example, one can get different stories depending on whether one talks to:</p>
<ul>
<li>Somebody at Cloudera who comes 	primarily from the user and open source communities.</li>
<li>Somebody at Cloudera who has 	actually worked at a software company before.</li>
</ul>
<p style="margin-bottom: 0in;">But I predict that Cloudera will now stick for a while with more or less the strategy outlined above.</p>
<p style="margin-bottom: 0in;">Naturally, we also talked about Hadoop adoption. Highlights of that part – no doubt somewhat biased towards Cloudera&#8217;s own customer base &#8212; included:</p>
<ul>
<li>Notwithstanding <a href="http://www.dbms2.com/2009/04/14/ebay-thinks-mpp-dbms-clobber-mapreduce/" >eBay&#8217;s prior 	skepticism about MapReduce</a>, it is quoted saying nice things in a Cloudera press release, 	and has apparently become quite a large Hadoop user, starting out 	with a search-quality use case.</li>
<li>Typical Hadoop deployment sizes 	are 10 nodes or so when experimenting, 80-500+ in production.</li>
<li>10 terabytes/node – I&#8217;m pretty 	sure Cloudera meant of user data &#8212; is not inconceivable, so a 	cost-conscious 500-node user could have 5 petabytes of data managed 	by Hadoop.</li>
<li>Cloudera has half a dozen 	customers at the 75+ node production level.</li>
<li>Web and financial services are the 	two vertical markets moving most aggressively into Hadoop 	production. The government is also in significant Hadoop production, 	but the details of that are classified.</li>
<li>Web uses for Hadoop include:
<ul>
<li>Clickstream – sessionization, 	etc. – that&#8217;s a super-mainstream use.</li>
<li>Search – analyzing search 	attempts in conjunction with structured data.</li>
<li>Machine learning (for ad serving, 	etc.).</li>
</ul>
</li>
<li>Financial services uses for Hadoop 	include:
<ul>
<li>Internal trading rule 	enforcement/fraud detection.</li>
<li>Complex ETL.</li>
<li>Portfolio risk assessment 	(typically overnight).</li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;">None of this is inconsistent with previous surveys of <a href="http://www.dbms2.com/2009/10/10/enterprises-using-hadoo/" >Hadoop use cases</a>.</p>
<p style="margin-bottom: 0in; font-style: normal;">Various users talked at the Hadoop Summit this week. I wasn&#8217;t there, and won&#8217;t write about their stories for now. That said, <a href="http://www.slideshare.net/kevinweil/hadoop-at-twitter-hadoop-summit-2010" onclick="javascript:pageTracker._trackPageview('/www.slideshare.net');">Twitter&#8217;s slide deck</a> from same has some interesting stuff, including:</p>
<ul>
<li><span style="font-style: normal;">7 	TB/day ETLed from MySQL.</span></li>
<li><span style="font-style: normal;">Petabytes of stored data are accordingly coming soon.</span></li>
<li><span style="font-style: normal;">Open 	sourcing their ETL tool Crane.</span></li>
<li><span style="font-style: normal;">3-4X 	LZO compression at little CPU cost.</span></li>
<li><span style="font-style: normal;">HBase is more usable for them than HDFS, which isn&#8217;t mutable enough.</span></li>
<li><span style="font-style: normal;">Pig takes 5% of the code and coding effort of vanilla Hadoop, at a performance hit of 30% or less.</span></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/06/30/cloudera-enterprise-hadoop-evolution/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Netezza&#8217;s version of EnterpriseDB-based Oracle compatibility</title>
		<link>http://www.dbms2.com/2010/06/26/netezza-migrator/</link>
		<comments>http://www.dbms2.com/2010/06/26/netezza-migrator/#comments</comments>
		<pubDate>Sat, 26 Jun 2010 12:17:11 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data integration and middleware]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Emulation, transparency, portability]]></category>
		<category><![CDATA[EnterpriseDB and Postgres Plus]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[Oracle]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2397</guid>
		<description><![CDATA[EnterpriseDB has some deplorable business practices (my stories of being screwed by EnterpriseDB have been met by &#8220;Well, you&#8217;re hardly the only one&#8221;). But a couple of more successful DBMS vendors have happily partnered with EnterpriseDB even so, to help pick off Oracle users. IBM&#8217;s approach was in the vein of an EnterpriseDB-infused version of [...]]]></description>
			<content:encoded><![CDATA[<p>EnterpriseDB has some deplorable business practices (my stories of being screwed by EnterpriseDB have been met by &#8220;Well, you&#8217;re hardly the only one&#8221;). But a couple of more successful DBMS vendors have happily partnered with EnterpriseDB even so, to help pick off Oracle users. IBM&#8217;s approach was in the vein of an <a href="http://www.dbms2.com/2009/04/24/ibms-oracle-emulation-strategy-reconsidered/" >EnterpriseDB</a>-<a href="http://www.dbms2.com/2009/04/22/dbms-transparency-layers-never-seem-to-sell-well/" >infused</a> <a href="http://www.dbms2.com/2010/04/07/ibm-anti-oracle-announcements/" >version</a> of SQL handling within DB2.* Netezza just announced an EnterpriseDB-based Netezza Migrator that is rather different.</p>
<p><em>*The comment threads are the most informative parts of those posts.</em></p>
<p>I&#8217;m a little unclear as to the Netezza Migrator details, not least because Netezza folks don&#8217;t seem to care too much about Netezza Migrator themselves. That said, the core ideas of Netezza Migrator are:  <span id="more-2397"></span></p>
<ul>
<li>Netezza Migrator is an enhanced (?) version of EnterpriseDB&#8217;s Postgres Plus Advanced Server DBMS. (Recall that Postgres Plus is PostgreSQL-based and fairly <a href="http://www.dbms2.com/2008/07/07/enterprisedbf-oracle-compatibility/" >Oracle-compatible</a>.)</li>
<li>Netezza Migrator does not run on Netezza appliances, but rather on conventional computers off to the side.</li>
<li>Netezza Migrator generally farms out queries to Netezza appliances, but can also manage data itself. (That latter part could supposedly come in handy for small tables one might want to execute stored procedures against.)</li>
<li>Netezza Migrator does a better job of farming out queries (and also inserts/updates/loads) to Netezza appliances than an Oracle DBMS would. The two biggest examples of that are:
<ul>
<li>Oracle will farm out SELECTs, but not JOINs.</li>
<li>Oracle won&#8217;t invoke Netezza&#8217;s parallel/bulk load capabilities.</li>
</ul>
</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/06/26/netezza-migrator/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
		</item>
		<item>
		<title>Flash is coming, well &#8230;</title>
		<link>http://www.dbms2.com/2010/06/25/flash-is-coming-well/</link>
		<comments>http://www.dbms2.com/2010/06/25/flash-is-coming-well/#comments</comments>
		<pubDate>Fri, 25 Jun 2010 16:42:26 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data integration and middleware]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[Memory-centric data management]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2389</guid>
		<description><![CDATA[I really, really wanted to title this post &#8220;Flash is coming in a flash.&#8221; That seems a little exaggerated &#8212; but only a little.

Netezza now intends to come out with a Flash-based appliance earlier than it originally expected.
Indeed, Netezza has suspended &#8212; by which I mean &#8220;scrapped&#8221; &#8212; prior plans for a RAM-heavy disk-based appliance. [...]]]></description>
			<content:encoded><![CDATA[<p>I really, really wanted to title this post &#8220;Flash is coming in a flash.&#8221; That seems a little exaggerated &#8212; but only a little.</p>
<ul>
<li>Netezza now intends to come out with a Flash-based appliance earlier than it originally expected.</li>
<li>Indeed, Netezza has suspended &#8212; by which I mean &#8220;scrapped&#8221; &#8212; prior plans for a RAM-heavy disk-based appliance. It will use a RAM/Flash combo instead.*</li>
<li>Tim Vincent of IBM told me that customers seem ready to adopt solid-state memory. One interesting comment he made is that Flash isn&#8217;t really all that much more expensive than high-end storage area networks.</li>
</ul>
<p>Uptake of solid-state memory (i.e. Flash) for analytic database processing will probably stay pretty low in 2010, but in 2011 it should be a notable (b)leading-edge technology, and it should get mainstreamed pretty quickly after that.  <span id="more-2389"></span></p>
<p><em>*So far as I can tell, that&#8217;s one of the two significant roadmap changes between the 2009 and 2010 editions of <a href="http://www.dbms2.com/2010/06/23/my-talk-this-morning/" >Enzee Universe</a>. The other one is that the robust form of appliance-to-appliance replication technology is coming out later than Netezza had originally planned and hoped.</em></p>
<p>There also is increasing reason to think that the issues with Flash memory wearing out are overwrought. (And by the way, the entire history of enterprise solid-state memory use is basically shorter than the time in which these products supposedly will wear out, so it&#8217;s not as if there have been a lot of real-life failures out there.)</p>
<ul>
<li>First, clever things are being done in the area of error correction codes, although for the most part I defer that part of the discussion to Petascan&#8217;s Camuel Gilyadov. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  E.g., this seems to be the idea behind Anobit.</li>
<li>Second, analytic DBMS are pretty much an ideal use case for Flash reliability. Suppose, as is the case for many products and implementations, you only write things in big blocks. Then you are, ipso facto, resetting the Flash bits only in big blocks. Thus, at least in theory, you automatically have pretty perfect wear leveling.</li>
</ul>
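<p>To make the wear-leveling point concrete, here is a minimal Python sketch. It is a toy model, not a description of any real Flash controller: the block count, the write counts, the 90% hot-spot figure, and the function names are all assumptions chosen for illustration. It compares per-block erase counts when whole blocks are rewritten in rotation with what happens when small in-place updates keep landing on a few hot blocks and nothing remaps them.</p>

```python
import random
from collections import Counter

NUM_BLOCKS = 64  # assumed number of erase blocks in the toy device


def big_block_writes(total_writes):
    """Rewrite whole blocks in rotation, as an append-style analytic load might."""
    erases = Counter()
    for i in range(total_writes):
        erases[i % NUM_BLOCKS] += 1  # each write erases exactly one block, in turn
    return erases


def hot_spot_writes(total_writes, hot_blocks=4, seed=0):
    """Small in-place updates with no remapping: about 90% hit a few hot blocks."""
    rng = random.Random(seed)
    erases = Counter()
    for _ in range(total_writes):
        if rng.random() < 0.9:
            block = rng.randrange(hot_blocks)
        else:
            block = rng.randrange(NUM_BLOCKS)
        erases[block] += 1
    return erases


def wear_spread(erases):
    """Return (max, min) erase counts over all blocks; untouched blocks count as 0."""
    counts = [erases.get(b, 0) for b in range(NUM_BLOCKS)]
    return max(counts), min(counts)


if __name__ == "__main__":
    print("big blocks:", wear_spread(big_block_writes(6400)))
    print("hot spots: ", wear_spread(hot_spot_writes(6400)))
```

<p>In the rotating case every block ends up with the same erase count, which is the &#8220;pretty perfect wear leveling&#8221; of the argument above. In the hot-spot case a handful of blocks absorb most of the erases, which is the situation real wear-leveling logic exists to prevent.</p>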
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/06/25/flash-is-coming-well/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>VoltDB finally launches</title>
		<link>http://www.dbms2.com/2010/05/25/voltdb-finally-launches/</link>
		<comments>http://www.dbms2.com/2010/05/25/voltdb-finally-launches/#comments</comments>
		<pubDate>Tue, 25 May 2010 07:15:04 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Games and virtual worlds]]></category>
		<category><![CDATA[In-memory DBMS]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Michael Stonebraker]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Theory and architecture]]></category>
		<category><![CDATA[VoltDB and H-Store]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2201</guid>
		<description><![CDATA[VoltDB is finally launching today. As is common for companies in sectors I write about, VoltDB &#8212; or just &#8220;Volt&#8221; &#8212; has discovered the virtues of embargoes that end 12:01 am. Let&#8217;s go straight to the technical highlights:

VoltDB is based on the H-Store technology, which I wrote about in February 2008. Most of what I [...]]]></description>
			<content:encoded><![CDATA[<p>VoltDB is finally launching today. As is common for companies in sectors I write about, VoltDB &#8212; or just &#8220;Volt&#8221; &#8212; has discovered the virtues of embargoes that end 12:01 am. Let&#8217;s go straight to the technical highlights:</p>
<ul>
<li>VoltDB is based on the <a href="http://www.dbms2.com/2008/02/19/h-store-architecture/" >H-Store</a> technology, which I wrote about in February 2008. Most of what I said about H-Store then applies to VoltDB today.</li>
<li>VoltDB is a no-apologies ACID relational DBMS, which runs entirely in RAM.</li>
<li>VoltDB has rather limited SQL. (One example: VoltDB can&#8217;t do SUMs in SQL.) However, VoltDB guy Tim Callaghan (Mark Callaghan&#8217;s lesser-known but nonetheless smart brother) asserts that if you code up the missing functionality, it&#8217;s almost as fast as if it were present in the DBMS to begin with, because there&#8217;s no added I/O from the handoff between the DBMS and the procedural code. (The data&#8217;s in RAM one way or the other.)</li>
<li>VoltDB&#8217;s Big Conceptual Performance Story is that it does away with most locks, latches, logs, etc., and also most context switching.</li>
<li>In particular, you&#8217;re supposed to partition your data and architect your application so that most transactions execute on a single core. When you can do that, you get VoltDB&#8217;s performance benefits. To the extent you can&#8217;t, you&#8217;re in two-phase-commit performance land. (More precisely, you&#8217;re doing 2PC for multi-core writes, which is surely a major reason that multi-core reads are a lot faster in VoltDB than multi-core writes.)</li>
<li>VoltDB has a little less than one DBMS thread per core. When the data partitioning works as it should, you execute a complete transaction in that single thread. Poof. No context switching.</li>
<li>A transaction in VoltDB is a Java stored procedure. (The early idea of Ruby on Rails in lieu of the Java/SQL combo didn&#8217;t hold up performance-wise.)</li>
<li>Solid-state memory is not a viable alternative to RAM for VoltDB. Too slow.</li>
<li>Instead, VoltDB lets you snapshot data to disk at tunable intervals. &#8220;Continuous&#8221; is one of the options, wherein a new snapshot starts being made as soon as the last one completes.</li>
<li>In addition, VoltDB will spool a kind of transaction log to the target of your choice. (Obvious choice: An analytic DBMS such as Vertica, but there&#8217;s no such connectivity partnership actually in place at this time.)</li>
</ul>
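<p>The single-partition idea in the list above can be sketched in a few lines of Python. This illustrates the general technique only and is not VoltDB&#8217;s API: the partition count and the names <code>partition_for</code> and <code>plan</code> are hypothetical, and real VoltDB transactions are Java stored procedures.</p>

```python
import zlib

NUM_PARTITIONS = 8  # assumption: roughly one partition (and one thread) per core


def partition_for(key):
    """Map a partitioning key to a partition with a stable hash."""
    return zlib.crc32(str(key).encode("utf-8")) % NUM_PARTITIONS


def plan(keys_touched):
    """Classify a transaction by the set of partitions its keys live on."""
    partitions = sorted({partition_for(k) for k in keys_touched})
    if len(partitions) == 1:
        # Fast path: the whole transaction runs serially in one thread,
        # so it needs no locks, latches, or cross-thread coordination.
        return "single-partition", partitions
    # Slow path: every involved partition must agree to commit (2PC-style).
    return "multi-partition", partitions


if __name__ == "__main__":
    # Two rows for the same customer share a partition.
    print(plan(["customer:42", "customer:42"]))
    # A transaction touching many customers almost surely spans partitions.
    print(plan(["customer:%d" % i for i in range(100)]))
```

<p>The point of the sketch is that the choice of partitioning key, rather than anything clever in the DBMS, decides which path a transaction takes. A workload whose transactions routinely touch keys on different partitions spends most of its time on the slow path.</p>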
<p><span id="more-2201"></span>I should also note that when Tim Callaghan described architectural options to get around 2PC performance issues, they sounded a lot like eventual consistency. Maybe tunable <a href="http://www.dbms2.com/2010/05/01/ryw-read-your-writes-consistency/" >RYW consistency</a> isn&#8217;t in the cards, but at least there&#8217;s a NoSQL-like possibility with VoltDB.</p>
<p>VoltDB&#8217;s open source strategy is:</p>
<ul>
<li>VoltDB will be open sourced.</li>
<li>Community VoltDB will be GPLed. Professional Edition VoltDB has a non-GPL license.</li>
<li>The VoltDB Professional Edition won&#8217;t start out with features beyond the Community Edition ones, but will gain such later on. I didn&#8217;t get the sense the plans for those features were completely baked yet, but ideas mentioned included:
<ul>
<li>Management/monitoring tools.</li>
<li>Integration with expensive closed-source enterprise software products, such as ones in the management/monitoring area.</li>
<li>Yet more &#8220;extreme&#8221;/edge-case performance.</li>
</ul>
</li>
<li>Before VoltDB decided for sure that it wasn&#8217;t selling licenses, it sold a license to Getco, which also seems to be an investor in the company.</li>
</ul>
<p>VoltDB had a beta test with about 150 participants. None is in production yet, although at least a few are clearly headed there. Most VoltDB beta testers are in some kind of online business, with a particular concentration in everybody&#8217;s new favorite market, online gaming. Most of the rest are in investment/trading &#8212; a major target market for at least three different Mike Stonebraker companies &#8212; and a few are in telecom. VoltDB assures me that some of the beta users are companies one actually has heard of before, but VoltDB is not in a position to name any of those.</p>
<p>VoltDB is not ideally suited for a classic order management system, since you&#8217;d want to partition both on CustomerID and SKU, the latter because you&#8217;d constantly be updating inventory stock levels. However, this argument doesn&#8217;t apply in the case of virtual goods. Virtual goods that are sold for real money &#8212; and hence need ACID levels of transaction integrity &#8212; are thus a clear target market for VoltDB. (The example that came up was in, you guessed it, online gaming.) The other interesting use case that Tim highlighted was low-latency analytics/ELT. For reasons I didn&#8217;t totally grasp, Tim likes to call this &#8220;Stateful ELT.&#8221; (Given that the data goes into the VoltDB database before much else happens to it, I&#8217;m pretty sure I heard &#8220;ELT&#8221; correctly. But I guess I might have been mishearing &#8220;ETL&#8221;.)</p>
<p>VoltDB company highlights include:</p>
<ul>
<li>VoltDB has about a dozen employees, all but two of whom are technical. (However, I&#8217;m not sure they&#8217;re counting Andy Ellicott against the two. But then, last I heard he wasn&#8217;t full time at VoltDB.)</li>
<li>VoltDB&#8217;s venture funding status is, if I may paraphrase, &#8220;Mumble mumble.&#8221;</li>
<li>Although long separate from Vertica, VoltDB is still located in Vertica&#8217;s offices.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/05/25/voltdb-finally-launches/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>8 not very technical problems with analytic technology</title>
		<link>http://www.dbms2.com/2010/05/08/8-not-very-technical-problems-with-analytic-technology/</link>
		<comments>http://www.dbms2.com/2010/05/08/8-not-very-technical-problems-with-analytic-technology/#comments</comments>
		<pubDate>Sat, 08 May 2010 12:30:35 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Data integration and middleware]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Liberty and privacy]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2084</guid>
		<description><![CDATA[In a couple of talks, including last Thursday&#8217;s, I&#8217;ve rattled off a list of eight serious problems with analytic technology, all of them human or organizational much more than purely technical. At best, these problems stand in the way of analytic success, and at least one is a lot worse than that.
The bulleted list in [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">In a couple of talks, including <a href="http://www.dbms2.com/2010/05/07/implications-onew-analytic-technology/" >last Thursday&#8217;s</a>, I&#8217;ve rattled off a list of eight serious problems with analytic technology, all of them human or organizational much more than purely technical. At best, these problems stand in the way of analytic success, and at least one is a lot worse than that.</p>
<p style="margin-bottom: 0in;">The bulleted list in my notes is:</p>
<ul>
<li>
<p style="margin-bottom: 0in;">Individual-human</p>
<ul>
<li>
<p style="margin-bottom: 0in;">Expense of expertise</p>
</li>
<li>
<p style="margin-bottom: 0in;">Limited numeracy</p>
</li>
</ul>
</li>
<li>
<p style="margin-bottom: 0in;">Organizational</p>
<ul>
<li>
<p style="margin-bottom: 0in;">Limited budgets</p>
</li>
<li>
<p style="margin-bottom: 0in;">Legacy systems</p>
</li>
<li>
<p style="margin-bottom: 0in;">General inertia</p>
</li>
</ul>
</li>
<li>
<p style="margin-bottom: 0in;">Political</p>
<ul>
<li>
<p style="margin-bottom: 0in;">Obsolete systems</p>
</li>
<li>
<p style="margin-bottom: 0in;">Clueless lawmakers</p>
</li>
<li>
<p style="margin-bottom: 0in;">Obsolete legal framework</p>
</li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;">I shall explain.<span id="more-2084"></span></p>
<p style="margin-bottom: 0in;"><strong>The expense of expertise.</strong> <a href="http://www.dbms2.com/2009/11/23/boston-big-data-summit-keynote-outline/" >Highly skilled Oracle DBAs are expensive</a>. The same can be said for many other categories of people, whether in IT or business units, needed to exploit the opportunities of analytic technology. Newer, simpler approaches to analytic database management can clearly help. The verdict is more mixed so far on newer business intelligence or data mining technologies.</p>
<p style="margin-bottom: 0in;"><strong>Limited numeracy. </strong>If you&#8217;re reading this, you&#8217;re very likely more numerate (i.e., more capable with numbers, mathematics, and analysis) than the average person, or indeed than the average knowledge worker. A typical knowledge worker can understand an analytic claim, on some level – but can he think critically about it? Can he make valid analytic arguments of his own? That&#8217;s less clear. Maybe it&#8217;s possible to build a new company in which analytic competence is a prerequisite for employment. (Google comes to mind as one candidate.) But uniform analytic ability is almost inconceivable at an established enterprise. That&#8217;s one of the biggest reasons <a href="http://www.dbms2.com/2009/01/10/some-reasons-business-intelligence-is-in-a-funk/" >converting an ongoing concern into a top-to-bottom “analytic enterprise” is a lot harder</a> than business intelligence gurus sometimes suggest.</p>
<p style="margin-bottom: 0in;">Actually, I do think practical numeracy goes up each decade. Go, for example, to a message board discussing sports, and the chances are high you&#8217;ll pretty soon see concepts like “small sample size” used perfectly correctly. That would have been much less likely 30 years ago, even aside from the fact that in those days no such things as “message boards” happened to exist. I credit this favorable trend to a multitude of factors, from the availability of BI and related technologies, to the promotion of probability and statistics at multiple levels of at least the United States educational curriculum. If nothing else – when I was in school, it was very rare for a girl to be serious about learning calculus or any kind of advanced mathematics.* Thankfully, that seems to have changed.</p>
<p style="margin-bottom: 0in;"><em>*I do recall my main college girlfriend cursing her way through a physical chemistry course, which implies some working knowledge of partial differential equations. But she was pretty unusual for her day.</em></p>
<p style="margin-bottom: 0in;">Even so, that&#8217;s not enough. Get me in a discussion about politics or charity, and I&#8217;ll argue that few things are more important than boosting numeracy through the education system. But that&#8217;s hardly an answer in a business-today time frame. If your enterprise is trying to deploy analytics universally across the organization today – well, that&#8217;s a very hard challenge.</p>
<p style="margin-bottom: 0in;"><strong>Limited budgets.</strong> For every benefit there is a cost. And when it comes to analytics, the costs (at least some of them) can be a lot easier to quantify than the benefits. This can make it hard to get the budget to do proper analytics.</p>
<p style="margin-bottom: 0in;"><strong>Legacy systems.</strong> In many cases, the best and most cost-effective analytic products are fairly new ones. But companies tend to have the older and costlier ones already in place. Yes, I&#8217;m seeing a reasonable number of “escape from Oracle” or even “escape from DB2” projects. But a lot more enterprises just try to work within what they have.</p>
<p style="margin-bottom: 0in;">And that&#8217;s even before we start to talk about stovepipes, silos, or data integration. It&#8217;s also before we consider the difficulty of modernizing OLTP systems to either gather more data for analysis or incorporate more analytics into operational business processes.</p>
<p style="margin-bottom: 0in;"><strong>General inertia.</strong> Budget limitations and legacy limitations can both be excuses for doing nothing. But sometimes nothing gets done even without the aid of such excuses.</p>
<p style="margin-bottom: 0in;"><strong>Obsolete systems.</strong> In the particular case of government, the legacy systems problem can be ridiculously bad. In large part, this is due to <a href="http://www.networkworld.com/community/node/34946" onclick="javascript:pageTracker._trackPageview('/www.networkworld.com');">broken procurement processes</a>. Computer systems are bundled up into specific government contracts, which are then bid and awarded in a long process, whose results can be challenged, making the process even longer. When the contract is finally awarded, a long implementation cycle begins. Systems are years out of date before they are put in. By the time they are replaced, they are a lot more ancient than that. And, as big as they are, they are isolated projects, with little cross-government standardization, and with little effort put into facilitating data integration across systems.</p>
<p style="margin-bottom: 0in;">The consequences are horrific. Better collaboration tools might have averted the 9/11 attacks. Weak analytics allow all kinds of fraud to continue. And considering just the costs of data integration projects, untold billions of dollars are directly spent due to government computer failings.</p>
<p style="margin-bottom: 0in;"><strong>Obsolete legal framework.</strong> The procurement process is just one major example of obsolete technology-related laws. The <a href="http://www.dbms2.com/2010/04/04/privacy-liberty-continued/" >liberty and privacy</a> problem is even more profound. Intellectual property rights and censorship,* of course, are other major areas of nonsense, even in countries that are basically pretty free. Part of the problem is that legal structures for a less advanced world are sometimes ill-suited for the power of the modern technology.</p>
<p style="margin-bottom: 0in;"><em>*Of course, those areas are not particularly tied to analytics. But I thought I&#8217;d throw them in anyway while I was on a roll.</em></p>
<p style="margin-bottom: 0in;"><strong>Technology-challenged lawmakers.</strong> These issues are hard, and even technologically savvy lawmakers would struggle with them. At least, I think so; to my knowledge, that hypothesis has never come close to being tested. What I do know is that when laws meet technology, nonsense commonly ensues.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/05/08/8-not-very-technical-problems-with-analytic-technology/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>IBM puts Cast Iron Systems out of its misery</title>
		<link>http://www.dbms2.com/2010/05/03/ibm-puts-cast-iron-systems-out-of-its-misery/</link>
		<comments>http://www.dbms2.com/2010/05/03/ibm-puts-cast-iron-systems-out-of-its-misery/#comments</comments>
		<pubDate>Mon, 03 May 2010 16:02:28 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cast Iron Systems]]></category>
		<category><![CDATA[Data integration and middleware]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[IBM and DB2]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=2024</guid>
		<description><![CDATA[Long ago, the first enterprise application integration (EAI) vendors offered pairwise integrations between different specific packaged applications. That was, for example, what was going on at Katrina Garnett&#8217;s Crossworlds/Crossroads, which eventually became one of IBM&#8217;s first data integration software acquisitions. Years later, Cast Iron Systems tried what seemed to be pretty much the same thing, [...]]]></description>
			<content:encoded><![CDATA[<p>Long ago, the first enterprise application integration (EAI) vendors offered pairwise integrations between different specific packaged applications. That was, for example, what was going on at Katrina Garnett&#8217;s Crossworlds/Crossroads, which eventually became one of IBM&#8217;s first data integration software acquisitions. Years later, Cast Iron Systems tried what seemed to be pretty much the same thing, only <a href="http://www.dbms2.com/2007/04/26/more-on-cast-iron-systems/" >better implemented</a>. Recently, however, Cast Iron has been pretty hard to get a hold of, and I also couldn&#8217;t find anybody (competitor, friend of management, whatever) who believed Cast Iron was doing particularly well. So today&#8217;s news that <strong>IBM is acquiring Cast Iron Systems</strong> comes as no big surprise.</p>
<p><span id="more-2024"></span>Cast Iron sold an integration appliance, most focused on <a href="http://www.dbms2.com/2008/03/21/cast-iron-systems-focuses-on-saas-data-integration/" >integrations that involved SaaS applications such as Salesforce</a>, with an option for doing all this purely in the <a href="http://www.dbms2.com/2008/10/09/cloud-data-integration/" >cloud</a>. IBM is accordingly spinning Cast Iron as a major cloud player, which is something of an exaggeration.</p>
<p>IBM will surely get value from whatever specific connectors Cast Iron does a better job at than IBM&#8217;s current offerings do. What I&#8217;m more curious about is whether Cast Iron&#8217;s core technology will survive in a form that continues its core message of &#8220;simplicity, simplicity, simplicity.&#8221;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/05/03/ibm-puts-cast-iron-systems-out-of-its-misery/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ITA Software and Needlebase</title>
		<link>http://www.dbms2.com/2010/04/21/ita-software-needlebase-google/</link>
		<comments>http://www.dbms2.com/2010/04/21/ita-software-needlebase-google/#comments</comments>
		<pubDate>Wed, 21 Apr 2010 16:54:56 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data integration and middleware]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Oracle]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1949</guid>
		<description><![CDATA[Rumors are flying that Google may acquire ITA Software. I know nothing of their validity, but I have known about ITA Software for a while. Random notes include:

ITA Software builds huge OLTP systems that it runs itself on behalf of airlines.
Very, very unusually, ITA Software builds these huge OLTP systems in LISP.
ITA Software is an [...]]]></description>
			<content:encoded><![CDATA[<p>Rumors are flying that <a href="http://www.bloomberg.com/apps/news?pid=newsarchive&amp;sid=aJXdCOdgJmw4" onclick="javascript:pageTracker._trackPageview('/www.bloomberg.com');">Google may acquire ITA Software</a>. I know nothing of their validity, but I have known about ITA Software for a while. Random notes include:</p>
<ul>
<li>ITA Software builds huge OLTP systems that it runs itself on behalf of airlines.</li>
<li>Very, very unusually, ITA Software builds these <a href="http://www.networkworld.com/community/node/29552" onclick="javascript:pageTracker._trackPageview('/www.networkworld.com');">huge OLTP systems in LISP</a>.</li>
<li><a href="http://www.dbms2.com/2008/01/24/mysql-database/" >ITA Software is an Oracle shop</a> (see Dan Weinreb&#8217;s comment).</li>
<li><a href="http://www.dbms2.com/2008/01/31/ellen-rubin-is-leaving-netezza/" >ITA Software is run by a techie</a> (again, see Dan Weinreb&#8217;s comment).</li>
<li>ITA Software has an interesting screen-scraping/web ETL project called Needlebase.</li>
</ul>
<p>ITA&#8217;s software does both price/reservation lookup/checking and reservation-making. I&#8217;ve had trouble keeping it straight, but I think the lookup is ITA&#8217;s actual business, and the reservation-making is ITA&#8217;s Next Big Thing. This is one of the ultimate federated-transaction-processing applications, because it involves coordinating huge OLTP systems run, in some cases, by companies that are bitter competitors with each other. Network latencies have to allow for intercontinental travel of the data itself.</p>
<p><em>Indeed, airline reservation systems are pretty much the OLTP ultimate in themselves. As the story goes, transaction monitors were pretty much invented for airline reservation systems in the 1960s.</em></p>
<p>A really small project for ITA Software is Needlebase. I stopped by ITA to look at Needlebase in January, and it is a very smart and hence interesting screen-scraping system. The idea is that people publish database information to the web, and you may want to look at their web pages and recover the database records they are based on. Applications of this to the airline industry, which has hundreds of thousands of price changes per day &#8212; and I may be too low by one or two orders of magnitude when I say that &#8212; should be fairly obvious. ITA Software has aspirations of applying Needlebase to other sectors as well, or more precisely having users who do so. Last I looked, ITA hadn&#8217;t put significant resources behind stimulating Needlebase adoption &#8212; but Google might well change that.</p>
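<p>To illustrate the basic idea of recovering records from a published page, here is a toy Python sketch that uses only the standard library. It says nothing about how Needlebase actually works; the page fragment, the fares, and the names <code>TableRows</code> and <code>scrape_records</code> are all invented for the example.</p>

```python
from html.parser import HTMLParser

# Hypothetical page fragment; the fares are made up for illustration.
PAGE = """
<table>
  <tr><th>Origin</th><th>Destination</th><th>Fare</th></tr>
  <tr><td>BOS</td><td>SFO</td><td>319</td></tr>
  <tr><td>BOS</td><td>LHR</td><td>642</td></tr>
</table>
"""


class TableRows(HTMLParser):
    """Collect the text of each table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None  # cells of the row being read, or None outside a row
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def scrape_records(html):
    """Treat the first row as field names and the remaining rows as records."""
    parser = TableRows()
    parser.feed(html)
    header, body = parser.rows[0], parser.rows[1:]
    return [dict(zip(header, row)) for row in body]


if __name__ == "__main__":
    for record in scrape_records(PAGE):
        print(record)
```

<p>A clean table like this one is the easy case. The hard part of real screen-scraping is that pages are rarely this regular, which is presumably where the &#8220;very smart&#8221; part comes in.</p>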
<p><em>Edit: I just re-found <a href="http://danweinreb.org/blog/the-failure-of-lisp-a-reply-to-brandon-werner" onclick="javascript:pageTracker._trackPageview('/danweinreb.org');">an old characterization of (some of) what ITA Software does</a> by &#8212; who else? &#8212; Dan Weinreb:</em></p>
<blockquote><p>I am working on our new product, an airline reservation system.  It’s an online transaction-processing system that must be up 99.99% of the time, maintaining maximum response time (e.g. on www.aircanada.com).  It’s a very, very complicated system.  The presentation layer is written in Java using conventional techniques.  The business rule layer is written in Common Lisp; about 500,000 lines of code (plus another 100,000 or so of open source libraries).  The database layer is Oracle RAC.  We operate our own data centers, some here in Massachusetts and a disaster-recovery site in Canada (separate power grid).</p></blockquote>
<p><em><strong>Related links</strong></em></p>
<ul>
<li><a href="http://www.itasoftware.com" onclick="javascript:pageTracker._trackPageview('/www.itasoftware.com');">ITA Software</a> and <a href="http://www.needlebase.com" onclick="javascript:pageTracker._trackPageview('/www.needlebase.com');">Needlebase</a> websites</li>
<li><a href="http://www.dbms2.com/2008/03/07/lisp-humor/" >More about LISP</a> <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/04/21/ita-software-needlebase-google/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Introduction to Datameer</title>
		<link>http://www.dbms2.com/2010/04/16/introduction-to-datameer/</link>
		<comments>http://www.dbms2.com/2010/04/16/introduction-to-datameer/#comments</comments>
		<pubDate>Sat, 17 Apr 2010 03:50:43 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Datameer]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1909</guid>
		<description><![CDATA[Elder care issues have flared up with a vengeance, so I&#8217;m not going to be blogging much for a while, and surely not at any length. That said, my first post about Datameer was never going to be very long, so let&#8217;s get right to it:

Datameer offers a business intelligence and analytics stack that runs [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Elder care issues have flared up with a vengeance, so I&#8217;m not going to be blogging much for a while, and surely not at any length. That said, my first post about Datameer was never going to be very long, so let&#8217;s get right to it:</p>
<ul>
<li>Datameer offers a business intelligence and analytics stack that runs on any distribution of Hadoop.</li>
<li>Datameer is still building a lot of features that it talks about, for target release in (I think) the fall.</li>
<li>Datameer&#8217;s pride and joy is its user interface. Very laudably for a software start-up, Datameer claims to have spent considerable time with professional user interface designers.</li>
<li>Datameer&#8217;s core user interface metaphor is formula definition via a spreadsheet.</li>
<li>Datameer includes 124 functions one can use in these formulae, ranging from math stuff to text tokenization.</li>
<li>Datameer does some straight BI, with 4 kinds of “visualization” headed for 20 kinds later. But if you want to do hard-core BI, use Datameer to dump data into an RDBMS and then use the BI tool of your choice. (Datameer&#8217;s messaging does tend to obscure or even contradict that point.)</li>
<li>Rather, Datameer seems to be designed for the classic <a href="http://www.dbms2.com/2008/08/26/known-applications-of-mapreduce/" >MapReduce use cases</a> of ETL and heavy data crunching.</li>
<li>Datameer&#8217;s messaging includes a bit about “Datameer is real-time, even though <a href="http://www.dbms2.com/2009/10/10/enterprises-using-hadoo/" >Hadoop is generally thought of as batch</a>.” So far as I can tell, what that boils down to is …</li>
<li>… Datameer will let you examine sample and/or partial query results before a full Hadoop run is over. Apparently, there are three different ways Datameer lets you do this:
<ul>
<li>You can truly query against a sample of the data set.</li>
<li>You can query against intermediate results, when only some stages of the Hadoop process have already been run.</li>
<li>You can drill down into a “distributed index,” whatever the heck that means when Datameer says it.</li>
</ul>
</li>
<li>Datameer will let you import data from 15 or so different kinds of sources, SQL, NoSQL, and file system alike.</li>
</ul>
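<p>The first of those three options, querying a sample, can be sketched as follows. This is generic sampling in plain Python, not Datameer&#8217;s or Hadoop&#8217;s API; the record layout, the 1% sampling rate, and the function names are assumptions made for the example.</p>

```python
import random


def make_records(n, seed=1):
    """Generate hypothetical (user_id, amount) records standing in for a big data set."""
    rng = random.Random(seed)
    return [(i, rng.uniform(0.0, 100.0)) for i in range(n)]


def sample_records(records, rate, seed=2):
    """Keep each record independently with probability `rate` (Bernoulli sampling)."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < rate]


def estimate_total(sample, rate):
    """Scale the sample's sum of amounts up to an estimate for the full data set."""
    return sum(amount for _, amount in sample) / rate


if __name__ == "__main__":
    records = make_records(200_000)
    sample = sample_records(records, rate=0.01)
    exact = sum(amount for _, amount in records)
    approx = estimate_total(sample, rate=0.01)
    print("sample size:", len(sample))
    print("relative error: %.3f" % (abs(approx - exact) / exact))
```

<p>The trade-off is the usual one: the sample answers quickly and roughly, while the full run answers slowly and exactly. Whether a rough early answer counts as &#8220;real-time&#8221; is a messaging question rather than a technical one.</p>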
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/04/16/introduction-to-datameer/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Greenplum Chorus and Greenplum 4.0</title>
		<link>http://www.dbms2.com/2010/04/12/greenplumchorus/</link>
		<comments>http://www.dbms2.com/2010/04/12/greenplumchorus/#comments</comments>
		<pubDate>Mon, 12 Apr 2010 11:54:39 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Benchmarks and POCs]]></category>
		<category><![CDATA[Data integration and middleware]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Market share]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1887</guid>
		<description><![CDATA[Greenplum is making two product announcements this morning. Greenplum 4.0 is a revision of the core Greenplum database technology. In addition, Greenplum is announcing Greenplum Chorus, which is the first product release instantiating last year&#8217;s EDC (Enterprise Data Cloud) vision statement and marketing campaign.
Greenplum 4.0 highlights and related observations include:

For the most part, Greenplum 	4.0 [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">Greenplum is making two product announcements this morning. Greenplum 4.0 is a revision of the core Greenplum database technology. In addition, Greenplum is announcing Greenplum Chorus, which is the first product release instantiating last year&#8217;s <a href="http://www.dbms2.com/2009/06/08/the-future-of-data-marts/" >EDC (Enterprise Data Cloud) vision statement and marketing campaign</a>.</p>
<p style="margin-bottom: 0in;">Greenplum 4.0 highlights and related observations include:<span id="more-1887"></span></p>
<ul>
<li>For the most part, <strong>Greenplum 	4.0 is focused on general robustness catch-up and </strong><a href="http://www.dbms2.com/2009/08/21/bottleneck-whack-a-mole/" ><strong>Bottleneck Whack-A-Mole</strong></a><strong>,</strong><span style="font-weight: normal;"> much 	like the latest rel</span>eases from fellow analytic DBMS vendors 	<a href="http://www.dbms2.com/2010/02/22/data-warehouse-dbms-news-roundup/" >Vertica and Aster Data</a>.</li>
<li>Greenplum has switched its 	replication approach from logical (execute transactions against two 	different disks) to block-level (just ship over the blocks that were 	changed by the original transaction). This seems to increase a 	Greenplum database&#8217;s robustness/performance/uptime in the face of 	disk/node failure. It also provides Greenplum with an ongoing 	performance advantage in that data only has to be compressed once in 	total for both disk writes.</li>
<li>The Greenplum DBMS now has 	something called “tablespaces,” which sounds as if it extends 	<a href="http://www.dbms2.com/2009/10/14/greenplum-hybrid-columnar/" >Greenplum&#8217;s “polymorphic storage”</a> to accommodate different kinds 	of storage device. Everybody has to do this, and for the most part is 	doing it, e.g. <a href="http://www.dbms2.com/2008/10/14/teradata-virtual-storage/" >Teradata</a><span style="font-style: normal;"> and </span><a href="http://www.dbms2.com/2009/08/25/sybase-iq-technical-highlights/" >Sybase</a>. At least for now, you need to have the 	same mix of storage technology at every Greenplum node. That said, 	while Greenplum&#8217;s customers will surely want solid-state storage in 	the future, that&#8217;s not yet a major issue.</li>
<li>The timetable on Greenplum 4.0 is 	a salami-thin-slicer&#8217;s delight:
<ul>
<li>Greenplum 4.0 has been used in 	POCs (Proofs of Concept) for a while.</li>
<li>Greenplum 4.0 has been in early 	access for a few weeks.</li>
<li>Greenplum 4.0 controlled 	availability is planned for the end of April.</li>
<li>Greenplum 4.0 general availability 	is planned around the end of May or early June.</li>
<li>(Note: Everything in Greenplum 4.0 	has been built, and is undergoing QA).</li>
</ul>
</li>
<li>Greenplum has put together a nice 	list of big-name customers, including <a href="http://www.dbms2.com/2009/03/05/fox-interactive-medias-multi-hundred-terabyte-database-running-on-greenplum/" >Fox/MySpace</a>, <a href="http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/" >eBay</a>, Sears, and T-Mobile. While Fox/MySpace never got to the <a href="http://www.dbms2.com/2008/08/25/greenplum-is-in-the-big-leagues/" >predicted</a> 1-petabyte level of user data, T-Mobile is loosely projected to 	indeed get there. The same 1-petabyte projection is made more 	confidently about another Greenplum telecom customer (unnamed), 	which seems to be in the process of acquiring a 300-node Greenplum 	system.</li>
</ul>
<p style="margin-bottom: 0in;">The really interesting part of this announcement, however, is Greenplum Chorus. Greenplum agrees with my assertion that <strong>Greenplum Chorus is a new kind of data integration/ETL technology.</strong> In particular, Greenplum Chorus is designed around a stance I agree with, namely <a href="http://www.dbms2.com/2010/04/12/enterprise-data-warehouse-edw-myt/" >it&#8217;s unrealistic to put everything into a single enterprise data warehouse (EDW)</a>; you need to manage data marts as well, preferably in a coordinated way. Mainstream data integration/ETL (Extract/Transform/Load) vendors such as Informatica<span style="font-style: normal;"> or </span><a href="http://www.dbms2.com/category/products-and-vendors/ab-initio-software-corporation/" >Ab Initio</a><span style="font-style: normal;"> would surely say “That&#8217;s often quite true, and our technology can handle such scenarios just as it handles single-EDW-data-sink environments.” But Greenplum Chorus offers three capabilities not generally found in traditional data integration products (and offers only those three capabilities), namely:</span></p>
<ul>
<li><span style="font-style: normal;">Spin 	out data marts, whether by recopying the data or by creating a 	virtual data mart inside another data warehouse/mart.</span></li>
<li><span style="font-style: normal;">Find/discover 	data in databases across your enterprise.</span></li>
<li><span style="font-style: normal;">Do 	social networking around databases/data marts.</span></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">Greenplum Chorus is heading into early access soon, with general availability slated around midyear. Also in the mix is a Greenplum “Hypervisor” that can somehow relate to an almost unlimited number of nodes or databases; however, I didn&#8217;t get a lot of details on the Greenplum Hypervisor technology or on the target dates for delivering and integrating the Hypervisor with other parts of Greenplum&#8217;s technology.</span></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">When Greenplum first talked about the enterprise data cloud (EDC) idea, it emphasized <a href="../2009/06/08/the-future-of-data-marts/">the spinning out of physical data marts in an easy way</a></span>, as opposed to the virtual d<span style="font-style: normal;">ata marts pushed by <a href="../2009/10/27/teradatas-nebulous-cloud-strategy/">Oliver Ratzesberger and Teradata</a>. Greenplum Chorus, however, supports both kinds (as, at least directionally, does Teradata), specifically letting you choose between:</span></p>
<ul>
<li>“<span style="font-style: normal;">Independent 	sandboxes” – physical copies of the data, in a separate 	Greenplum database instance.</span></li>
<li>“<span style="font-style: normal;">Satellite 	sandboxes” – virtual data marts, of course managed by the same 	Greenplum database instance.</span></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">Actually, if you want to recopy data in the same Greenplum database instance, you can do that too, via something called “data sets,” but that&#8217;s not the main focus. Either option, I presume, can be configured to provide either or both of the two main benefits of spun-out data marts, namely:</span></p>
<ul>
<li><span style="font-style: normal;">Control 	over the performance and SLAs (Service-Level Agreements) of your 	analytic workload</span></li>
<li><span style="font-style: normal;">Ability 	to mix in new raw data and/or new aggregations</span></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">in either case without messing up the performance, SLAs, security, or “one truth-ness” of the existing database.</span></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">To provide those capabilities in an analytic DBMS, you need sufficiently robust parallel data movement (for the physical sandboxes) and workload management (for the virtual ones). Greenplum obviously believes it has both. Teradata makes the same claim. Other vendors would make similar assertions, and presumably will offer similar capabilities soon. You also want some kind of ability to ingest data from foreign databases, but that can be pretty routine stuff; e.g., in Release 1 of Chorus, Greenplum is content to offer ODBC access to Oracle, SQL Server, et al.</span></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">The “data discovery” and “social networking” aspects of Greenplum Chorus seem to be quite Release 1 as well. Basically, Greenplum lets people post discussion threads about databases and data marts, discussing what value can be derived from them. I guess somebody could include links to web-technology reports based on those databases, but otherwise there&#8217;s no integration with business intelligence tools and their collaboration capabilities. Even so, Greenplum reports that business executives liked this capability in early access testing.</span></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">Greenplum Chorus is ETL without a lot of T, and without a lot of performance optimizations either. That may not be much of a problem in its paradigmatic use case, spinning out a data mart quickly for some analysis to see if valuable conclusions can be drawn. Presumably, in the most successful cases, business and technical processes would emerge after the fact to pipe up-to-date versions of that analysis into operational systems, mooting any ETL deficiencies in the initial exploration. In a world where “data exploration” is becoming an increasingly important concept, something like Greenplum Chorus may suffice to provide significant customer value. But whether Greenplum Chorus&#8217;s capabilities are eventually co-opted by more fully-featured data integration suites remains an open question for the future.</span></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/04/12/greenplumchorus/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
