<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; Cloud computing</title>
	<atom:link href="http://www.dbms2.com/category/software-as-a-service-database-saas/cloud-computing/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Thu, 09 Feb 2012 09:21:51 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>Analytic trends in 2012: Q&amp;A</title>
		<link>http://www.dbms2.com/2011/11/21/analytic-trends-in-2012-qa/</link>
		<comments>http://www.dbms2.com/2011/11/21/analytic-trends-in-2012-qa/#comments</comments>
		<pubDate>Mon, 21 Nov 2011 11:00:23 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EMC]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[HP and Neoview]]></category>
		<category><![CDATA[QlikTech and QlikView]]></category>
		<category><![CDATA[SAP AG]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[Tableau Software]]></category>
		<category><![CDATA[Vertica Systems]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5692</guid>
		<description><![CDATA[As a new year approaches, it&#8217;s the season for lists, forecasts and general look-ahead. Press interviews of that nature have already begun. And so I&#8217;m working on a trilogy of related posts, all based on an inquiry about hot analytic trends for 2012. This post is a moderately edited form of an actual interview. Two [...]]]></description>
			<content:encoded><![CDATA[<p>As a new year approaches, it&#8217;s the season for lists, forecasts and general look-ahead. Press interviews of that nature have already begun. And so I&#8217;m working on a trilogy of related posts, all based on an inquiry about hot analytic trends for 2012.</p>
<p>This post is a moderately edited form of an actual interview. Two other posts cover analytic trends to watch (planned) and <a href="http://www.dbms2.com/2011/11/21/big-vendor-execution-analytics/">analytic vendor execution challenges to watch</a> (already up).</p>
<p><span id="more-5692"></span><strong>Question</strong>: What do you think will happen next year with the Tableaus of the world?</p>
<p><strong>Answer:</strong></p>
<ul>
<li>I think adoption of flexible-visualization business intelligence tools will continue to be rapid.</li>
<li>I think enterprise-friendly features will be increasingly important as a basis of competition.</li>
</ul>
<p><strong>Question</strong>: What do you mean by &#8220;enterprise-friendly&#8221;?</p>
<p><strong>Answer</strong>: An example would be <a href="http://www.dbms2.com/2011/11/16/qlikview-collaborative-business-intelligence/">QlikTech no longer forcing you to use their native ETL</a>, but rather working with Informatica and soon other third-party products. Also important can be:</p>
<ul>
<li>Database size.</li>
<li>Concurrency.</li>
<li>A full-featured development cycle for analytic applications.</li>
</ul>
<p><strong>Question</strong>: What does HP have to do to be relevant in analytics/data warehousing?</p>
<p><strong>Answer</strong>: Avoid stupidity. HP Vertica is already relevant.</p>
<p><strong>Question</strong>: OK. But what can HP do to build on Vertica?</p>
<p><strong>Answer</strong>: HP &#8212; which botched Exadata 1 hardware &#8212; could do a good job with SAP HANA or other kinds of appliance products.</p>
<p>However:</p>
<ul>
<li>I don&#8217;t think trying to force Vertica beyond its natural growth &#8212; <a href="http://www.dbms2.com/2011/04/16/unpacking-the-emc-greenplum-q1-sales-disaster-rumors/">the way EMC is with Greenplum</a> &#8212; is necessarily a good idea. Natural growth in Vertica&#8217;s case is plenty fast anyway.</li>
<li>Obviously, making good Vertica hardware would be nice. But being hardware-independent is crucial to Vertica, not least because of cloud deployment, an option many buyers want to at least have in their hip pockets.</li>
</ul>
<p><strong>Question</strong>: You expressed some skepticism toward mobile BI/use cases. Why so?</p>
<p><strong>Answer</strong>: The form factor hurts functionality a lot, so it&#8217;s only worthwhile in cases where timeliness is key.</p>
<p>And without more refined alert-setting functionality, it&#8217;s hard to think of that many cases.</p>
<p><em>Note: My views on mobile BI haven&#8217;t changed much since <a href="http://www.dbms2.com/2010/07/15/mobile-business-intelligence/">July, 2010</a>.</em></p>
<p><strong>Question</strong>: What about the idea of an enterprise being able to pay-per-drink to run jobs on an analytic cluster. Do you expect that concept to have any legs in 2012?</p>
<p><strong>Answer</strong>: While other kinds of SaaS (Software as a Service) BI might make sense, remote computing BI that focuses on hardware cost sharing is problematic. Moving data in and out of the cluster is a big part of the overall cost, at least if you plan to process it only occasionally once it gets there. I haven&#8217;t seen a plan yet that gets around that point.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/21/analytic-trends-in-2012-qa/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Are there any remaining reasons to put new OLTP applications on disk?</title>
		<link>http://www.dbms2.com/2011/09/19/oltp-disk-solid-state/</link>
		<comments>http://www.dbms2.com/2011/09/19/oltp-disk-solid-state/#comments</comments>
		<pubDate>Mon, 19 Sep 2011 18:07:07 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Infobright]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[dbShards and CodeFutures]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5257</guid>
		<description><![CDATA[Once again, I&#8217;m working with an OLTP SaaS vendor client on the architecture for their next-generation system. Parameters include: 100s of gigabytes of data at first, growing to &#62;1 terabyte over time. High peak loads. Public cloud portability (but they have private data centers they can use today). Simple database design &#8212; not a lot [...]]]></description>
			<content:encoded><![CDATA[<p>Once again, I&#8217;m working with an OLTP SaaS vendor client on the architecture for their next-generation system. Parameters include:</p>
<ul>
<li>100s of gigabytes of data at first, growing to &gt;1 terabyte over time.</li>
<li>High peak loads.</li>
<li>Public cloud portability (but they have <strong>private data centers they can use today).</strong></li>
<li>Simple database design &#8212; not a lot of tables, not a lot of columns, not a lot of joins, and everything can be distributed on the same customer_ID key.</li>
<li>Stream the data to a data warehouse that will grow to a few terabytes. (Keeping only one year of OLTP data online actually makes sense in this application, but of course everything should go into the DW.)</li>
</ul>
<p>So I&#8217;m leaning toward saying:   <span id="more-5257"></span></p>
<ul>
<li>They should go with a scalable, MySQL-based solution.
<ul>
<li>Lots of third-party software works with MySQL, in case that&#8217;s helpful.</li>
<li>Yes, any one vendor is small and not yet firmly established, but there are numerous vendors around with interesting MySQL scaling stories.</li>
<li>In a vendor emergency, just going with Oracle&#8217;s MySQL stuff would probably work &#8230;</li>
<li>&#8230; especially because there are these lovely things in the world called <strong>solid-state drives.</strong></li>
<li>There&#8217;s also good escapability if one wants to move away from MySQL, because everybody knows how to handle MySQL data.</li>
</ul>
</li>
<li>The first product to look at is dbShards, because it meets all the topology needs:
<ul>
<li>Local scale-out (<a href="http://www.dbms2.com/2011/02/24/transparent-sharding/">transparent sharding</a>).</li>
<li><a href="http://www.dbms2.com/2011/02/09/clarification-on-dbshards-shard-replication/">Local high availability</a>.</li>
<li>Remote disaster recovery (details of that are underway).</li>
</ul>
</li>
<li>The first analytic DBMS to look at is Infobright.
<ul>
<li>Yes, I know Infobright is focused more on machine-generated data these days, but this client&#8217;s analytic needs are so straightforward Infobright should pass with flying colors.</li>
<li>The MySQL-to-MySQL aspect should make ETL dead simple.</li>
<li>Again, there&#8217;s escapability.</li>
</ul>
</li>
</ul>
<p>Mainly, this is all fine. But I&#8217;m getting pushback on the solid-state aspect, for fear that it will compromise public cloud portability.</p>
<p>Am I missing something here? As far as I&#8217;m concerned, <strong>if you&#8217;re planning an OLTP system with a many-year lifespan today, </strong>of course <strong>you should assume solid-state storage.</strong> Maybe you scale out just as far as you would with disk, striping indexes or entire databases across the RAM of multiple servers. In that case, having solid-state backing reduces the risk of bottlenecks. Maybe you don&#8217;t scale out as far as you would with disk. In that case, solid-state backing saves you money.</p>
<p><strong>As for public-cloud support for solid-state storage, that&#8217;s coming fast, right? </strong>(Actually, I have data points in support of that theory, but they&#8217;re a bit tenuous.) A large fraction of web businesses with private data centers seem to be using solid-state storage &#8212; from Facebook on down &#8212; or so the NoSQL/NewSQL/<a href="http://www.dbms2.com/2011/03/02/short-request-processing/">short-request</a> DBMS guys tell me. Surely a number of public cloud vendors are close behind.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/09/19/oltp-disk-solid-state/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>Remote machine-generated data</title>
		<link>http://www.dbms2.com/2011/07/26/remote-machine-generated-data/</link>
		<comments>http://www.dbms2.com/2011/07/26/remote-machine-generated-data/#comments</comments>
		<pubDate>Tue, 26 Jul 2011 08:45:52 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Truviso]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5012</guid>
		<description><![CDATA[I refer often to machine-generated data, which is commonly generated inexpensively and in log-like formats, and is often best aggregated in a big bit bucket before you try to do much analysis on it. The term has caught on, to the point that perhaps it&#8217;s time to distinguish more carefully among different kinds of machine-generated [...]]]></description>
			<content:encoded><![CDATA[<p>I refer often to <a href="../../../../../2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a>, which is commonly generated inexpensively and in log-like formats, and is often best aggregated in a <a href="../../../../../2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a> before you try to do much analysis on it. The term has caught on, to the point that perhaps it&#8217;s time to distinguish more carefully among different <em>kinds</em> of machine-generated data. In particular, I think it may be useful to distinguish between:</p>
<ul>
<li><strong>Log-stream machine-generated data,</strong> when what you&#8217;re looking at &#8212; at least initially &#8212; is the entire output of verbose logging systems.</li>
<li><strong>Remote machine-generated data.</strong></li>
</ul>
<p>Here&#8217;s what I&#8217;m thinking of for the second category. I rather frequently hear of cases in which data is generated by large numbers of remote machines, which occasionally send messages home. For example:  <span id="more-5012"></span></p>
<ul>
<li>I heard yesterday about a case with 10s of millions of machines, phoning home every 5 minutes, and another case with 10s of 1000s calling in every 5 seconds, both of them sending data initially to MySQL. (Application details weren&#8217;t given.)</li>
<li>I heard not long ago about a set-top box case that the vendor hoped would also grow to 10s of millions of machines, which I guessed might send a small number of messages per hour each.</li>
<li>I also heard recently about a remote security monitoring case whose data was destined for (probably) Netezza, although in that case I&#8217;m not sure about the &#8220;occasionally&#8221; aspect of the communication.</li>
<li>The last time I visited Splunk, I got the sense that energy-sensor use cases (especially in the electric grid) had finally emerged. I believe these sensors are periodic message senders &#8212; they wake up, take their temperature (figuratively or literally as the case may be), send a message, snooze, repeat.</li>
<li>I would guess that the <a href="http://www.dbms2.com/2009/10/14/infobright-notes/">energy use cases</a> Infobright talked about in 2009 were of a similar kind.</li>
<li>An April, 2010 comment on the post linked above talks about <a href="http://www.dbms2.com/2010/04/08/machine-generated-data-example/#comment-165006">many kinds of sensor data</a>.</li>
<li>Back in 2007, <a href="http://www.dbms2.com/2007/08/12/applications-for-not-so-low-latency-cep/">Coral8</a> talked of a truck phone-home use case (on-board GPS data and also, e.g., refrigeration level, sending messages once a minute or so). Truviso seemed to have one similar deal before one of its frequent changes in strategic direction, and not coincidentally cites UPS as an investor.</li>
<li>In principle, there are a lot of RFID use cases out there, even if I rarely seem to hear of any. (That would be a shorter &#8220;phone call&#8221; home than most of the other examples, of course, but might be otherwise technically similar.)</li>
</ul>
<p>Many technologies can be used to collect and manage remote machine-generated data, but a few common points are worth noting:</p>
<ul>
<li>If a device takes the trouble to send a message across a wide-area network, that message may be somewhat more valuable than the average piece of log-vomit. Perhaps such information doesn&#8217;t need to be stored in the cheapest possible way.</li>
<li>Similarly, a message that is sent occasionally over time, or upon a specified event, may be more structured than a random log entry. Perhaps such data is suitable for sending straight to a <strong>relational database</strong>.</li>
<li>If there&#8217;s no central place the data originates, there may also be no favored place for the data to end up. It may make great sense to collect and analyze remote machine-generated data in the <strong>cloud. </strong>(Exceptions may of course arise if you want to use the data in connection with other information, and you hence want to bring it to that other information&#8217;s location.)</li>
<li>In a number of use cases, the whole point is to identify anomalies, and respond to them rapidly. I.e., remote machine-generated data use cases commonly raise challenges in low-latency <a href="http://www.dbms2.com/2011/03/30/short-request-and-analytic-processing/">integration of short-request and analytic processing</a>.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/26/remote-machine-generated-data/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Eight kinds of analytic database (Part 2)</title>
		<link>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/</link>
		<comments>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/#comments</comments>
		<pubDate>Tue, 05 Jul 2011 08:18:18 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Archiving and information preservation]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Buying processes]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Complex event processing (CEP)]]></category>
		<category><![CDATA[Data mart outsourcing]]></category>
		<category><![CDATA[Data types]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MOLAP]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Rainstor]]></category>
		<category><![CDATA[SAND Technology]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[SenSage]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4867</guid>
		<description><![CDATA[In Part 1 of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I&#8217;ll cover four more kinds of analytic database &#8212; even newer, for the most part, with a use case/product short list [...]]]></description>
			<content:encoded><![CDATA[<p>In <a href="http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/">Part 1</a> of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I&#8217;ll cover four more kinds of analytic database &#8212; even newer, for the most part, with a use case/product short list match that is even less clear.  <span id="more-4867"></span></p>
<p><strong><em>Bit bucket</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included: </em>Logs, other technical/external</li>
<li><em>Likely use styles:</em> Staging/ETL, investigative</li>
<li><em>Canonical example: </em>Log files in a Hadoop cluster</li>
<li><em>Stresses:</em> TCO, scale-out, transform/big-query performance, ETL functionality</li>
</ul>
<p>With the explosion of <a href="http://www.dbms2.com/2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a> has come the need for a place to put it all, sometimes called the <a href="http://www.dbms2.com/2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a>. This is like the investigative data mart for big databases, but more <a href="http://www.dbms2.com/2011/05/17/poly-structured-database/">poly-structured</a>. In some cases it is focused on data staging and transformation; but it can also be used for analysis in place.</p>
<p>The list of candidate technologies to run your bit bucket starts with Hadoop and Splunk.</p>
<p><strong><em>Archival data store</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included: </em>Operational, CDR (call detail record), security log</li>
<li><em>Likely use styles:</em> Archival, reporting (for compliance), possibly also investigative</li>
<li><em>Examples:</em> Any long-term detailed historical store</li>
<li><em>Stresses: </em>TCO, compression, scale-out, performance (if multi-use)</li>
</ul>
<p>Analytic DBMS vendors have been insulting each other with the claim &#8220;that&#8217;s just an archival data store,&#8221; dating back at least to the first time Greenplum was deployed on an underpowered Sun Thumper system. Perhaps only <a href="http://www.dbms2.com/2010/06/11/rainstor-update/">Rainstor</a> truly embraces the archival positioning, and I&#8217;ve become pretty dubious about their technical claims and their company alike.</p>
<p>Still, there&#8217;s a legitimate need for data stores &#8212; especially relational analytic DBMS &#8212; that:</p>
<ul>
<li>Store data cheaply, with high rates of compression.</li>
<li>Have decent performance if you do want to query the data.</li>
<li>May have archiving/compliance-specific features as well.</li>
</ul>
<p>Along with Rainstor, SAND and SenSage have at least partially targeted that use case. In addition, appliance vendors such as Teradata and Netezza try to have an archive-oriented product version in their lineups.</p>
<p><strong><em>Outsourced data mart</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All</li>
<li><em>Likely use styles:</em> Traditional BI, investigative analytics, staging/ETL</li>
<li><em>Examples:</em> Advertising tracking, SaaS CRM</li>
<li><em>Stresses:</em> Performance, TCO, reliability, concurrency</li>
</ul>
<p>Much of what happens in analytic database management can also be outsourced. Some applications that run via SaaS (Software as a Service) are analytic. I&#8217;ve had three different clients whose main business is picking marketing targets in various vertical segments; others who wanted to add analytics to what were historically OLTP applications; and others yet who just offered online business intelligence. Also, if your fundamental business is gathering data and reselling it to a variety of user organizations, that&#8217;s an analytic data management challenge. The possibilities expand from there.</p>
<p>Data outsourcers are in the IT business, and so their IT development is &#8212; hopefully! &#8212; more serious and less politically encumbered than at many conventional enterprises. Thus, legacy systems and master data management issues are commonly less prevalent, or at least more aggressively disposed of. The same, up to a point, goes for vendor politics.*  <a href="http://www.dbms2.com/2011/06/26/what-to-think-about-before-you-make-a-technology-decision/">Multitenancy</a> is commonly an issue, as is running in the cloud.</p>
<p><em>*Even so, there&#8217;s often That Guy who doesn&#8217;t want to migrate away from Oracle, no matter what.</em></p>
<p>Vertica gets the nod in a number of these cases; it&#8217;s cloud-friendly, and often the problem is naturally columnar. Other columnar products can be good choices too, with added brownie points for Infobright if the shop is MySQL-oriented anyway. Running Netezza or other appliances makes sense mainly if you&#8217;re pretty sure you want to keep operating your own data centers, but some data outsourcers are just fine with that assumption.</p>
<p><strong><em>Operational analytic(s) server</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> Customer-centric, log, financial trade</li>
<li><em>Likely use styles:</em> Advanced operational analytics</li>
<li><em>Examples:</em>
<ul>
<li>Lower latency: Web or call-center personalization, anti-fraud</li>
<li>Higher latency: Customer profiling, Basel 3 risk analysis</li>
</ul>
</li>
<li><em>Stresses:</em> Performance, reliability, analytic functionality, perhaps concurrency</li>
</ul>
<p>Even with eight different choices, I need a &#8220;catch-all&#8221; category; this is it.</p>
<p>Suppose you want to do reasonably sophisticated analytics, then use the results in operations. This is the classical challenge in <a href="http://www.dbms2.com/2011/03/30/short-request-and-analytic-processing/">integrating short-request and analytic processing</a>. There are multiple ways to tackle it, embodying different trade-offs in cost, convenience, or analytic accuracy. If the platform on which you want to run your investigative analytics also has the reliability and concurrency appropriate for mission-critical operations, you&#8217;re set. Otherwise, you may want to pipe <a href="http://www.dbms2.com/2010/11/29/data-that-is-derived-augmented-enhanced-adjusted-or-cooked/">derived data</a> into a more &#8220;industrial-strength&#8221; DBMS, ideally the one that runs your operational apps anyway.</p>
<p>Another option is to integrate a limited amount of analytics immediately into your short-request processing system. For example, as bad as they are at the kinds of queries that require joins, NoSQL systems are often fast at simple aggregations. As MapReduce/NoSQL integrations mature, that option may not require pumping the data anywhere else for deeper analytics; even if it does, at least you&#8217;re starting out with the data in a convenient bit bucket.</p>
<p>Streaming/CEP-centric architectures could come into play as well. And it goes on from there. The possibilities in this last category are just too varied to generalize about.</p>
<p><em>So did I get them all? Or are there yet other analytic data management use cases that don&#8217;t fit into my eight categories?</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>What to think about BEFORE you make a technology decision</title>
		<link>http://www.dbms2.com/2011/06/26/what-to-think-about-before-you-make-a-technology-decision/</link>
		<comments>http://www.dbms2.com/2011/06/26/what-to-think-about-before-you-make-a-technology-decision/#comments</comments>
		<pubDate>Sun, 26 Jun 2011 18:51:06 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Buying processes]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4835</guid>
		<description><![CDATA[When you are considering technology selection or strategy, there are a lot of factors that can each have bearing on the final decision &#8212; a whole lot. Below is a very partial list. In almost any IT decision, there are a number of environmental constraints that need to be acknowledged. Organizations may have standard vendors, [...]]]></description>
			<content:encoded><![CDATA[<p>When you are considering technology selection or strategy, there are a lot of factors that can each have bearing on the final decision &#8212; a whole lot. Below is a very partial list.</p>
<p>In almost any IT decision, there are a number of <strong>environmental constraints</strong> that need to be acknowledged. Organizations may have <strong>standard vendors</strong>, favored vendors, or simply vendors who give them <a href="http://www.dbms2.com/2011/06/24/observations-on-oracle-pricing/">particularly deep discounts</a>. <strong>Legacy systems</strong> are in place, application and system alike, and may or may not be open to replacement. Enterprises may have on-premise or off-premise preferences; SaaS (Software as a Service) vendors probably have <strong>multitenancy</strong> concerns. Your organization can determine which aspects of your system you&#8217;d ideally like to see be tightly <strong>integrated </strong>with each other, and which you&#8217;d prefer to keep only loosely coupled. You may have biases for or against <strong>open-source software.</strong> You may be pro- or anti-<strong>appliance.</strong> Some applications have a substantial need for elastic scaling. And some kinds of issues cut across multiple areas, such as <strong>budget</strong>, <strong>timeframe, security, </strong>or<strong> trained personnel.</strong></p>
<p>Multitenancy is particularly interesting, because it has numerous implications. <span id="more-4835"></span>If you&#8217;re a SaaS vendor supporting multiple customers, you must keep each customer&#8217;s data inaccessible to other users* &#8212; even if you offer high levels of flexibility or customization. You probably also want to keep data logically partitioned by user, in a way that the DBMS recognizes; you may also want that partition to hunt as a pack for caching purposes, especially if no one customer occupies a large part of your database. Administratively, you need a way to measure customer-specific metrics of the sort that might go into SLAs (Service-Level Agreements).</p>
<p><em>*Of course, there are exceptions. One of my clients is a SaaS vendor facilitating commerce; the whole point of their app is to let two different customers see and update the same records.</em></p>
<p>Getting more specific now, I&#8217;m usually called upon to <a href="http://www.monash.com/adviseusers.html">advise users</a> in two categories &#8212; those that already know they want to upgrade analytic functionality, and those that quickly realize they do once I remind them of it. Even so, many organizations struggle with the question &#8220;What do you want to do analytically?&#8221; It&#8217;s tough to blame them, for the question is distressingly circular; <strong>a big part of analytics is figuring out which kinds of analytics are worth doing.</strong> Also, SaaS vendors often struggle with the same question for a different reason, responding &#8220;Well, we know we&#8217;ve only been giving them basic stuff to date. What else do you think they would like?&#8221;</p>
<p>There&#8217;s no perfect solution to those difficulties, but a good way to start the evaluation is by assessing:</p>
<ul>
<li>The<strong> nature and value of your decisions that analytics could reasonably affect.</strong></li>
<li>Your <strong>realistic scope for automation of analytic decisions.</strong></li>
<li>The <strong>number and training of your &#8220;full-time analysts&#8221;</strong> &#8212; statisticians, SQL jocks who can program, SQL jocks who can&#8217;t really program, full-time users of BI tools, whatever.</li>
<li>The <strong>number and training of your &#8220;part-time analysts&#8221;</strong> &#8212; normal business users who can get something out of a dashboard, and perhaps even drill down into it.</li>
</ul>
<p>That should at least tell you which broad categories of analytics you want to engage in, and roughly how advanced in those areas you should try to be.</p>
<p><em>Basic business intelligence/dashboarding? Surely. Visualization-centric BI? If nothing else, it demos well. Basic predictive modeling? Hmm, are you sure nobody will want that? Advanced predictive modeling? Um, are you sure your users can handle that, or that the results will be worth the investment?</em></p>
<p>When I talk with users, there&#8217;s usually a data management problem in the mix too. In such cases, I quickly ask about <strong>data-related metrics</strong>, starting with database size, ingest volumes (batch, if relevant, but especially continuous), and simultaneous query load/concurrent user count. Similarly important are requirements for various kinds of <a href="http://www.dbms2.com/2009/09/10/analytic-speed-latency/">latency</a>, the big two being <strong>query response time</strong> and <strong>how long it takes for data to first be available for query. </strong>Less numeric questions in a similar vein boil down to &#8220;What kinds of requests will you make against the database, in what volume?&#8221;</p>
<p><em>And this loops back to the analytic-user inventory. Suppose you had a near-real-time dashboard &#8212; would anybody actually look at it minute to minute?</em></p>
<p>Specialized metrics I request when considering analytic DBMS include &#8220;How many columns are there in your widest table?&#8221; and &#8220;How many joins &#8212; or lines of SQL &#8212; are there in your most complex query?&#8221;, both of which are tools for assessing &#8220;Is your use case naturally columnar?&#8221;. Another, more general <strong>&#8220;natural structure of data&#8221;</strong> kind of consideration is what structure the data is in before it gets to the database being discussed; candidates include relational batch, XML stream, log file, and many more.</p>
<p>Also crucial are requirements for <strong><a href="http://www.dbms2.com/2010/05/01/ryw-read-your-writes-consistency/">consistency</a>, availability, </strong>and<strong> data integrity.</strong> Those tell you your needs in <strong>high availability </strong>and<strong> disaster recovery,</strong> and perhaps even how picky you have to be about your brands of hardware, software, or cloud/hosting provider. They also indicate how much you should care about relational or ACID properties, and where you should come down on <a href="http://www.dbms2.com/2010/03/12/some-nosql-links/">CAP Theorem</a> trade-offs.</p>
<p><em>I could go on even longer, but those seem like a pretty good set of initial questions with which to start discussions of data management, data integration, and analytic tools and architectures. What do you think I left out? And what do you think I could make substantially clearer by just adding a few more words? Any comments will be much appreciated.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/06/26/what-to-think-about-before-you-make-a-technology-decision/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Quick thoughts on Oracle-on-Amazon</title>
		<link>http://www.dbms2.com/2011/05/24/quick-thoughts-on-oracle-on-amazon/</link>
		<comments>http://www.dbms2.com/2011/05/24/quick-thoughts-on-oracle-on-amazon/#comments</comments>
		<pubDate>Tue, 24 May 2011 13:16:32 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Amazon and its cloud]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Oracle]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4557</guid>
		<description><![CDATA[Amazon has a page up for what it calls Amazon RDS for Oracle Database. You can rent Amazon instances suitable for running Oracle, and bring your own license (BYOL), or you can rent a &#8220;License Included&#8221; instance that includes Oracle Standard Edition One (a cheap version of Oracle that is limited to two sockets). My [...]]]></description>
			<content:encoded><![CDATA[<p>Amazon has a page up for what it calls <a href="http://aws.amazon.com/rds/oracle/">Amazon RDS for Oracle Database</a>. You can rent Amazon instances suitable for running Oracle, and bring your own license (BYOL), or you can rent a &#8220;License Included&#8221; instance that includes Oracle Standard Edition One (a cheap version of Oracle that is limited to two sockets).</p>
<p>My quick thoughts start:</p>
<ul>
<li>Mainly, this isn&#8217;t for production usage. But exceptions might arise when:
<ul>
<li>An  application, from creation to abandonment, is only expected to have a  short lifespan, in support of a specific project.</li>
<li>There is an extreme internal-politics bias toward operating versus capital expenses, or something like that, forcing a user department into cloud production deployment even when it doesn&#8217;t make much rational sense.</li>
<li>An application is small enough, or the situation is  sufficiently  desperate, that any inefficiencies are outweighed by convenience.</li>
</ul>
</li>
<li>There is non-production appeal. In particular:
<ul>
<li>Spinning up a quick cloud instance can make a lot of sense for a developer.</li>
<li>The same goes if you want to sell an Oracle-based application and need to offer demo/test capabilities.</li>
<li>The same might go for off-site replication/disaster recovery.</li>
</ul>
</li>
</ul>
<p>Of course, those are all standard observations every time something that&#8217;s basically on-premises software is offered in the cloud. They&#8217;re only reinforced by the fact that the only Oracle software Amazon can actually license you is a particularly low-end edition.</p>
<p>And Oracle is indeed on-premises software. In particular, Oracle is hard enough to manage when it&#8217;s on your premises, with a known  hardware configuration; who would want to try to manage a production  instance of Oracle in the cloud?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/05/24/quick-thoughts-on-oracle-on-amazon/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Introduction to SnapLogic</title>
		<link>http://www.dbms2.com/2011/05/13/introduction-to-snaplogic/</link>
		<comments>http://www.dbms2.com/2011/05/13/introduction-to-snaplogic/#comments</comments>
		<pubDate>Fri, 13 May 2011 06:14:46 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Data integration and middleware]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[SnapLogic]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4426</guid>
		<description><![CDATA[I talked with the SnapLogic team last week, in connection with their SnapReduce Hadoop-oriented offering. This gave me an opportunity to catch up on what SnapLogic is up to overall. SnapLogic is a data integration/ETL (Extract/Transform/Load) company with a good pedigree: Informatica founder Gaurav Dhillon invested in and now runs SnapLogic, and VC Ben Horowitz [...]]]></description>
			<content:encoded><![CDATA[<p>I talked with the SnapLogic team last week, in connection with their <a href="../../../../../2011/05/12/data-integration-vendors-and-hadoop/">SnapReduce Hadoop-oriented offering</a>. This gave me an opportunity to catch up on what SnapLogic is up to overall. SnapLogic is a data integration/ETL (Extract/Transform/Load) company with a good pedigree: Informatica founder Gaurav Dhillon invested in and now runs SnapLogic, and VC Ben Horowitz is involved. SnapLogic company basics include:</p>
<ul>
<li>SnapLogic has raised about $18 million from Gaurav Dhillon and Andreessen Horowitz.</li>
<li>SnapLogic has almost 60 people.</li>
<li>SnapLogic has around 150 customers.</li>
<li>Based in San Mateo, SnapLogic has an office in the UK and is growing its European business.</li>
<li>SnapLogic has both SaaS (Software as a Service) and on-premise availability, but either way you pay on a subscription basis.</li>
<li>Typical SnapLogic deal size is under $20K/year. Accordingly, SnapLogic sells over the telephone.</li>
<li>SnapReduce is in beta with about a dozen customers, and slated for release by year-end.</li>
</ul>
<p>SnapLogic&#8217;s core/hub product is called SnapCenter. In addition, for any particular kind of data one might want to connect, there are &#8220;snaps&#8221; which connect to &#8212; i.e. snap into &#8212; SnapCenter.</p>
<p>SnapLogic&#8217;s market position(ing) sounds like <a href="../../../../../2008/10/09/cloud-data-integration/">Cast Iron&#8217;s</a>, by which I mean: <span id="more-4426"></span></p>
<ul>
<li>As a practical matter, clients usually first buy SnapCenter to connect on-premise and SaaS applications.</li>
<li>SnapCenter supports cloud-to-cloud* and on-premise-to-on-premise integration as well.</li>
<li>SnapCenter itself runs either on-premise or in the cloud. (Larger customers at the moment tend to prefer on-premise deployment.)</li>
<li>SnapLogic suggests its products are simpler than many ETL alternatives.</li>
</ul>
<p>Not atypically, SnapLogic believes that SnapCenter is higher-end than Cast Iron (which is now <a href="../../../../../2010/09/28/ibm-cast-iron-systems-190-million-dollars/">an IBM company</a>), and that SnapCenter&#8217;s real top competitor is in-house/hand-coded integration.</p>
<p><em>*When discussing data integration, &#8220;SaaS&#8221; and &#8220;in the cloud&#8221; are close to  synonymous.</em></p>
<p>What SnapLogic said about its use cases seemed to boil down to:</p>
<ul>
<li>The base SnapLogic use case is when a client wants to push all data from one application to another application.</li>
<li>The source and target applications can be any combination of on-premise, SaaS (e.g. salesforce.com), and so on.</li>
<li>ETL purchases often start when somebody purchases a database (by which I presume SnapLogic also meant data feed).</li>
<li>SnapLogic sees three main kinds of use case:
<ul>
<li>One-time batch (a big move of historical data).</li>
<li>Classic repeating batch.</li>
<li>&#8220;Trigger-based&#8221; (I don&#8217;t think SnapLogic was using the term &#8220;trigger&#8221; just in its technical DBMS sense).</li>
</ul>
</li>
<li>SnapLogic sees a big future in providing a scalable integration layer for Twitter, RSS feeds, and so on, especially straight into websites, but I get the impression there are only a few pioneering users of such capabilities right now.</li>
</ul>
<p>The main technical sizzle in the SnapLogic story is the <a href="http://store.snaplogic.com/">SnapStore</a>, which lets you download free snaps and buy unfree ones.* SnapLogic says there are 100 or so snaps in the SnapStore, with a couple more being added weekly. That claim started making sense to me when SnapLogic said most snaps are offered by system integrators (as byproducts of specific integration contracts?) or software vendors (to connect to their own offerings?).</p>
<p><em>*I was expecting snap pricing to be subscription-based also, but when I went to the SnapStore this didn&#8217;t seem to be the case.</em></p>
<p>At least, I think that&#8217;s the main sizzle. I&#8217;ll confess to not having come away with much understanding of other nuances of SnapLogic technology. In particular, I don&#8217;t know what the core data interchange format is that allows all the &#8220;simplification&#8221; and &#8220;normalization&#8221; needed for this approach to be possible. So I didn&#8217;t drill down far enough to uncover any limitations (functionality or performance) in that aspect of the architecture, and the same goes for SnapCenter&#8217;s RESTfulness. That&#8217;s all probably my fault; SnapLogic did put a bunch of good people on the phone, and we did at least lay the groundwork for future understanding.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/05/13/introduction-to-snaplogic/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Attensity update</title>
		<link>http://www.dbms2.com/2011/04/14/attensity-update/</link>
		<comments>http://www.dbms2.com/2011/04/14/attensity-update/#comments</comments>
		<pubDate>Thu, 14 Apr 2011 12:07:11 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[Sybase]]></category>
		<category><![CDATA[Vertica Systems]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4295</guid>
		<description><![CDATA[I talked with Michelle de Haaff and Ian Hersey of Attensity back in February. We covered a lot of ground, so let&#8217;s start with a very high-level view. Two years ago, Attensity merged with two other companies in somewhat related businesses, thus expanding 4X or so in size. Due to the merger, Attensity now has [...]]]></description>
			<content:encoded><![CDATA[<p>I talked with Michelle de Haaff and Ian Hersey of Attensity back in February. We covered a lot of ground, so let&#8217;s start with a very high-level view.</p>
<ul>
<li>Two years ago, <a href="http://www.texttechnologies.com/2009/04/20/the-new-attensity-deal-overview/">Attensity merged with two other companies in somewhat related businesses</a>, thus expanding 4X or so in size.</li>
<li>Due to the merger, Attensity now has two core lines of business:
<ul>
<li>Text analytics.</li>
<li>Driving actions, such as call center or social media response, based on text analytics.</li>
</ul>
</li>
<li>The combined Attensity is part American, part German.</li>
<li>Attensity&#8217;s German part compels it to do some public financial reporting. Attensity will do $50-60 million in 2011 revenue.</li>
<li>Attensity crunches text in 17 languages. English is preeminent. #2 is &#8212; you guessed it! &#8212; German.</li>
<li>A big part of Attensity&#8217;s business (or at least of its value proposition) is analyzing the text in social media. Attensity boasts coverage of 75 million social media sources, such as blogs, forums, or review sites.</li>
</ul>
<p>The four most interesting technical points were probably:</p>
<ul>
<li><strong>Attensity has changed how it does </strong><a href="http://www.texttechnologies.com/2007/10/05/david-bean-of-attensity-explains-sentiment-and-other-qualifiers/"><strong>exhaustive extraction</strong></a><strong>.</strong> I&#8217;m having some trouble writing that part up, so for now I&#8217;ll just refer you to <a href="http://www.attensity.com/technology/semantic-server/exhaustive-extraction/">Attensity&#8217;s own description</a> of the new way of doing things.<em> </em></li>
<li><strong>Attensity has development work underway meant to address some of the problems in </strong><a href="http://www.texttechnologies.com/2010/12/01/state-of-the-art-text-analytics-mining-applications/"><strong>text analytics/other analytics integration</strong></a><strong>.</strong> I don&#8217;t feel I got enough detail to want to talk about that yet.</li>
<li><strong>Attensity runs its own data centers, with approximately 60 Hadoop/HBase nodes and 30 nodes of Apache Solr</strong> (open source text search). More on that below.</li>
<li><strong>Attensity now OEMs Vertica.</strong> More on that below too.</li>
</ul>
<p>Some more specific notes include:  <span id="more-4295"></span></p>
<ul>
<li>Attensity has long had customers who use text analytics as an input into churn analysis, for example <a href="http://www.attensity.com/2010/08/21/charles-schwab/">Charles Schwab</a>.</li>
<li>At least one customer, who may or may not wish to be named, uses Attensity technology to help <a href="../../../../../2011/01/11/the-technology-of-privacy-threats/">de-anonymize</a> social media posters. (I didn&#8217;t ask how that worked, actually.)</li>
<li>Attensity&#8217;s founding CTO David Bean has been gone for a while.</li>
<li>Social media analyzers generally require less sophisticated analytics than Attensity&#8217;s older kinds of customers.</li>
<li>Social media has, in part, a vocabulary all its own. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </li>
</ul>
<p><strong>Attensity and relational DBMS</strong></p>
<p>Notes on Attensity&#8217;s choice of DBMS to OEM include:</p>
<ul>
<li>Attensity uses Hadoop/HBase itself, but didn&#8217;t consider it realistic to try to persuade OEM customers to go that way.</li>
<li>I get the impression that Attensity&#8217;s two finalists were Vertica and Sybase IQ.</li>
<li>Attensity seems to have considered only query, and not more general <a href="../../../../../2011/02/24/analytic-platforms/">analytic platform</a> capabilities, which makes sense given that the evaluation was conducted (starting) in 2009.</li>
<li>One reason Vertica won was that it required very little tuning.</li>
<li>Another reason Vertica won was true MPP scale-out, notwithstanding that the largest known installation is capable of running on two nodes (although Attensity recommended that the customer get four just to be on the safe side).</li>
<li>Sybase IQ&#8217;s load speed was even better than Vertica&#8217;s.</li>
<li>Database max-size-to-date metrics include:
<ul>
<li>Under 1 terabyte.</li>
<li>50 million documents (not rows), growing by 1 million documents per day.</li>
<li>Several hundred million sentences (I guess the documents are short, but it makes sense that they would be).</li>
<li>Several billion rows.</li>
</ul>
</li>
</ul>
<p>It seems there are two parts to the Attensity schema. The raw output of &#8220;exhaustive extraction&#8221; sounds as if it has rather narrow rows. But Attensity then builds something more star-schema-like to feed into BI tools. Perhaps the latter is the reason for preferring columnar DBMS. There don&#8217;t seem to be a lot of auxiliary tables; the only ones Ian cited were:</p>
<ul>
<li>Category tables for the ontology, with up to a couple thousand rows</li>
<li>Tables of terms</li>
<li>Structured fields that provide metadata for the triples</li>
</ul>
<p>Previous Attensity database targets (partner, not OEM) included Teradata, SQL Server, Oracle, and MySQL. Hibernate layers were in the mix somewhere too. SQL Server actually had the best performance. I don&#8217;t think that&#8217;s counting a more recent Sybase IQ partnership, which only racked up a couple of sales.</p>
<p><strong>Attensity, Hadoop, and other non-relational technologies</strong></p>
<p>But that&#8217;s OEM. Attensity runs its own data centers, with approximately 60 Hadoop/HBase nodes and 30 nodes of Apache Solr (open source text search).* One reason for moving out of Amazon EC2 was that Solr cried out for solid-state drives; another was just cost.</p>
<p><em>*But those are just rough figures, from Ian&#8217;s memory.</em></p>
<p>Attensity uses HBase to store full-text documents. However, it doesn&#8217;t seem that this is a classic low-latency update HBase use case; Attensity reports doing 3 loads a day, 50 gigabytes of documents total. Apparently that works out to 1 billion documents/month; I gather Attensity just keeps them for 6 months. HBase has been nicely stable for Attensity.</p>
<p>Attensity uses Solr to build distributed search indexes. Solr has not been nicely stable.</p>
<p>What Attensity does in Hadoop seems to be rather simple NLP (Natural Language Processing), plus things one might do in a relational DBMS instead. Examples include:</p>
<ul>
<li>Named entity extraction.</li>
<li>Scoring for sentiment.</li>
<li>Influence scores, in whatever ways Attensity can calculate them.</li>
</ul>
<p>There surely also is some basic preprocessing, ingesting text (and document metadata) in various forms and normalizing it into a more standard format. Some real-time ingesting is done outside of Hadoop, in more of a queuing system, the most obvious example of that being the Twitter firehose. Ian suggested that in the future this system will get more uses, in the form of a <a href="http://www.texttechnologies.com/2006/07/27/uima-data-point/">UIMA</a>-like pipeline.</p>
<p>I further get the impression that Attensity uses Hadoop to do on a SaaS (Software as a Service) basis what its customers do in Vertica. The old idea that <a href="http://www.texttechnologies.com/2008/06/16/attensity-update-updated/">Attensity provides hosted services for about half its customers</a> still seems to apply, at least on the new-customer front. However, I&#8217;m not sure exactly which product lines Attensity was referring to when they said that.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/04/14/attensity-update/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Three kinds of software innovation, and whether patents could possibly work for them</title>
		<link>http://www.dbms2.com/2010/03/23/software-innovation-patent/</link>
		<comments>http://www.dbms2.com/2010/03/23/software-innovation-patent/#comments</comments>
		<pubDate>Tue, 23 Mar 2010 08:18:42 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1763</guid>
		<description><![CDATA[In connection with an attempt to articulate my views on software patents (more on those below), I was thinking about the different ways in which software development can be innovative. And it turns out that most forms of software innovation can, at their core, be assigned to one or more of three overlapping categories: Direct [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">In connection with an attempt to articulate my views on software patents (more on those below), I was thinking about the different ways in which software development can be innovative. And it turns out that most forms of software innovation can, at their core, be assigned to one or more of three overlapping categories:<span id="more-1763"></span></p>
<ul>
<li><strong>Direct improvement in user 	interface or functionality.</strong> Examples (again overlapping) 	include:
<ul>
<li>True UI enhancements.</li>
<li>Application functionality that 	just lets you do more.</li>
<li>Most modern <strong>mobile, web, and/or 	social software</strong> efforts, in which a relatively small amount of 	coding effort produces features that may or may not lead to rapid 	viral adoption.</li>
<li>Ease or functionality not just for 	end users, but also for <strong>administrators.</strong> In particular, SaaS, <a href="http://www.dbms2.com/category/software-as-a-service-database-saas/cloud-computing/">cloud</a>, <a href="http://www.dbms2.com/2009/06/08/the-future-of-data-marts/">private cloud</a> and/or <a href="http://www.dbms2.com/category/database-management-system/data-warehouse-appliances/">appliance</a> benefits are 	commonly concentrated in this area.</li>
<li>Languages and other <strong>programmer</strong> aids too.</li>
</ul>
</li>
<li><strong>Performance/efficiency 	improvement.</strong> Overlapping examples include:
<ul>
<li>Anything that directly purports to 	improve response time, hardware cost or utilization, or power/floor 	space consumption.</li>
<li>Anything to do with 	<a href="http://www.dbms2.com/category/parallelization/">parallelization</a> or scale-out.</li>
<li>Many, many under-the-covers 	enhancements to make data more protected (against theft or loss 	alike), user features snazzier, and so on. With a few exceptions – 	which are generally regarded as unsolved artificial intelligence 	problems – almost anything can be hacked together quickly in some 	high-level programming tool, assuming performance is of no 	concern. It&#8217;s getting the performance remotely right that can often 	slow market introduction.</li>
</ul>
</li>
<li><strong>New or enhanced logical data 	model.</strong> Examples of innovation via data model – either truly 	new or else just newly implemented in a performant way &#8212; include:
<ul>
<li><strong>A huge fraction of application 	innovation,</strong> in “traditional” functionality and workflow 	alike. In several technological eras, just about everything about 	applications has been a commodity <strong>except</strong> the data model, but 	the data model alone was enough to provide long-lasting product 	differentiation. Indeed, it probably is true today, although that 	may finally change as business intelligence integration becomes a 	large part of application software technology.</li>
<li>Most things that are called 	<strong>knowledge representation.</strong></li>
<li>Many things that are described by 	terms like <a href="http://www.dbms2.com/2010/01/17/three-broad-categories-of-data/">“unstructured” or “semi-structured”</a> data.</li>
<li>Most innovations described by 	terms such as <strong>metadata management.</strong></li>
</ul>
</li>
</ul>
<p style="margin-bottom: 0in;">To check that I&#8217;m not being too glib here, let&#8217;s consider a few categories of software technology.</p>
<ul>
<li><strong>MPP analytic DBMS</strong> are all 	about performance/efficiency improvement (whether of SQL queries or 	<a href="http://www.dbms2.com/2010/02/22/netezza-twinfin/">other analytics</a>), except when they&#8217;re about ease of 	administration and the like.</li>
<li><a href="http://www.dbms2.com/2009/10/10/enterprises-using-hadoo/">Hadoop</a> is about scaling out 	cheap machines in a way that is (for some purposes) easy to program.</li>
<li>The core of <a href="http://www.dbms2.com/2010/03/14/nosql-taxonomy/">NoSQL</a> is about 	efficient scale-out; easier programming also plays a big role.</li>
<li>Disruptive small vendor <strong>business 	intelligence</strong> innovation has a lot to do with <a href="http://www.dbms2.com/2009/05/30/reinventing-business-intelligence/">better and more 	useful user experiences</a>,<span style="font-style: normal;"> except 	when it&#8217;s about ease of programming and/or administration. The BI 	industry is also moving to in-memory analytics, which harnesses 	better performance to provide more interactive user experiences.</span></li>
<li><em>SAS,</em><span style="font-style: normal;"> which has long competed on the basis of superior functionality for 	statistical programmers, is now also on a big performance kick via 	MPP analytic DBMS partnerships.</span></li>
<li><span style="font-style: normal;"><strong>Oracle&#8217;s 	DBMS efforts</strong></span><span style="font-style: normal;"> have long 	been focused on </span><a href="http://www.dbms2.com/2010/01/22/oracle-database-hardware-strategy/">performance</a><span style="font-style: normal;"> and </span><a href="http://www.monash.com/oracle10g.html">administrative usability</a>.</li>
<li><span style="font-style: normal;">As 	noted above, </span><span style="font-style: normal;"><strong>enterprise 	application</strong></span><span style="font-style: normal;"> functionality is usually all about the data model. Exceptions arise 	when there is a major generation of UI functionality, such as 	interactivity (long ago), GUIs (ditto), or BI integration (in its 	early days now). SaaS is also pitched as an ease-of-everything play.</span></li>
<li><span style="font-style: normal;"><strong>Administrative 	tools</strong></span><span style="font-style: normal;"> are usually about 	making administration easier. In a few cases (e.g., backups), 	they&#8217;re more about performance.</span></li>
</ul>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">I&#8217;d say my proposed trichotomy is holding up pretty well.</span></p>
<p style="margin-bottom: 0in;"><span style="font-style: normal;">So what set me off on this line of reasoning? Well, </span><a href="http://redmonk.com/sogrady/2010/03/19/software-patents/">Stephen O&#8217;Grady</a><span style="font-style: normal;"> wrote</span></p>
<blockquote>
<p style="margin-bottom: 0in;">The reason I am against software patents is … very simple. … I am against software patents because it is not reasonable to expect that the current patent system, nor even one designed to improve or replace it, will ever be able to accurately determine what might be considered legitimately patentable from the overwhelming volume of innovations in software. Even the most trivial of software applications involves hundreds, potentially thousands of design decisions which might be considered by those aggressively seeking patents as potentially protectable inventions. If even the most basic elements of these are patentable, as they are currently, the patent system will be fundamentally unable to scale to meet that demand. As it is today.</p>
<p><span style="font-style: normal;">In addition to questions of volume are issues of expertise; for some of the proposed inventions, there may only be a handful of people in the world qualified to actually make a judgment on whether a development is sufficiently innovative so as to justify a patent. None of those people, presumably, will be employed by the patent office. &#8230; Nor will two developers always come to the same conclusions as to the degree to which a given invention is unique. </span></p></blockquote>
<p><span style="font-style: normal;">In considering whether I agreed, I realized that the analysis is different for each of my three categories of innovation mentioned above.</span></p>
<ul>
<li><span style="font-style: normal;">In the case of a </span><span style="font-style: normal;"><strong>logical 	data model,</strong></span><span style="font-style: normal;"> O&#8217;Grady is 	almost surely right. Many of those are just copied from the real 	world anyway, and hence don&#8217;t meet any kind of “novel and 	non-obvious” test. The rest are so general and abstract it&#8217;s 	really hard to say what – if anything – is new and non-obvious 	about them vs. well-established, often academic prior art.</span></li>
<li><span style="font-style: normal;">In the case of </span><span style="font-style: normal;"><strong>performance 	enhancements,</strong></span><span style="font-style: normal;"> the core 	ideas can usually also be found in well-established computer science 	publications. What&#8217;s more, the true innovations may be such simple 	algorithms that they&#8217;re not patentable. What&#8217;s left over is </span><a href="http://www.dbms2.com/2009/08/21/bottleneck-whack-a-mole/">incremental enhancement</a>.<span style="font-style: normal;"> Once again, O&#8217;Grady is right.</span></li>
<li><span style="font-style: normal;">But the case of </span><span style="font-style: normal;"><strong>user 	interface/experience enhancements</strong></span><span style="font-style: normal;"> is not so clear. Inventor comes up with a useful idea for something 	that hasn&#8217;t been built before. Inventor builds and patents it. I&#8217;m 	not sure how that&#8217;s different from the case of building physical 	devices of various kinds, which have been patented for centuries. 	Determining what&#8217;s novel or non-obvious doesn&#8217;t seem to require 	specialized technical knowledge, at least not above and beyond that 	required in other disciplines. </span></li>
</ul>
<p><span style="font-style: normal;"><strong>Bottom line:</strong></span><span style="font-style: normal;"> There are many other reasons to oppose software patents, but Stephen O&#8217;Grady&#8217;s “It&#8217;s impossible to adjudicate them fairly” argument remains unproven, at least when it is applied to software enhancements whose essence is better designs for user experiences.</span></p>
<p><em><strong>Related links:</strong></em></p>
<ul>
<li><span style="font-style: normal;">My negative comments about 	patents in the areas of <a href="http://www.dbms2.com/2010/02/11/google-mapreduce-patent/">MapReduce</a> and <a href="http://www.dbms2.com/2010/01/15/vertica-sybase-ipatent-litigation/">columnar DBMS</a></span></li>
<li><a href="http://www.monashreport.com/2006/04/06/microsoft-underscores-its-core-paradigm/"><span style="font-style: normal;">Three standpoints from which 	to view a software product strategy</span></a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/03/23/software-innovation-patent/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Open issues in database and analytic technology</title>
		<link>http://www.dbms2.com/2010/02/01/open-issues-in-database-and-analytic-technology/</link>
		<comments>http://www.dbms2.com/2010/02/01/open-issues-in-database-and-analytic-technology/#comments</comments>
		<pubDate>Mon, 01 Feb 2010 22:04:31 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Presentations]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=1507</guid>
		<description><![CDATA[The last part of my New England Database Summit talk was on open issues in database and analytic technology. This was closely intertwined with the previous section, and also relied on a lot that I&#8217;ve posted here. So I&#8217;ll just put up a few notes on that part, with lots of linkage to prior discussion [...]]]></description>
			<content:encoded><![CDATA[<p style="margin-bottom: 0in;">The last part of my <a href="http://www.dbms2.com/2009/11/25/new-england-database-summit-january-28-2010/">New England Database Summit</a> talk was on open issues in database and analytic technology. This was closely intertwined with the <a href="http://www.dbms2.com/2010/01/31/trends-database-aanalytic-technology/">previous section</a>, and also relied on a lot that I&#8217;ve posted here. So I&#8217;ll just put up a few notes on that part, with lots of linkage to prior discussion of the same points.<span id="more-1507"></span></p>
<ul>
<li>The most important issue in 	database and analytic technology, in my opinion, isn&#8217;t technological 	at all – rather, it&#8217;s the legal and political steps needed to <a href="http://www.dbms2.com/2010/01/31/data-based-snooping-threat-libert/"> preserve liberty</a> in the face of advancing, intrusive 	technology.</li>
<li>Another important issue for 	society – and this one does involve a lot of technology – is 	scientific number crunching. In particular, <a href="http://www.dbms2.com/2009/10/03/issues-in-scientific-data-management/">database technology for 	scientific computing</a> needs to be developed much further. I&#8217;ll have 	more to say on all this soon.</li>
<li>More generally, technology needs 	to keep advancing for parallel analytics. Fortunately, it is. Watch 	this space over the next few weeks.</li>
<li>Oracle has said, in effect, that <a href="http://www.dbms2.com/2010/01/22/oracle-database-hardware-strategy/"> its most important technological challenge of the decade</a> is getting 	<a href="http://www.dbms2.com/2010/01/31/flash-pcmsolid-state-memory-disk/">solid-state memory</a> right. I agree.</li>
<li>Data volumes will keep going up, 	up, up. Technology needs to keep evolving accordingly. Much of what 	I write is on that subject.</li>
<li>Data needs to be processed and analyzed at <a href="http://www.dbms2.com/2009/09/10/analytic-speed-latency/">very 	different latencies</a>. And there&#8217;s much further to go in integrating 	disparate latencies.</li>
<li>Analytic database management in 	the cloud hasn&#8217;t been solved yet, especially for Big Data. Among the 	reasons are the difficulty of moving data into the cloud (unless it 	originated there), the slowness of moving it from node to node in 	shared-nothing architectures (which reduces the elasticity benefit), 	and above all the long and unpredictable latencies of interprocessor 	communication while queries are running (a key subject of discussion 	at the <a href="http://www.dbms2.com/2009/11/23/boston-big-data-summit-keynote-outline/">Boston Big Data Summit</a>).</li>
<li>Better business intelligence user interfaces are increasingly available. I&#8217;m thinking particularly of approaches with buzzwords like <a href="http://www.dbms2.com/2008/08/04/qliktech-qlikview-update/">visualization/interactive exploration</a> or <a href="http://www.texttechnologies.com/2007/08/03/the-case-for-inxight-awareness-server/">faceted</a>. But they aren&#8217;t well integrated into the overall analytic stack, as big BI vendors are trailing the smaller ones in this regard. (Part of the problem relates to my previous point.)</li>
<li>Application development over text 	search isn&#8217;t in the same league as application development over 	relational DBMS. The choices are mainly XML (e.g., <a href="http://www.texttechnologies.com/2008/04/29/mark-logic-viewed-as-a-different-kind-of-text-search-technology-vendor/">MarkLogic</a>), SQL 	for text integrated into RDBMS (limited by the weakness of those 	integrations), and something like <a href="http://www.texttechnologies.com/2008/09/20/attivio-update/">Attivio&#8217;s Java SDK</a>. There&#8217;s a 	major conceptual barrier in building those apps, namely the 	unpredictability of query results. Still, it should be possible to 	do better.</li>
<li>Similarly, text analytics and conventional analytics coexist well. They can even be in the same database and/or dashboard, although in practice that is limited by the strong <a href="http://www.texttechnologies.com/2008/10/24/attensity-update-2/">SaaS focus of text mining vendors and users</a>. But analytic integration of them is really hard. Linguistic imprecision is, in my opinion, only the #2 reason for this difficulty. The #1 reason is that trends detected by text analytics are much less precise than trends on tabular data – e.g., a 50% increase in a certain kind of complaint may be no more significant than a 5% change in a revenue variable.</li>
<li>I&#8217;m increasingly persuaded that <a href="http://www.dbms2.com/2009/08/21/social-network-analysis-aka-relationship-analytics/"> graph analytics</a> can be handled without a graph-centric data model. 	But right now, it isn&#8217;t being handled well at all. Lots more needs 	to be done – although when it is, it will just exacerbate the 	privacy/liberty dangers that so concern me.</li>
</ul>
<p><em><strong>Other posts based on my January 2010 New England Database Summit keynote address</strong></em></p>
<ul>
<li><a title="Data-based snooping — a huge threat to liberty that we’re all helping make worse" href="../2010/01/31/data-based-snooping-threat-libert/">Data-based snooping — a huge threat to liberty that we’re all helping make worse</a></li>
<li><a title="Flash, other solid-state memory, and disk" href="../2010/01/31/flash-pcmsolid-state-memory-disk/">Flash, other solid-state memory, and disk</a></li>
<li><a title="Interesting trends in database and analytic technology" href="../2010/01/31/trends-database-aanalytic-technology/">Interesting trends in database and analytic technology</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2010/02/01/open-issues-in-database-and-analytic-technology/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>


