<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; Sybase</title>
	<atom:link href="http://www.dbms2.com/category/products-and-vendors/sybase/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Wed, 08 Feb 2012 12:22:57 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>Terminology: Data mustering</title>
		<link>http://www.dbms2.com/2011/11/28/terminology-data-mustering/</link>
		<comments>http://www.dbms2.com/2011/11/28/terminology-data-mustering/#comments</comments>
		<pubDate>Mon, 28 Nov 2011 19:10:11 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Complex event processing (CEP)]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Sybase]]></category>
		<category><![CDATA[Teradata]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5736</guid>
		<description><![CDATA[I find myself in need of a word or phrase that means bring data together from various sources so that it&#8217;s ready to be used, where the use can be analysis or operations. The first words I thought of were &#8220;aggregation&#8221; and &#8220;collection,&#8221; but they both have other meanings in IT. Even &#8220;data marshalling&#8221; has [...]]]></description>
			<content:encoded><![CDATA[<p>I find myself in need of a word or phrase that means <strong>bring data together from various sources so that it&#8217;s ready to be used,</strong> where the use can be analysis or operations. The first words I thought of were &#8220;aggregation&#8221; and &#8220;collection,&#8221; but they both have other meanings in IT. Even &#8220;data marshalling&#8221; has a specific meaning different from what I want. So instead, I&#8217;ll go with <strong>data mustering.</strong></p>
<p>I mean for the term &#8220;data mustering&#8221; to encompass at least three scenarios:</p>
<ul>
<li>Integrated (relational) data warehouse.</li>
<li>Big bit bucket.</li>
<li>Big bit stream.</li>
</ul>
<p>Let me explain what I mean by each.  <span id="more-5736"></span></p>
<p><strong>&#8220;Integrated data warehouse&#8221;</strong> is a phrase Teradata has started using for enterprise data warehouses that, <a href="../../../../../2010/04/12/enterprise-data-warehouse-edw-myt/">like approximately every other EDW in the entire history of data warehousing</a>, aren&#8217;t truly enterprise-wide. In other words, it means &#8220;not just a data mart&#8221;. <a href="http://www.strategicmessaging.com/no-market-categorization-is-ever-precise/2011/03/01/">No category name is perfect</a>, but I think that one works reasonably well.</p>
<p>I previously described the <strong><a href="../../../../../2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a></strong> use case as</p>
<blockquote><p>Users take a whole lot of data, often <a href="../../../../../2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a> in logs of different kinds, and dump it into one place, managed by Hadoop, at open-source pricing.</p></blockquote>
<p>and quickly added</p>
<blockquote><p>Of course, there are various outfits who’d like to sell you not-so-cheap bit buckets. Contending technologies include <a href="../../../../../2011/06/02/why-you-would-want-an-appliance-and-when-you-wouldnt/">Hadoop appliances</a> (which I don’t believe in), <a href="../../../../../2009/10/18/technical-introduction-to-splunk/">Splunk</a> (which in many use cases I do), and <a href="../../../../../2010/11/29/marklogic-and-its-document-dbms/">MarkLogic</a> (ditto, but often the cases are different from Splunk’s). Cloudera and IBM, among other vendors, would also like to sell you some proprietary software to go with your standard Apache Hadoop code.</p></blockquote>
<p>I think I&#8217;ll stand pat on that explanation. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>By analogy, a <strong>big bit stream</strong> is various streams of data, assembled in the custody of a streaming engine. Sybase told me Wednesday that this scenario appears in both of the traditional markets for CEP/streaming: national intelligence, where it is a major use of streaming, and capital markets, where it shows up in some use cases. That&#8217;s consistent with what I&#8217;ve heard from other CEP/streaming vendors.</p>
<p>As for where I got the word &#8220;mustering&#8221; &#8212; it&#8217;s a military term, for when you assemble your troops and their gear either for inspection or for actual use. The main modern usage I know of the word is as part of the phrase &#8220;pass muster&#8221;, which originally referred to the concept that the person being paid to put a regiment together should from time to time demonstrate that the regiment physically existed in the form that regimental records seemed to show.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/28/terminology-data-mustering/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Exasol update</title>
		<link>http://www.dbms2.com/2011/11/12/exasol-update/</link>
		<comments>http://www.dbms2.com/2011/11/12/exasol-update/#comments</comments>
		<pubDate>Sun, 13 Nov 2011 02:37:13 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Benchmarks and POCs]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Exasol]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Pricing]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Sybase]]></category>
		<category><![CDATA[Workload management]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5661</guid>
		<description><![CDATA[I last wrote about Exasol in 2008. After talking with the team Friday, I&#8217;m fixing that now. The general theme was as you&#8217;d expect: Since last we talked, Exasol has added some new management, put some effort into sales and marketing, got some customers, kept enhancing the product and so on. Top-level points included: Exasol&#8217;s [...]]]></description>
			<content:encoded><![CDATA[<p><a href="../../../../../2008/08/16/exasol-technical-briefing/">I last wrote about Exasol in 2008</a>. After talking with the team Friday, I&#8217;m fixing that now. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  The general theme was as you&#8217;d expect: Since last we talked, Exasol has added some new management, put some effort into sales and marketing, got some customers, kept enhancing the product and so on.</p>
<p>Top-level points included:</p>
<ul>
<li>Exasol&#8217;s technical philosophy is substantially the same as before, albeit not with as extreme a focus on fitting everything in RAM.</li>
<li>Exasol believes its flagship DBMS EXASolution has great performance on a load-and-go basis.</li>
<li>Exasol has 25 EXASolution customers, all in Germany.*</li>
<li>5 of those are &#8220;cloud&#8221; customers, at hosting providers engaged by Exasol.</li>
<li>EXASolution database sizes now range from the low 100s of gigabytes up to 30 terabytes.</li>
<li>Pretty much the whole company is in Nuremberg.</li>
</ul>
<p><span id="more-5661"></span><em>*That excludes some money from Hitachi. Exasol&#8217;s Hitachi partnership is still in limbo, an apparent casualty of the world economic crisis.</em></p>
<p>On the technical side:</p>
<ul>
<li>As noted in my 2008 post, EXASolution is a columnar, no-head-node MPP (Massively Parallel Processing) DBMS.</li>
<li>The main way EXASolution compresses data is via dictionary/tokenization. 5:1 is a typical compression ratio before mirroring and so on, out of a 2-10:1 range.</li>
<li>EXASolution writes data to in-memory blocks that are smaller than its otherwise preferred block size (1/2 to 5 megabytes). These blocks are sent to disk, where they are eventually merged. Exasol insists that write performance has always been fully satisfactory to customers to date.</li>
<li>EXASolution doesn&#8217;t have much in the way of performance tuning knobs. Exasol says they aren&#8217;t needed, and says that one really can start an EXASolution POC (Proof of Concept) in a day or so.</li>
<li>EXASolution doesn&#8217;t have much in the way of workload management capabilities, except what&#8217;s automagic (e.g., short query bias). However, it does collect statistics you can query via your favorite BI tool.</li>
<li>EXASolution doesn&#8217;t have much in the way of <a href="../../../../../2011/02/24/analytic-platforms/">analytic platform</a> capabilities, although there is some Lua-based scripting. However, there&#8217;s something NDA in the analytic platform area Coming Soon.*</li>
</ul>
<p>In general, the whole thing sounds somewhat like ParAccel, at least at a high level.</p>
<p><em>*Exasol is not and never has been our client, but we can keep secrets for them even so.</em></p>
<p>Naturally, Exasol believes EXASolution has fine concurrency, with at least one customer routinely running 2000 concurrent users, 200 concurrent sessions (via connection pooling), and 5-10 concurrent queries. Another customer has 3500 Cognos users. 1-200 concurrent queries appears to be the record peak load. Anyhow, Exasol says that plans to offer real workload management could be accelerated if a need were discovered.</p>
<p>Exasol says it almost never loses POCs, but admits that it competes fairly rarely against Vertica and ParAccel, no doubt for reasons of geography. Exasol boasts one visible Sybase IQ replacement (Sony Music).</p>
<p>While Exasol&#8217;s sales to date have been in Germany, there are plans to change that soon. At least one sales cycle is well underway in Eastern Europe. Offices in other Germanic countries are planned. Existing customers are planning to deploy additional copies outside Germany. Discussions are underway regarding other geographies, e.g. English-speaking ones.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/12/exasol-update/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Very brief CEP/streaming catchup</title>
		<link>http://www.dbms2.com/2011/11/10/cep-streaming-catchup/</link>
		<comments>http://www.dbms2.com/2011/11/10/cep-streaming-catchup/#comments</comments>
		<pubDate>Fri, 11 Nov 2011 03:29:37 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Complex event processing (CEP)]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[StreamBase]]></category>
		<category><![CDATA[Sybase]]></category>
		<category><![CDATA[Truviso]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5632</guid>
		<description><![CDATA[When I agreed to launch the StreamBase LiveView product via DBMS 2, I planned to catch up on the whole CEP/streaming area first. Due to the power and internet outages last week, that didn&#8217;t entirely happen. So I&#8217;ll do a bit of that now, albeit more cryptically than I hoped and intended. The upshot of [...]]]></description>
			<content:encoded><![CDATA[<p>When I agreed to launch the StreamBase LiveView product via <em>DBMS 2,</em> I planned to catch up on the whole CEP/streaming area first. Due to the power and internet outages last week, that didn&#8217;t entirely happen. So I&#8217;ll do a bit of that now, albeit more cryptically than I hoped and intended.</p>
<ul>
<li>The upshot of my <a href="../../../../../2011/08/25/renaming-cep-or-not/">what to call CEP thread</a> in August was that &#8220;streaming&#8221; and &#8220;event processing&#8221; are not the same concept, but it so happens that they have the most traction where they intersect. That said, I both observe and endorse an apparent shift from &#8220;event&#8221; to &#8220;stream&#8221; as the core of the terminology, in <a href="../../../../../2008/03/19/what-to-call-cep/">a reversal of my opinion of several years ago</a>.</li>
<li>IBM continues to throw a lot of resources at its <a href="../../../../../2009/05/13/ibm-system-s-infosphere-streams-processing/">System S/InfoSphere Streams</a> product, but I haven&#8217;t heard yet of much marketplace success. That said, I believe IBM is still pretty serious about Streams, as one would expect from an effort whose code name so cheekily references <a href="http://www.softwarememories.com/2008/10/02/a-bit-of-db2-history-per-ibm/">System R</a>. In particular, Streams shows up prominently on IBM&#8217;s top-level analytic architecture slide.</li>
<li>Sybase recently released its ESP (Event Stream Processor) 5.0, which it says is the full merger of the Aleri and Coral8 predecessors. You can still get Sybase ESP without buying into the full <a href="../../../../../2010/02/05/sybase-aleri-rap/">Sybase RAP</a> stack, and Sybase has no plans to change that.</li>
<li>Sybase has discontinued all <a href="../../../../../2009/03/25/aleri-update/">the business intelligence types of products Aleri and Coral8 were developing</a>. Rather, Sybase is OEMing Panopticon, which it reports has been well received. Other than the discontinuation of the BI efforts, there seem to be few Aleri or Coral8 features missing from the merged Sybase ESP product.</li>
<li>Truviso continues to be <a href="../../../../../2010/05/04/truviso-evidently-reinvents-itself/">out of the picture</a>.</li>
<li>I have more to say about <a href="http://www.dbms2.com/2011/11/10/streambase-catchup/">StreamBase</a> separately.</li>
<li>I have more to say about Sybase and IBM, which I&#8217;ll get to when I can.</li>
<li>I have nothing new on Progress Apama. I also know little about any of the open source efforts.</li>
</ul>
<p>Meanwhile, if you want to see technically nitty-gritty posts about the CEP/streaming area, you may want to look at <a href="../../../../../category/memory-centric-data-management/event-stream-processing/page/4/">my CEP/streaming coverage circa 2007-9</a>, based on conversations with (among others) <a href="../../../../../2007/06/18/mike-stonebraker-on-financial-stream-processing/">Mike Stonebraker</a>, <a href="../../../../../2007/08/03/a-deeper-dive-into-apama/">John Bates</a>, and <a href="../../../../../2007/08/10/the-essence-of-cep-according-to-coral8/">Mark Tsimelzon</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/10/cep-streaming-catchup/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Compression in Sybase ASE 15.7</title>
		<link>http://www.dbms2.com/2011/10/13/compression-in-sybase-ase-15-7/</link>
		<comments>http://www.dbms2.com/2011/10/13/compression-in-sybase-ase-15-7/#comments</comments>
		<pubDate>Fri, 14 Oct 2011 04:29:18 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Sybase]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5478</guid>
		<description><![CDATA[Sybase recently came up with Adaptive Server Enterprise 15.7, which is essentially the &#8220;Make SAP happy&#8221; release. Features that were slated for 2012 release, but which SAP wanted, were accelerated into 2011. Features that weren&#8217;t slated for 2012, but which SAP wanted, were also brought into 2011. Not coincidentally, SAP Business Suite will soon run [...]]]></description>
			<content:encoded><![CDATA[<p>Sybase recently came up with Adaptive Server Enterprise 15.7, which is essentially the &#8220;Make SAP happy&#8221; release. Features that were slated for 2012 release, but which SAP wanted, were accelerated into 2011. Features that weren&#8217;t slated for 2012, but which SAP wanted, were also brought into 2011. Not coincidentally, SAP Business Suite will soon run on Sybase Adaptive Server Enterprise 15.7.</p>
<p>15.7 turns out to be the first release of Sybase ASE with data compression. Sybase fondly believes that it is matching <a href="http://www.dbms2.com/2010/06/21/netezza-ibm-db2-compression/">DB2</a> and leapfrogging Oracle in compression rate with a single compression scheme, namely <strong>page-level tokenization.</strong> More precisely, SAP and Sybase seem to believe that about compression rates for actual SAP application databases, based on some degree of testing.   <span id="more-5478"></span></p>
<p><em>While Sybase ASE is unambiguously a row store, I&#8217;d be OK with calling that &#8220;<a href="http://www.dbms2.com/2011/02/06/columnar-compression-database-storage/">columnar compression</a>&#8221;. However, I wouldn&#8217;t expect compression ratios as strong as, say, Vertica&#8217;s, even in scenarios where Vertica was limited to dictionary compression only.</em></p>
<p>This is the second time I&#8217;ve heard recently about token compression being done one small block or page at a time (Sybase&#8217;s options for page size are 2/4/8/16K). As I noted in connection with <a href="http://www.dbms2.com/2011/09/22/teradata-columnar-compression/">Teradata&#8217;s similar strategy</a>,</p>
<blockquote><p>One benefit versus having a more global dictionary is that, since you compress fewer items, compression tokens can each be shorter. (The length of a typical token is a lot like the log of the cardinality of the dictionary.) Another benefit is that smaller dictionaries are faster to search. The obvious offsetting drawback is that a larger and more global dictionary has the potential to compress various items that wind up being left uncompressed in this smaller-scale scheme.</p></blockquote>
<p>I could also have added:</p>
<ul>
<li>It is straightforward to do join operations on globally-tokenized data.</li>
<li>It is forbiddingly difficult to do joins on locally-tokenized data; you need to decompress it before joining.</li>
</ul>
<p>However, Sybase ASE does buffer data in compressed form, so it enjoys at least some benefits of <strong>in-memory compression.</strong></p>
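<p>To make the token-length point quoted above concrete, here is a minimal Python sketch. It assumes fixed-width tokens and a made-up 100-row page, and every name in it (<code>token_bits</code>, <code>tokenize_pages</code>, <code>tokenize_global</code>) is hypothetical; it illustrates the general trade-off and is not a description of how Sybase ASE or Teradata actually lay out compressed pages.</p>
<pre>
```python
def token_bits(cardinality):
    """Bits needed for a fixed-width token that indexes `cardinality` distinct values."""
    if cardinality == 0:
        raise ValueError("an empty dictionary has no tokens")
    # (n - 1).bit_length() is ceil(log2(n)) for n of 2 or more; use at least 1 bit.
    return max(1, (cardinality - 1).bit_length())


def tokenize_pages(values, rows_per_page):
    """One dictionary per page: return (total token bits, total dictionary entries)."""
    total_bits = 0
    total_entries = 0
    for start in range(0, len(values), rows_per_page):
        page = values[start:start + rows_per_page]
        distinct = len(set(page))
        total_bits += len(page) * token_bits(distinct)
        total_entries += distinct
    return total_bits, total_entries


def tokenize_global(values):
    """One dictionary for the whole column: return (total token bits, dictionary entries)."""
    distinct = len(set(values))
    return len(values) * token_bits(distinct), distinct


# Hypothetical 10,000-row columns, each with 1,000 distinct values overall.
# Clustered: every 100-row page happens to contain only 10 distinct values.
clustered = [page * 10 + (row % 10) for page in range(100) for row in range(100)]
# Scattered: every 100-row page contains 100 distinct values.
scattered = [i % 1000 for i in range(10000)]

for name, column in (("clustered", clustered), ("scattered", scattered)):
    print(name, "page-level:", tokenize_pages(column, 100), "global:", tokenize_global(column))
```
</pre>
<p>Under these assumptions, the clustered column needs 4-bit tokens per page versus 10-bit tokens globally, with 1,000 dictionary entries stored either way. The scattered column still gets shorter tokens per page (7 bits), but its page dictionaries hold 10,000 entries in total, as many as there are rows, which is the kind of case where a more global dictionary wins.</p>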
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/10/13/compression-in-sybase-ase-15-7/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Virtual data marts in Sybase IQ</title>
		<link>http://www.dbms2.com/2011/08/26/virtual-data-marts-in-sybase-iq/</link>
		<comments>http://www.dbms2.com/2011/08/26/virtual-data-marts-in-sybase-iq/#comments</comments>
		<pubDate>Sat, 27 Aug 2011 04:11:46 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Sybase]]></category>
		<category><![CDATA[Theory and architecture]]></category>
		<category><![CDATA[Workload management]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5142</guid>
		<description><![CDATA[I made a few remarks about Sybase IQ 15.3 when it became generally available in July. Now that I&#8217;ve had a current briefing, I&#8217;ll make a few more. The key enhancement in Sybase IQ 15.3 is distributed query &#8212; what others might call parallel query &#8212; aka PlexQ. A Sybase IQ query can now be [...]]]></description>
			<content:encoded><![CDATA[<p>I made a <a href="../../../../../2011/07/07/sybase-iq-soundbites/">few remarks about Sybase IQ 15.3</a> when it became generally available in July. Now that I&#8217;ve had a current briefing, I&#8217;ll make a few more.</p>
<p>The key enhancement in Sybase IQ 15.3 is distributed query &#8212; what others might call parallel query &#8212; aka PlexQ. A Sybase IQ query can now be distributed among many nodes, all talking to the same SAN (Storage-Area Network). Any Sybase IQ node can take the responsibility of being the &#8220;leader&#8221; for that particular query.</p>
<p>In itself, this isn&#8217;t that impressive; all the same things could have been said about pre-Exadata Oracle.* But PlexQ goes somewhat further than just removing a bottleneck from Sybase IQ. Notably, Sybase has rolled out a <strong>virtual data mart</strong> capability. Highlights of the Sybase IQ virtual data mart story include:   <span id="more-5142"></span></p>
<ul>
<li>A virtual data mart takes minutes for a DBA to set up.</li>
<li>A virtual data mart has a number of &#8220;logical&#8221; servers and disk volumes.</li>
<li>A virtual data mart can include data from the core Sybase IQ database, plus additional data that might not have passed data warehouse bureaucratic muster. (Perhaps even more than <a href="../../../../../2010/10/06/ebay-followup-greenplum-out-teradata-10-petabytes-hadoop-has-some-value-and-more/">Teradata</a>, Sybase sees this as being the primary virtual data mart use case.)</li>
<li>Sybase IQ virtual data marts seem to be the mechanism for certain aspects of workload management. For example, they seem to be the only way to extend workload management to <a href="../../../../../2010/05/23/sybase-iq-15/">Sybase IQ&#8217;s in-database analytics</a>.</li>
</ul>
<p><em>*Of course, as a robust columnar DBMS, Sybase IQ lacks the fatal data warehousing drawbacks of pre-Exadata Oracle: I/O limitations, and the unnatural acts of database administration they induce.</em></p>
<p>Sybase is also proud of the elasticity of its new architecture, but seems no more able than I to come up with a use case in which anybody would much care.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/08/26/virtual-data-marts-in-sybase-iq/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Sybase IQ soundbites</title>
		<link>http://www.dbms2.com/2011/07/07/sybase-iq-soundbites/</link>
		<comments>http://www.dbms2.com/2011/07/07/sybase-iq-soundbites/#comments</comments>
		<pubDate>Thu, 07 Jul 2011 16:27:28 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Sybase]]></category>
		<category><![CDATA[Theory and architecture]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4933</guid>
		<description><![CDATA[Sybase made a total hash of the timing of this week&#8217;s press release. I got annoyed after they promised to inform me of the new embargo time, then broke the promise. Other people got annoyed earlier than that. So be it. Below is the draft of a post I was holding, with brackets added around [...]]]></description>
			<content:encoded><![CDATA[<p><em>Sybase made a total hash of the timing of this week&#8217;s press release. I got annoyed after they promised to inform me of the new embargo time, then broke the promise. Other people got annoyed earlier than that. </em></p>
<p><em>So be it. Below is the draft of a post I was holding, with brackets added around one word that is no longer accurate.</em></p>
<p>I don&#8217;t write enough about Sybase IQ. That said, I offered a couple of quotes to a reporter [yesterday] in connection with the general availability of Sybase IQ 15.3. Lightly edited, they go:</p>
<ul>
<li>&#8220;Shared-everything MPP&#8221; isn&#8217;t a total contradiction in terms. It&#8217;s great for adding in concurrent users. And there&#8217;s little doubt that Sybase IQ can support robust access to databases 10s of terabytes in size.</li>
<li>As I first noted a couple of years ago, <a href="../../../../../2009/06/08/the-future-of-data-marts/">virtual data marts are a good idea</a>. Too few vendors are making it easy to spin them out. They let departments start doing analytics very quickly, yet allow IT to keep partial control.</li>
</ul>
<p>Beyond that, I should note:</p>
<ul>
<li>Sybase IQ is the classic choice for what I call <a href="http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/">traditional data marts</a>.</li>
<li>Sybase IQ is a leader in <a href="http://www.dbms2.com/2011/06/20/temporal-data-time-series-and-imprecise-predicates/">temporal functionality</a>, which is not coincidental to its presence in the financial services market.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/07/sybase-iq-soundbites/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Eight kinds of analytic database (Part 1)</title>
		<link>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/</link>
		<comments>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/#comments</comments>
		<pubDate>Tue, 05 Jul 2011 08:17:44 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Benchmarks and POCs]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Buying processes]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[Infobright]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MOLAP]]></category>
		<category><![CDATA[Microsoft and SQL*Server]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[ParAccel]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Pricing]]></category>
		<category><![CDATA[QlikTech and QlikView]]></category>
		<category><![CDATA[SAND Technology]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Sybase]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[Workload management]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4868</guid>
		<description><![CDATA[Analytic data management technology has blossomed, leading to many questions along the lines of &#8220;So which products should I use for which category of problem?&#8221; The old EDW/data mart dichotomy is hopelessly outdated for that purpose, and adding a third category for &#8220;big data&#8221; is little help. Let&#8217;s try eight categories instead. While no categorization [...]]]></description>
			<content:encoded><![CDATA[<p>Analytic data management technology has blossomed, leading to many questions along the lines of &#8220;So which products should I use for which category of problem?&#8221; The old EDW/data mart dichotomy is hopelessly outdated for that purpose, and adding a third category for &#8220;big data&#8221; is little help.</p>
<p>Let&#8217;s try eight categories instead. While <a href="http://www.strategicmessaging.com/no-market-categorization-is-ever-precise/2011/03/01/">no categorization is ever perfect</a>, these each have at least some degree of technical homogeneity. Figuring out which types of analytic database you have or need &#8212; and in most cases you&#8217;ll need several &#8212; is a great early step in your analytic technology planning.  <span id="more-4868"></span></p>
<p><strong><em>Enterprise data warehouse</em></strong> (Full or partial)</p>
<ul>
<li><em>Kinds of data likely to be included:</em> All, but especially operational</li>
<li><em>Likely use styles:</em> All</li>
<li><em>Canonical example:</em> Central EDW for a big enterprise</li>
<li><em>Stresses:</em> Concurrency, reliability, workload management</li>
</ul>
<p>The enterprise data warehouse (EDW) ideal says that you copy all your data into one place, and drive all decision-making from there. <a href="../../../../../2011/06/21/its-official-the-grand-central-edw-will-never-happen/">Full EDWs are pipedreams</a>. Still, a partial EDW makes sense for most large enterprises, and many indeed already have one. The first product lines to consider for classical EDWs are Teradata, DB2, Exadata, and maybe Microsoft SQL Server, especially if you&#8217;re going to stress concurrency and/or operational use cases.</p>
<p><strong><em>Traditional data mart</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All</li>
<li><em>Likely use styles:</em> Business intelligence, budgeting/consolidation, investigative</li>
<li><em>Examples:</em> Reporting servers, planning/consolidation servers, anything MOLAP, etc.</li>
<li><em>Stresses:</em> Performance, concurrency, TCO</li>
</ul>
<p>Whether or not you have something like an enterprise data warehouse, it&#8217;s common to have lighter-weight data marts as well. A traditional data mart might drive reports and dashboards. Or it might be specialized for budgeting, planning, and/or consolidation.  Some <a href="../../../../../2011/03/03/investigative-analytics/">investigative analytics</a> may be in the mix as well.</p>
<p>Any DBMS that can support an EDW can also support a data mart, but it may not be the most cost-effective way to do so. Columnar DBMS might have more attractive performance and TCO (Total Cost of Ownership); the same goes for Netezza. Some of them &#8212; e.g. Sybase IQ and <a href="../../../../../2011/06/20/vertica-release-5/">Vertica</a> &#8212; have excellent track records in concurrent usage as well. <a href="../../../../../2011/05/29/when-to-use-relational-database-management-system/">Ted Codd</a> pushed what amounts to MOLAP (Multidimensional OnLine Analytic Processing) systems for these use cases. But relational DBMS commonly do a better job, which is one reason most major MOLAP products have wound up at RDBMS companies.</p>
<p><strong><em>Investigative data mart &#8212; agile</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All, especially customer-centric</li>
<li><em>Likely use styles</em>: Investigative</li>
<li><em>Canonical example:</em> A few analysts getting a few TB to examine</li>
<li><em>Stresses:</em> Ease of setup/load, ease of admin, price/performance</li>
</ul>
<p>Besides the traditional data mart, there are at least two other kinds. Both are focused on investigative analytics, but they&#8217;re differentiated by database size.</p>
<p>If you have just a few analysts,* looking at no more than a few terabytes of data (perhaps even just some gigabytes) &#8212; and if that data is &#8220;single-subject&#8221; and fairly homogeneous &#8212; your watchwords should be &#8220;cheap&#8221;, &#8220;easy&#8221;, and &#8220;fast&#8221;. You don&#8217;t need to invest in much hardware, in expensive software, or in much administrative effort (the analysts can be their own DBAs), nor should you endure much set-up time. Just grab a product, grab some data, and start running queries (or extracts into the statistical tool of your choice).</p>
<p><em>*If you have dozens or even hundreds of analysts hitting the same database, you&#8217;re probably back to the more concurrency-oriented scenarios outlined above.</em></p>
<p>Infobright is often cost-effective among columnar analytic DBMS. Other vendors might cut you a price break as well. If you have multiple terabytes of data, don&#8217;t rule out Netezza&#8217;s lowest-end products (even if they&#8217;d really rather sell you something bigger). Or, if you&#8217;re in the sub-terabyte range, maybe you can get by with an in-memory BI tool such as QlikView, and not do anything special on the DBMS side at all.</p>
<p><strong><em>Investigative data mart &#8212; big</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All, especially customer-centric, logs, financial trade, scientific</li>
<li><em>Likely use styles</em>: Investigative</li>
<li><em>Canonical example:</em> Single-subject 20 TB &#8211; 20 PB relational database</li>
<li><em>Stresses:</em> Performance, scale-out, analytic functionality</li>
</ul>
<p>But if you&#8217;re looking at tens of terabytes of relational data, or even more, you really do have a &#8220;big data&#8221; problem. Performance and scalability are major challenges, usually best addressed by MPP (Massively Parallel Processing) systems, such as Netezza, Vertica, Aster Data, ParAccel, Teradata, or Greenplum. Performance POCs (Proofs Of Concept) are a big part of the buying process. Vendor price negotiations are crucial too.</p>
<p><em>Actually, in the low tens of terabytes you might be able to get away with a shared-disk system that has excellent compression &#8212; e.g., columnar products like Sybase IQ, Infobright, or SAND, rather than just Vertica and ParAccel.</em></p>
<p>Assuming you have affordable, scalable query performance, the competitive differentiator can switch to additional analytic functionality. Aster, Netezza, ParAccel, Vertica, and Greenplum either offer full <a href="../../../../../2011/02/24/analytic-platforms/">analytic platforms</a>, or seem to be on the path to doing so. Teradata, which now owns Aster Data, offers substantial built-in analytic capability in its traditional products as well, and the same goes for Sybase IQ.</p>
<p><em>Continued in <a href="http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/">Part 2</a>, where we cover some of the more difficult use cases.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Temporal data, time series, and imprecise predicates</title>
		<link>http://www.dbms2.com/2011/06/20/temporal-data-time-series-and-imprecise-predicates/</link>
		<comments>http://www.dbms2.com/2011/06/20/temporal-data-time-series-and-imprecise-predicates/#comments</comments>
		<pubDate>Mon, 20 Jun 2011 06:11:43 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Data types]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Sybase]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Theory and architecture]]></category>
		<category><![CDATA[Vertica Systems]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4786</guid>
		<description><![CDATA[I&#8217;ve been confused about temporal data management for a while, because there are several different things going on. Date arithmetic. This of course has been around for a very long &#8212; er, for a very long time. Time-series-aware compression. This has been around for quite a while too. &#8220;Time travel&#8221;/snapshotting &#8212; preserving the state of [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been confused about temporal data management for a while, because there are several different things going on.</p>
<ul>
<li><strong>Date arithmetic.</strong> This of course has been around for a very long &#8212; er, for a very long time.</li>
<li><strong>Time-series-aware compression.</strong> This has been around for quite a while too.</li>
<li><strong>&#8220;Time travel&#8221;/snapshotting</strong> &#8212; preserving the state of the database at previous points in time. This is a matter of exposing (and not throwing away) the information you capture via MVCC (Multi-Version Concurrency Control) and/or append-only updates (as opposed to update-in-place). Those update strategies are increasingly popular for pretty much anything except update-intensive OLTP (OnLine Transaction Processing) DBMS, so time-travel/snapshotting is an achievable feature for most vendors.</li>
<li><strong>Bitemporal data access.</strong> This occurs when a fact has both a transaction timestamp and a separate validity duration. <a href="http://en.wikipedia.org/wiki/Temporal_database">A Wikipedia article</a> seems to cover the subject pretty well, and I touched on <a href="http://www.dbms2.com/2009/08/02/teradata-13-focuses-on-advanced-analytic-performance/">Teradata&#8217;s bitemporal plans</a> back in 2009.</li>
<li><strong>Time series SQL extensions.</strong> <a href="http://www.dbms2.com/2011/06/20/vertica-as-an-analytic-platform/">Vertica</a> explained its version of these to me a few days ago. I imagine Sybase IQ and other serious financial-trading market players have similar features.</li>
</ul>
<p>In essence, the point of time series/event series SQL functionality is to<strong> do SQL against incomplete, imprecise, or derived data.*</strong> <span id="more-4786"></span>For example, suppose in one time series events happen at times 3.00, 3.01, 3.03, and 3.05; in another time series events happen at times 3.00, 3.02, 3.03, 3.04, and 3.05; and you want to join the time series together. Then you can do an <strong>event series join</strong> &#8212; i.e., you can join on each of the times 3.00, 3.01, 3.02, 3.03, 3.04, and 3.05, using interpolated values to check WHERE conditions. Vertica says that the only interpolation methods anybody ever wants are &#8220;first value in the interval,&#8221; &#8220;last value in the interval,&#8221; and &#8220;linear average of the endpoint values&#8221; (I forget whether that&#8217;s weighted by time-distance from the endpoints, or is a simple arithmetic mean).</p>
<p><em>*This is a </em>limited <em>counterexample to my dictum that <a href="../../../../../2011/06/19/investigative-analytics-derived-data/">you should explicitly store derived data because it&#8217;s too much trouble to keep re-deriving it on the fly</a>.</em></p>
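<p>The event series join described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Vertica&#8217;s implementation: the function names and the sample values are invented, the timestamps are the ones from the example, and the &#8220;linear&#8221; method assumes the time-weighted reading of linear interpolation.</p>

```python
from bisect import bisect_right

def interpolate(series, t, method="last"):
    """Value of a sorted [(time, value), ...] series at time t.

    "last" carries the most recent observation forward; "linear" weights
    the two surrounding observations by time distance (an assumption).
    """
    times = [ts for ts, _ in series]
    i = bisect_right(times, t)          # number of observations at or before t
    if i == 0:
        return None                     # nothing observed yet
    t0, v0 = series[i - 1]
    if method == "last" or t0 == t or i == len(series):
        return v0
    t1, v1 = series[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

def event_series_join(a, b, method="last"):
    """Join two series on the union of their timestamps."""
    all_times = sorted({t for t, _ in a} | {t for t, _ in b})
    return [(t, interpolate(a, t, method), interpolate(b, t, method))
            for t in all_times]

# Timestamps from the example above; the values are made up.
series_a = [(3.00, 10.0), (3.01, 11.0), (3.03, 13.0), (3.05, 15.0)]
series_b = [(3.00, 100.0), (3.02, 120.0), (3.03, 130.0), (3.04, 140.0), (3.05, 150.0)]

joined = event_series_join(series_a, series_b, method="last")
for row in joined:
    print(row)
```

<p>Each output row carries a value from both series, observed or interpolated, so a WHERE condition can be evaluated at every one of the six timestamps.</p>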
<p>Also cool is the &#8220;CONDITIONAL_TRUE_EVENT&#8221; syntax Vertica has had since Release 4.0, which generalized SQL:1999 windowing; you can now look at all the rows that meet a specific criterion &#8212; via an arbitrary expression &#8212; rather than just being restricted to a row count or strict time duration. Vertica says it&#8217;s gone further in the direction of event series pattern matching in Vertica 5.0; I didn&#8217;t grasp the details, but it sounded philosophically akin to <a href="../../../../../2009/02/10/aster-data-npath/">Aster Data&#8217;s nPath</a>, albeit without the arbitrary-language procedural extensibility.</p>
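<p>Expression-driven windowing of this kind can be sketched in Python as well. This shows the general idea only: the helper names, the sample ticks, and the threshold are hypothetical, and the rule that a new window opens on each row where the expression is true (counting from 0) is an assumption of the sketch rather than something stated above.</p>

```python
from itertools import groupby

def window_ids(rows, predicate):
    """Yield (window_id, row); the id goes up by one on each row where
    predicate(row) is true. Starting at 0 is an assumption of this sketch."""
    current = 0
    for row in rows:
        if predicate(row):
            current += 1
        yield current, row

def is_spike(row):
    """Hypothetical criterion: a bid above 10.2 opens a new window."""
    _, bid = row
    return bid > 10.2

# Made-up (time, bid) ticks, already ordered by time.
ticks = [(1, 10.0), (2, 10.3), (3, 10.1), (4, 10.1), (5, 10.5), (6, 10.0)]

# Collect the rows that fall into each window.
windows = {
    wid: [row for _, row in group]
    for wid, group in groupby(window_ids(ticks, is_spike), key=lambda pair: pair[0])
}
for wid, rows in windows.items():
    print(wid, rows)
```

<p>The windows here vary in length because they are bounded by where the criterion fires, not by a fixed row count or time span.</p>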
<p>Finally, Vertica also gave me an imprecise-SQL example that has little to do with time series or other event series. Vertica has a concept of &#8220;range join,&#8221; implemented so that telecom firms can save space by storing partial IP addresses. I&#8217;ve noted before that while we should retain all human-generated data, <a href="../../../../../2010/12/30/examples-and-definition-of-machine-generated-data/">it will never be practical to retain all machine-generated data</a> (because its volume will keep going up based on the same technological factors that keep storage cost per unit volume going down). This sounds like one interesting (if specialized) approach to storing machine-generated data in summarized form.</p>
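<p>For a rough picture of what a range join does, here is a Python sketch that matches addresses against a table of address blocks instead of a table of individual addresses. The block table, labels, and function name are hypothetical, and nothing here reflects how Vertica actually implements the feature.</p>

```python
from bisect import bisect_right
from ipaddress import ip_address, ip_network

# Hypothetical lookup table: one row per address block, not per address.
blocks = sorted(
    (int(net.network_address), int(net.broadcast_address), label)
    for net, label in [
        (ip_network("10.0.0.0/8"), "internal"),
        (ip_network("192.0.2.0/24"), "documentation"),
        (ip_network("198.51.100.0/24"), "test-net-2"),
    ]
)
starts = [lo for lo, _, _ in blocks]

def range_join(addresses):
    """Pair each address with the label of the block that contains it."""
    result = []
    for addr in addresses:
        n = int(ip_address(addr))
        i = bisect_right(starts, n) - 1      # last block starting at or before n
        label = None
        if i >= 0:
            _, hi, candidate = blocks[i]
            if n <= hi:                      # inside the block's upper bound
                label = candidate
        result.append((addr, label))
    return result

matches = range_join(["10.1.2.3", "192.0.2.77", "203.0.113.5"])
for row in matches:
    print(row)
```

<p>The join predicate is &#8220;falls between low and high&#8221; rather than equality, which is what lets the stored side stay small.</p>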
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/06/20/temporal-data-time-series-and-imprecise-predicates/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Columnar DBMS vendor customer metrics</title>
		<link>http://www.dbms2.com/2011/06/20/columnar-dbms-vendor-customer-metrics/</link>
		<comments>http://www.dbms2.com/2011/06/20/columnar-dbms-vendor-customer-metrics/#comments</comments>
		<pubDate>Mon, 20 Jun 2011 05:41:54 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Games and virtual worlds]]></category>
		<category><![CDATA[Infobright]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[ParAccel]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[SAND Technology]]></category>
		<category><![CDATA[Sybase]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4742</guid>
		<description><![CDATA[Last April, I asked some columnar DBMS vendors to share customer metrics. They answered, but it took until now to iron out a couple of details. Overall, the answers are pretty impressive.  Sybase said that Sybase IQ had &#62; 2000 direct customers and &#62;500 indirect customers (i.e., end customers of OEMs). That&#8217;s counting by customers; [...]]]></description>
			<content:encoded><![CDATA[<p>Last April, I asked some columnar DBMS vendors to share customer metrics. They answered, but it took until now to iron out a couple of details. Overall, the answers are pretty impressive.  <span id="more-4742"></span></p>
<p>Sybase said that <strong>Sybase IQ</strong> had <strong>&gt;2000 direct customers</strong> and <strong>&gt;500 indirect customers</strong> (i.e., end customers of OEMs). That&#8217;s counting by customers; I know from prior discussions that Sybase IQ is running at close to two installations per customer. I also believe that Sybase counts different divisions of the same large enterprise as separate customers.</p>
<p><strong>Vertica</strong> cited a figure of <strong>500 customers</strong> as of April (end Q1?), which is close to <strong>600</strong> now, about <strong>40% or a little more direct.</strong> The difference between this and a <a href="http://www.dbms2.com/2011/02/14/now-we-know-why-vertica-has-been-so-weirdly-evasive/">2010 year-end figure of 328</a> is not only new sales, but also slow reporting by OEMs.  One cool figure &#8212; a single OEM reported 82 end sales in a single (quarterly?) report. And a number of those direct customers are substantial; Vertica&#8217;s <a href="http://www.vertica.com/customers/">customer logo</a> page features lots of telcos, lots of internet companies, and the national operation of Blue Cross/Blue Shield.</p>
<p><em>Pay no attention to small inconsistencies in the number of Vertica direct customers (250 at year-end, no more than that now); Colin Mahony just estimates these numbers for me from memory, and minor inaccuracies are quite excusable.</em></p>
<p>Even cooler &#8212; <strong>Vertica </strong>reports <strong>7 customers with a petabyte or more of user data each.</strong> About 5 of the 7 are obvious-suspect big-name firms; but unsurprisingly, those big names are NDA. I did secure permission to say that there are 2 telecom companies, a mobile gaming vendor, another internet company, and 3 financial services outfits of various kinds.</p>
<p><strong>SAND Technology </strong>reported <strong>&gt;600 total customers,</strong> including<strong> &gt;100 direct. </strong>Since SAND has been around since the 1990s, those aren&#8217;t great average annual figures, but they&#8217;re probably more than many people (including me) thought.</p>
<p><strong>Infobright</strong> reported around <strong>200 total paying customers, 130 direct.</strong> There are surely a lot more users of open source Infobright, but precise numbers are of course hard to come by.</p>
<p>If I asked <strong>ParAccel</strong> in the April go-round, I&#8217;ve misplaced their answer, but back in October the figure was &gt;30 customers, 2 of them over 100 terabytes. I&#8217;ve seen published figures of 40+ for ParAccel since.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/06/20/columnar-dbms-vendor-customer-metrics/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Attensity update</title>
		<link>http://www.dbms2.com/2011/04/14/attensity-update/</link>
		<comments>http://www.dbms2.com/2011/04/14/attensity-update/#comments</comments>
		<pubDate>Thu, 14 Apr 2011 12:07:11 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[Sybase]]></category>
		<category><![CDATA[Vertica Systems]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4295</guid>
		<description><![CDATA[I talked with Michelle de Haaff and Ian Hersey of Attensity back in February. We covered a lot of ground, so let&#8217;s start with a very high-level view. Two years ago, Attensity merged with two other companies in somewhat related businesses, thus expanding 4X or so in size. Due to the merger, Attensity now has [...]]]></description>
			<content:encoded><![CDATA[<p>I talked with Michelle de Haaff and Ian Hersey of Attensity back in February. We covered a lot of ground, so let&#8217;s start with a very high-level view.</p>
<ul>
<li>Two years ago, <a href="http://www.texttechnologies.com/2009/04/20/the-new-attensity-deal-overview/">Attensity merged with two other companies in somewhat related businesses</a>, thus expanding 4X or so in size.</li>
<li>Due to the merger, Attensity now has two core lines of business:
<ul>
<li>Text analytics.</li>
<li>Driving actions, such as call center or social media response, based on text analytics.</li>
</ul>
</li>
<li>The combined Attensity is part American, part German.</li>
<li>Attensity&#8217;s German part compels it to do some public financial reporting. Attensity will do $50-60 million in 2011 revenue.</li>
<li>Attensity crunches text in 17 languages. English is preeminent. #2 is &#8212; you guessed it! &#8212; German.</li>
<li>A big part of Attensity&#8217;s business (or at least of its value proposition) is analyzing the text in social media. Attensity boasts coverage of 75 million social media sources, such as blogs, forums, or review sites.</li>
</ul>
<p>The four most interesting technical points were probably:</p>
<ul>
<li><strong>Attensity has changed how it does </strong><a href="http://www.texttechnologies.com/2007/10/05/david-bean-of-attensity-explains-sentiment-and-other-qualifiers/"><strong>exhaustive extraction</strong></a><strong>.</strong> I&#8217;m having some trouble writing that part up, so for now I&#8217;ll just refer you to <a href="http://www.attensity.com/technology/semantic-server/exhaustive-extraction/">Attensity&#8217;s own description</a> of the new way of doing things.<em> </em></li>
<li><strong>Attensity has development work underway meant to address some of the problems in </strong><a href="http://www.texttechnologies.com/2010/12/01/state-of-the-art-text-analytics-mining-applications/"><strong>text analytics/other analytics integration</strong></a><strong>.</strong> I don&#8217;t feel I got enough detail to want to talk about that yet.</li>
<li><strong>Attensity runs its own data centers, with approximately 60 Hadoop/HBase nodes and 30 nodes of Apache Solr</strong> (open source text search). More on that below.</li>
<li><strong>Attensity now OEMs Vertica.</strong> More on that below too.</li>
</ul>
<p>Some more specific notes include:  <span id="more-4295"></span></p>
<ul>
<li>Attensity has long had customers who use text analytics as an input into churn analysis, for example <a href="http://www.attensity.com/2010/08/21/charles-schwab/">Charles Schwab</a>.</li>
<li>At least one customer, who may or may not wish to be named, uses Attensity technology to help <a href="../../../../../2011/01/11/the-technology-of-privacy-threats/">de-anonymize</a> social media posters. (I didn&#8217;t ask how that worked, actually.)</li>
<li>Attensity&#8217;s founding CTO David Bean has been gone for a while.</li>
<li>Social media analyzers generally require less sophisticated analytics than Attensity&#8217;s older kinds of customers.</li>
<li>Social media has, in part, a vocabulary all its own. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </li>
</ul>
<p><strong>Attensity and relational DBMS</strong></p>
<p>Notes on Attensity&#8217;s choice of DBMS to OEM include:</p>
<ul>
<li>Attensity uses Hadoop/HBase itself, but didn&#8217;t consider it realistic to try to persuade OEM customers to go that way.</li>
<li>I get the impression that Attensity&#8217;s two finalists were Vertica and Sybase IQ.</li>
<li>Attensity seems to have considered only query, and not more general <a href="../../../../../2011/02/24/analytic-platforms/">analytic platform</a> capabilities, which makes sense given that the evaluation was conducted (starting) in 2009.</li>
<li>One reason Vertica won was that it required very little tuning.</li>
<li>Another reason Vertica won was true MPP scale-out, notwithstanding that the largest known installation is capable of running on two nodes (although Attensity recommended that the customer get four just to be on the safe side).</li>
<li>Sybase IQ&#8217;s load speed was even better than Vertica&#8217;s.</li>
<li>Database max-size-to-date metrics include:
<ul>
<li>Under 1 terabyte.</li>
<li>50 million documents (not rows), growing by 1 million documents per day.</li>
<li>Several hundred million sentences (I guess the documents are short, but it makes sense that they would be).</li>
<li>Several billion rows.</li>
</ul>
</li>
</ul>
<p>It seems there are two parts to the Attensity schema. The raw output of &#8220;exhaustive extraction&#8221; sounds as if it has rather narrow rows. But Attensity then builds something more star-schema-like to feed into BI tools. Perhaps the latter is the reason for preferring columnar DBMS. There don&#8217;t seem to be a lot of auxiliary tables; the only ones Ian cited were:</p>
<ul>
<li>Category tables that hold the ontology, at up to a couple thousand rows</li>
<li>Tables of terms</li>
<li>Structured fields that provide metadata for the triples</li>
</ul>
<p>Previous Attensity database targets (partner, not OEM) included Teradata, SQL Server, Oracle, and MySQL. Hibernate layers were in the mix somewhere too. SQL Server actually had the best performance. I don&#8217;t think that&#8217;s counting a more recent Sybase IQ partnership, which only racked up a couple of sales.</p>
<p><strong>Attensity, Hadoop, and other non-relational technologies</strong></p>
<p>But that&#8217;s OEM. Attensity runs its own data centers, with approximately 60 Hadoop/HBase nodes and 30 nodes of Apache Solr (open source text search).* One reason for moving out of Amazon EC2 was that Solr cried out for solid-state drives; another was just cost.</p>
<p><em>*But those are just rough figures, from Ian&#8217;s memory.</em></p>
<p>Attensity uses HBase to store full-text documents. However, it doesn&#8217;t seem that this is a classic low-latency update HBase use case; Attensity reports doing 3 loads a day, 50 gigabytes of documents total. Apparently that works out to 1 billion documents/month; I gather Attensity just keeps them for 6 months. HBase has been nicely stable for Attensity.</p>
<p>Attensity uses Solr to build distributed search indexes. Solr has not been nicely stable.</p>
<p>What Attensity does in Hadoop seems to be rather simple NLP (Natural Language Processing), plus things one might do in a relational DBMS instead. Examples include:</p>
<ul>
<li>Named entity extraction.</li>
<li>Scoring for sentiment.</li>
<li>Influence scores, in whatever ways Attensity can calculate them.</li>
</ul>
<p>There surely also is some basic preprocessing, ingesting text (and document metadata) in various forms and normalizing it into a more standard format. Some real-time ingesting is done outside of Hadoop, in more of a queuing system, the most obvious example of that being the Twitter firehose. Ian suggested that in the future this system will get more uses, in the form of a <a href="http://www.texttechnologies.com/2006/07/27/uima-data-point/">UIMA</a>-like pipeline.</p>
<p>I further get the impression that Attensity uses Hadoop to do on a SaaS (Software As A Service) basis what its customers do in Vertica. The old idea that <a href="http://www.texttechnologies.com/2008/06/16/attensity-update-updated/">Attensity provides hosted services for about half its customers</a> still seems to apply, at least on the new-customer front. However, I&#8217;m not sure exactly which product lines Attensity was referring to when they said that.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/04/14/attensity-update/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
	</channel>
</rss>

