<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; Application areas</title>
	<atom:link href="http://www.dbms2.com/category/applications/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Tue, 07 Feb 2012 06:49:30 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>Sumo Logic and UIs for text-oriented data</title>
		<link>http://www.dbms2.com/2012/02/06/sumo-logic-and-uis-for-text-oriented-data/</link>
		<comments>http://www.dbms2.com/2012/02/06/sumo-logic-and-uis-for-text-oriented-data/#comments</comments>
		<pubDate>Mon, 06 Feb 2012 13:27:06 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5897</guid>
		<description><![CDATA[I talked with the Sumo Logic folks for an hour Thursday. Highlights included: Sumo Logic does SaaS (Software as a Service) log management. Sumo Logic is text indexing/Lucene-based. Thus, it is reasonable to think of Sumo Logic as &#8220;Splunk-like&#8221;. (However, Sumo Logic seems to have a stricter security/trouble-shooting orientation than Splunk, which is trying to [...]]]></description>
			<content:encoded><![CDATA[<p>I talked with the Sumo Logic folks for an hour Thursday. Highlights included:</p>
<ul>
<li>Sumo Logic does SaaS (Software as a Service) log management.</li>
<li>Sumo Logic is text indexing/Lucene-based. Thus, it is reasonable to think of Sumo Logic as &#8220;Splunk-like&#8221;. (However, Sumo Logic seems to have a stricter security/trouble-shooting orientation than Splunk, which is trying to <a href="../../../../../2012/01/10/splunk-update/">branch out</a>.)</li>
<li>Sumo Logic has hacked Lucene for faster indexing, and says 10-30 second latencies are typical.</li>
<li>Sumo Logic&#8217;s main differentiation is <strong>automated classification of events. </strong></li>
<li>There&#8217;s some kind of streaming engine in the mix, to update counters and drive alerts.</li>
<li>Sumo Logic has around 30 &#8220;customers,&#8221; free (mainly) or paying (around 5) as the case may be.</li>
<li>A truly typical Sumo Logic customer has single to low double digits of gigabytes of log data per day. However, Sumo Logic seems highly confident in its ability to handle a terabyte per customer per day, give or take a factor of 2.</li>
<li>When I asked about the implications of shipping that much data to a remote data center, Sumo Logic observed that log data compresses really well.</li>
<li>Sumo Logic recently raised a bunch of venture capital.</li>
<li>Sumo Logic&#8217;s founders are out of ArcSight, a log management company HP paid a bunch of money for.</li>
<li>Sumo Logic coined a marketing term &#8220;LogReduce&#8221;, but it has nothing to do with &#8220;MapReduce&#8221;. Sumo Logic seems to find this amusing.</li>
</ul>
<p>What interests me about Sumo Logic is that automated classification story. I thought I heard Sumo Logic say:<span id="more-5897"></span></p>
<ul>
<li>It&#8217;s largely unsupervised machine learning.</li>
<li>It&#8217;s specific to a particular user/data set.</li>
<li>It can be up and running and classifying things effectively almost instantly (i.e., on seconds&#8217; or minutes&#8217; worth of data).</li>
<li>It&#8217;s informed by what different users tag as false positives. (Or maybe that is planned for future versions.)</li>
</ul>
<p><em>I have a little trouble seeing how all those points fit exactly together, so perhaps I got some details wrong.</em></p>
<p>The payoff is that <strong>machine learning directly informs the Sumo Logic user interface</strong>. In particular, large numbers of events are bundled into a small number of categories, hopefully making it much easier for network operations types to scan the UI and pick out what&#8217;s important.</p>
<p>In general, the idea of machine-learning informing analytic UIs via some sort of classification is common in text-oriented technologies, notably in:</p>
<ul>
<li>Good ol&#8217; text search.</li>
<li>Text mining vendors&#8217; approaches to clustering hits on words or phrases that say substantially the same thing.</li>
</ul>
<p>But otherwise it seems kind of rare, if we stipulate that ad-serving/general internet personalization isn&#8217;t really an analytic UI &#8212; but I&#8217;d love to hear of any interesting examples I&#8217;ve overlooked.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/02/06/sumo-logic-and-uis-for-text-oriented-data/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Couchbase update</title>
		<link>http://www.dbms2.com/2012/02/01/couchbase-update/</link>
		<comments>http://www.dbms2.com/2012/02/01/couchbase-update/#comments</comments>
		<pubDate>Thu, 02 Feb 2012 04:00:24 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Basho and Riak]]></category>
		<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[CouchDB]]></category>
		<category><![CDATA[Couchbase]]></category>
		<category><![CDATA[DataStax]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[MongoDB and 10gen]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[Zynga]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5877</guid>
		<description><![CDATA[I checked in with James Phillips for a Couchbase update, and I understand better what&#8217;s going on. In particular: Give or take minor tweaks, what I wrote in my August, 2011 Couchbase updates still applies. Couchbase now and for the foreseeable future has one product line, called Couchbase. Couchbase 2.0, the first version of Couchbase [...]]]></description>
			<content:encoded><![CDATA[<p>I checked in with James Phillips for a Couchbase update, and I understand better what&#8217;s going on. In particular:</p>
<ul>
<li>Give or take minor tweaks, what I wrote in my <a href="../../../../../2011/08/13/couchbase-business-update/">August, 2011 Couchbase updates</a> still applies.</li>
<li>Couchbase now and for the foreseeable future has one product line, called Couchbase.</li>
<li>Couchbase 2.0, the first version of Couchbase (the product) to use CouchDB for persistence, has slipped &#8230;</li>
<li>&#8230; because more parts of CouchDB had to be rewritten for performance than Couchbase (the company) had hoped.</li>
<li>Think mid-year or so for the release of Couchbase 2.0, hopefully sooner.</li>
<li>In connection with the need to rewrite parts of CouchDB, Couchbase has:
<ul>
<li><a href="../../../../../2012/01/18/notes-from-the-couch-blogs/">Gotten out of the single-server CouchDB business</a>.</li>
<li>Donated its proprietary single-server CouchDB intellectual property to the Apache Foundation.</li>
</ul>
</li>
<li>The 150ish new customers in 2011 Couchbase brags about are real, subscription customers.</li>
<li>Couchbase has 60ish people, headed to &gt;100 over the next few months.</li>
</ul>
<p><span id="more-5877"></span><em>If you previously heard the brand names Couchbase Single or Couchbase Mobile, pay no further attention to them. Couchbase Single was CouchDB; Couchbase Mobile is part of Couchbase&#8217;s feature set.</em></p>
<p>The current product is Couchbase 1.8, which is a whole lot like what previously was called Membase. New features in Couchbase 1.8 (versus prior versions of Membase) were concentrated in client libraries/SDK (Software Development Kit). Not coincidentally, Couchbase has hired developer evangelists who are in charge of making Couchbase play nicely with various specific languages (e.g., C/C++).</p>
<p>Drilling down further into the CouchDB part of the story:</p>
<ul>
<li>Couchbase 2.0 will replace Couchbase 1.8/Membase&#8217;s SQLite back-end with CouchDB.</li>
<li>Parts of CouchDB that do things like read, write, or compact data have been rewritten from Erlang to C.</li>
<li>Couchbase still uses other Erlang parts of Apache CouchDB, and would be delighted if the community were to usefully enhance them.</li>
<li>Couchbase&#8217;s heavy contributions to development of open source CouchDB will, for the most part, continue.</li>
<li>CouchDB stuff donated to the Apache Foundation includes:
<ul>
<li>Documentation</li>
<li>Packaging</li>
<li>Performance enhancements</li>
</ul>
</li>
</ul>
<p>There&#8217;s at least one Couchbase user with &gt;1000 nodes (at a guess, <a href="../../../../../2011/09/05/zynga-linkedin-data-warehous/">Zynga</a>). More typical might be 20 nodes or fewer. This led me to wonder how much data one puts on a Couchbase node anyway. The answer turns out to vary widely, in that you want your working set to be in RAM, and whether that&#8217;s your entire database or just a slice of it depends on the nature of the application.</p>
<p>James echoed a trend I&#8217;ve heard elsewhere as well, in which products one thinks of as being internet-specific are also sold in a few cases to conventional enterprises for &#8212; you guessed it! &#8212; their internet operations. I also asked him about competition, and he asserted:</p>
<ul>
<li>MongoDB is the big competition. He believes Couchbase has an excellent win rate vs. 10gen for actual paying accounts.</li>
<li>DataStax/Cassandra wins over Couchbase only when multi-data-center capability is important. Naturally, multi-data-center capability is planned for Couchbase. (Indeed, that&#8217;s one of the benefits of swapping in CouchDB at the back end.)</li>
<li>Redis has &#8220;dropped off the radar&#8221;, presumably because there&#8217;s no particular persistence strategy for it.</li>
<li>Riak doesn&#8217;t show up much.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/02/01/couchbase-update/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Splunk update</title>
		<link>http://www.dbms2.com/2012/01/10/splunk-update/</link>
		<comments>http://www.dbms2.com/2012/01/10/splunk-update/#comments</comments>
		<pubDate>Tue, 10 Jan 2012 05:55:08 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Structured documents]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5791</guid>
		<description><![CDATA[Splunk is announcing the Splunk 4.3 point release. Before discussing it, let&#8217;s recall a few things about Splunk, starting with: Splunk is first and foremost an analytic DBMS &#8230; &#8230; used to manage logs and similar multistructured data. Splunk&#8217;s DML (Data Manipulation Language) is based on text search, not on SQL. Splunk has extended its [...]]]></description>
			<content:encoded><![CDATA[<p>Splunk is announcing the Splunk 4.3 point release. Before discussing it, let&#8217;s recall a few things about Splunk, starting with:</p>
<ul>
<li>Splunk is first and foremost an analytic DBMS &#8230;</li>
<li>&#8230; used to manage logs and similar multistructured data.</li>
<li>Splunk&#8217;s DML (Data Manipulation Language) is based on text search, not on SQL.</li>
<li>Splunk has extended its DML in natural ways (e.g., you can use it to do calculations and even some statistics).</li>
<li>Splunk bundles some (very) basic, Splunk-specific business intelligence capabilities.</li>
<li>The paradigmatic use of Splunk is to monitor IT operations in real time. However:
<ul>
<li>There also are plenty of non-real-time uses for Splunk.</li>
<li>Splunk is proudest of its growth in non-IT quasi-real-time uses, such as the marketing side of web operations.</li>
</ul>
</li>
</ul>
<p>As in any release, a lot of Splunk 4.3 is about &#8220;Oh, you didn&#8217;t have that before?&#8221; features and <a href="../../../../../2009/08/21/bottleneck-whack-a-mole/">Bottleneck Whack-A-Mole</a> performance speed-up. One performance enhancement is Bloom filters, which are a very hot topic these days. More important is a switch from Flash to HTML5, so as to accommodate mobile devices with less server-side rendering. Splunk reports that its users &#8212; especially the non-IT ones &#8212; really want to get Splunk information on tablet devices. While this somewhat contradicts <a href="../../../../../2012/01/04/some-issues-in-business-intelligence/">what I wrote a few days ago pooh-poohing mobile BI</a>, let me hasten to point out:</p>
<ul>
<li>Splunk is used for a lot of (quasi) real-time monitoring.</li>
<li>Splunk&#8217;s desktop user interfaces are, by BI standards, quite primitive.</li>
</ul>
<p>That&#8217;s pretty much the ideal scenario for mobile BI: Timeliness matters and prettiness doesn&#8217;t.</p>
<p><span id="more-5791"></span><em>Hmm. Maybe <a href="../../../../../2011/11/10/streambase-liveview-push-based-real-time-bi/">StreamBase LiveView</a> needs a mobile option as well &#8230;</em></p>
<p>Splunk&#8217;s basic use is to take the text string that is a log and make sense of it. But Splunk now also supports JSON structures. It does this via something called spath, which as you might guess from the name has XPath similarities. That probably deserved more discussion than we found the time to have.</p>
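<p>To make the general idea of path-based extraction from JSON log events concrete, here is a rough sketch. It is not Splunk&#8217;s spath syntax or implementation; the <code>get_path</code> helper, the dotted-path convention, and the sample log line are all hypothetical, chosen only for illustration.</p>

```python
import json


def get_path(doc, path):
    """Follow a dotted path such as 'request.path' through nested JSON.

    Hypothetical helper for illustration only (not Splunk's spath).
    Digit-only segments index into lists. Returns None when a step is missing.
    """
    current = doc
    for segment in path.split("."):
        if isinstance(current, dict):
            if segment not in current:
                return None
            current = current[segment]
        elif isinstance(current, list):
            if not segment.isdigit() or int(segment) >= len(current):
                return None
            current = current[int(segment)]
        else:
            # Reached a scalar but the path still has segments left.
            return None
    return current


# A made-up JSON log event.
log_line = '{"level": "error", "request": {"path": "/cart", "tags": ["checkout", "mobile"]}}'
event = json.loads(log_line)

page = get_path(event, "request.path")          # "/cart"
second_tag = get_path(event, "request.tags.1")  # "mobile"
absent = get_path(event, "request.missing")     # None
```

<p>The point of the sketch is only that a short path expression can pull one field out of a nested structure, much as XPath does for XML.</p>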
<p><em>By the way: If you&#8217;re interested in BI over XML, that&#8217;s what my former clients at Skytide were founded to do, before they pivoted a bit. I don&#8217;t think those capabilities have disappeared from the product</em>.</p>
<p><a href="http://www.monash.com/uploads/Splunk-4-3.pdf">Splunk has graciously allowed me to post a slide deck</a>. More stuff in there, including quotes from a customer &#8212; Expedia &#8212; that has 2700 Splunk users.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/01/10/splunk-update/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Big data terminology and positioning</title>
		<link>http://www.dbms2.com/2012/01/08/big-data-terminology-and-positioning/</link>
		<comments>http://www.dbms2.com/2012/01/08/big-data-terminology-and-positioning/#comments</comments>
		<pubDate>Mon, 09 Jan 2012 01:35:57 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5768</guid>
		<description><![CDATA[Recently, I observed that Big Data terminology is seriously broken. It is reasonable to reduce the subject to two quasi-dimensions: Bigness &#8212; Volume, Velocity, size Structure &#8212; Variety, Variability, Complexity given that High-velocity &#8220;big data&#8221; problems are usually high-volume as well.* Variety, variability, and complexity all relate to the simply-structured/poly-structured distinction. But the conflation should [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, I observed that <a href="../../../../../2011/09/11/big-data-has-jumped-the-shark/">Big Data terminology is seriously broken</a>. It is reasonable to reduce the subject to two quasi-dimensions:</p>
<ul>
<li><strong>Bigness</strong> &#8212; Volume, Velocity, Size</li>
<li><strong>Structure</strong> &#8212; Variety, Variability, Complexity</li>
</ul>
<p>given that</p>
<ul>
<li>High-velocity &#8220;big data&#8221; problems are usually high-volume as well.*</li>
<li>Variety, variability, and complexity all relate to the <a href="../../../../../2011/05/17/poly-structured-database/">simply-structured/poly-structured</a> distinction.</li>
</ul>
<p>But the conflation should stop there.</p>
<p><em>*Low-volume/high-velocity problems are commonly referred to as <a href="../2011/08/25/renaming-cep-or-not/">&#8220;event processing&#8221; and/or &#8220;streaming&#8221;</a>.</em></p>
<p>When people claim that bigness and structure are the same issue, they oversimplify into mush. So I think we need four pieces of terminology, reflective of a 2&#215;2 matrix of possibilities. For want of better alternatives, my suggestions are:</p>
<ul>
<li><strong>Relational big data</strong> is data of high volume that fits well into a relational DBMS.</li>
<li><strong>Multi-structured big data</strong> is data of high volume that doesn&#8217;t fit well into a relational DBMS. <em>Alternative: Poly-structured big data.</em></li>
<li><strong>Conventional relational data</strong> is data of not-so-high volume that fits well into a relational DBMS. <em>Alternatives: Ordinary/normal/smaller relational data.</em></li>
<li><strong>Smaller poly-structured data</strong> is data for which <a href="../../../../../2011/07/31/dynamic-fixed-schema-databases/">dynamic schema</a> capabilities are important, but which doesn&#8217;t rise to &#8220;big data&#8221; volume.</li>
</ul>
<p><span id="more-5768"></span>Notes on all this include:</p>
<ul>
<li>&#8220;Relational big data&#8221; is commonly what you need a scalable analytic relational DBMS for. But there are non-analytic use cases as well.</li>
<li>The paradigmatic example of &#8220;multi-structured big data&#8221; is log files. Thus, multi-structured big data is commonly what you need a <a href="../../../../../2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a> for.</li>
<li>One might want to equate non-analytic relational big data technology to &#8220;NewSQL&#8221;. However, I&#8217;m struggling to think of a database size range in which the entire NewSQL industry can match Oracle&#8217;s market share alone.</li>
<li>One might want to equate non-analytic multi-structured big data technology to &#8220;NoSQL&#8221;. However:
<ul>
<li>&#8220;NoSQL&#8221; is also used to encompass not-so-big-data use cases, such as prototyping in MongoDB.</li>
<li><a href="../../../../../2011/10/02/defining-nosql/">&#8220;NoSQL&#8221; has non-ACID/low(er)-data-integrity connotations</a> that aren&#8217;t appropriate for all non-relational systems.</li>
</ul>
</li>
<li>Up to a point, you can analyze relational big data in a conventional relational DBMS, but an analytic RDBMS will usually win on TCO (Total Cost of Ownership). In particular, reasonable thresholds for moving an analytic database off Oracle might be:
<ul>
<li>1-2 terabytes if you&#8217;ve never bought anything past Oracle Standard Edition.</li>
<li>5-10 terabytes if you&#8217;re already paying for Oracle Enterprise Edition.</li>
<li>A lot higher than that if you actually find Oracle Exadata to be cost-effective.</li>
</ul>
</li>
<li>Depending on how big one acknowledges as &#8220;big&#8221;, the market share leader in &#8220;big bit bucket&#8221; use cases is either Splunk or Hadoop.</li>
<li>If we look at multi-structured big data management overall, MarkLogic joins the list of market share contenders, as do various NoSQL alternatives.</li>
<li>It is wrong to say that the large web companies invented &#8220;big data&#8221; technology. But it is more reasonable to say they invented much of &#8220;multi-structured big data&#8221; management. In particular (and this is just a partial list), Google, Amazon, Yahoo, Facebook, et al. can reasonably be credited with Hadoop, Cassandra, HBase and various predecessors to same.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/01/08/big-data-terminology-and-positioning/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Agile predictive analytics &#8211; the heart of the matter</title>
		<link>http://www.dbms2.com/2011/11/28/agile-predictive-analytics-the-heart-of-the-matter/</link>
		<comments>http://www.dbms2.com/2011/11/28/agile-predictive-analytics-the-heart-of-the-matter/#comments</comments>
		<pubDate>Mon, 28 Nov 2011 19:40:26 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[SAS Institute]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5746</guid>
		<description><![CDATA[I&#8217;ve already suggested that several apparent issues in predictive analytic agility can be dismissed by straightforwardly applying best-of-breed technology, for example in analytic data management. At first blush, the same could be said about the actual analysis, which comprises: Data preparation, which is tedious unless you do a good job of automating it. Running the [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve already suggested that several apparent issues in <a href="http://www.dbms2.com/2011/11/28/agile-predictive-analytics-the-easy-parts/">predictive analytic agility</a> can be dismissed by straightforwardly applying best-of-breed technology, for example in analytic data management. At first blush, the same could be said about the actual analysis, which comprises:</p>
<ul>
<li>Data preparation, which is tedious unless you do a good job of automating it.</li>
<li>Running the actual algorithms.</li>
</ul>
<p>Numerous statistical software vendors (or open source projects) help you with the second part; some make strong claims in the first area as well (e.g., my clients at KXEN). Even so, large enterprises typically have statistical silos, commonly featuring expensive annual SAS licenses and seemingly slow-moving SAS programmers.</p>
<p>As I see it, the predictive analytics workflow goes something like this<span id="more-5746"></span>:</p>
<ul>
<li>Business-knowledgeable people develop a theory as to what kinds of information and segmentation could be valuable in making better business micro-decisions.</li>
<li>Statistics-knowledgeable people determine a structure for modeling that reflects this theory.</li>
<li>Statistics-knowledgeable people tweak the model over time, within a fixed general structure, as new data comes in.</li>
<li>(Optional) Somebody sees to acquiring whatever data is needed that the organization doesn&#8217;t already have (and won&#8217;t get in the ordinary course of ongoing business).</li>
</ul>
<p>The optional last part can be a purchase of third-party information (relatively fast and easy) or the development of a business process (and if necessary associated software) to capture the information (not always so easy). But even if that&#8217;s taken care of, or not present, we have at least two hand-offs where agility can be lost:</p>
<ul>
<li>Businesspeople may throw a request &#8220;over the wall&#8221; to the statisticians, who then work on it as their schedule permits.</li>
<li>Once created, a model may be so set in stone that even small changes are as hard as building a new model from scratch.</li>
</ul>
<p>The second problem can be solved by the statisticians themselves, without outside involvement. Model research and model refinement should be separate processes. You can recheck your clustering on one schedule, but recalibrate your regressions against each cluster more frequently. If that all sounds forbiddingly difficult, perhaps your model recalibration process needs another level of automation.</p>
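<p>A minimal sketch of that split follows, assuming a toy segmentation rule in place of real clustering and a one-variable least-squares fit in place of a real regression. The field names (<code>monthly_visits</code>, <code>discount</code>, <code>spend</code>), the threshold, and the data are all invented for illustration.</p>

```python
from collections import defaultdict


def assign_cluster(customer):
    """Slow-changing step: a hypothetical rule standing in for real clustering."""
    return "heavy" if customer["monthly_visits"] >= 10 else "light"


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept over (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    if var_x == 0:
        # No variation in x: fall back to a flat line at the mean.
        return 0.0, mean_y
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def recalibrate(customers, cluster_of):
    """Fast-changing step: refit one regression per existing cluster."""
    groups = defaultdict(list)
    for c in customers:
        points = groups[cluster_of[c["id"]]]
        points.append((c["discount"], c["spend"]))
    return {name: fit_line(points) for name, points in groups.items()}


customers = [
    {"id": 1, "monthly_visits": 12, "discount": 0, "spend": 100},
    {"id": 2, "monthly_visits": 15, "discount": 10, "spend": 120},
    {"id": 3, "monthly_visits": 2, "discount": 0, "spend": 20},
    {"id": 4, "monthly_visits": 3, "discount": 10, "spend": 50},
]

# Run rarely (say, monthly): decide which cluster each customer belongs to.
cluster_of = {c["id"]: assign_cluster(c) for c in customers}

# Run often (say, daily): refit the per-cluster regressions on fresh data.
models = recalibrate(customers, cluster_of)
```

<p>Because <code>recalibrate</code> takes the cluster assignments as an input rather than recomputing them, the two steps can run on entirely different schedules.</p>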
<p>So I&#8217;ve finally gotten to the point of saying what may have been obvious from the start: <strong>The only excusable impediment to predictive analytic agility is the hand-off from the people who know the business to the people who know the math.</strong> So let&#8217;s examine ways that difficulty can be resolved.</p>
<p>At big internet companies, the usual answer is something like</p>
<blockquote><p>Hey, it&#8217;s just data. From web logs. And network event logs. The data scientists know how to handle that.</p></blockquote>
<p>In financial trading firms, the answer is more</p>
<blockquote><p>The traders and analysts work closely together. Very closely. In fact, when the traders rip out their phones and throw them across the room, the analysts need to duck to avoid getting clobbered.</p></blockquote>
<p>In credit card or telecom marketing or insurance actuarial organizations, the answer may be</p>
<blockquote><p>Don&#8217;t worry; the stats geeks have been at this for a long time; they really do understand our business.</p></blockquote>
<p>All three approaches work.</p>
<p>But what about conventional enterprises, where line-of-business people may not be as math-savvy as internet developers or financial traders, and where the math experts may not have the business issues down cold? My flippant answer is that businesspeople should know some math too.* My more serious answer is that <strong>the &#8220;business analyst&#8221; role should be expanded </strong>beyond BI and planning<strong> to include lightweight predictive analytics as well.</strong></p>
<p><em>*I wasn&#8217;t being entirely flippant, of course. Statistics is even being taught in high school these days. And when I got a PhD in game theory, 2/3 of my thesis committee was at the Harvard Business School.</em></p>
<p>For example, at retailers:</p>
<ul>
<li>Market basket analysis is pretty simplistic (it only looks at small subsets of a basket at a time).</li>
<li>Seasonality is tricky. (Weather and so on can skew it.)</li>
<li>Each store or region can be its own universe.</li>
<li>Some of the results of analytics are rather coarse-grained &#8212; e.g., merchandise adjacencies &#8212; so precision in statistical analysis may not matter much anyway.</li>
</ul>
<p>And so truly rigorous statistical analysis may be both unfeasible and unnecessary; a lot of business-informed seat-of-the-pants reasoning needs to be mixed in. Consequently, there&#8217;s a lot to be said for pushing at least some retail predictive analytics pretty close to the merchandising department(s).</p>
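<p>To illustrate the remark that market basket analysis only looks at small subsets of a basket at a time, here is a rough sketch that counts how often pairs of items are bought together. The baskets are invented, and pair counting is just the simplest variant; it is not a claim about what any particular retailer or product does.</p>

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of distinct items."""
    counts = Counter()
    for basket in baskets:
        # Sorting makes ("bread", "milk") and ("milk", "bread") the same key.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


def support(counts, pair, n_baskets):
    """Fraction of all baskets that contain both items of the pair."""
    return counts[tuple(sorted(pair))] / n_baskets


# Made-up transactions for illustration.
baskets = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["milk", "eggs"],
    ["bread", "butter"],
]

counts = pair_counts(baskets)
bread_milk = support(counts, ("milk", "bread"), len(baskets))  # 0.5
```

<p>Note what the sketch ignores: whole-basket context, seasonality, and store-to-store differences, which is exactly where the business-informed judgment comes in.</p>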
<p>Similar stories could be told in many other industries and pursuits, including but emphatically not limited to:</p>
<ul>
<li>Event marketing.</li>
<li>College admissions.</li>
<li>Political campaigning.</li>
<li>Field maintenance at utility companies.</li>
<li>Price-setting (across many industries).</li>
</ul>
<p>In each case, it&#8217;s easy to see how statistical and predictive analytic techniques could add real value to the business. But it&#8217;s hard to imagine how the enterprise could support the kind of large, experienced, business-knowledgeable analytic operation one might find in hedge fund investing or telecom churn analysis. And absent that, it&#8217;s tough to see why the only people doing predictive analytics for the organization should sit in some silo of statistical expertise.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/28/agile-predictive-analytics-the-heart-of-the-matter/feed/</wfw:commentRss>
		<slash:comments>19</slash:comments>
		</item>
		<item>
		<title>Agile predictive analytics &#8212; the &#8220;easy&#8221; parts</title>
		<link>http://www.dbms2.com/2011/11/28/agile-predictive-analytics-the-easy-parts/</link>
		<comments>http://www.dbms2.com/2011/11/28/agile-predictive-analytics-the-easy-parts/#comments</comments>
		<pubDate>Mon, 28 Nov 2011 19:38:58 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5747</guid>
		<description><![CDATA[I&#8217;m hearing a lot these days about agile predictive analytics, albeit rarely in those exact terms. The general idea is unassailable, in that it boils down to using data as quickly as reasonably possible. But discussing particulars is hard, for several reasons: Pundits tend to sketch castles in the air. Vendors tend to confuse part [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m hearing a lot these days about <strong>agile predictive analytics</strong>, albeit rarely in those exact terms. The general idea is unassailable, in that it boils down to <strong>using data as quickly as reasonably possible.</strong> But discussing particulars is hard, for several reasons:</p>
<ul>
<li><a href="http://www.column2.com/2011/11/agile-predictive-process-platforms-for-business-agility-with-jameskobielus/">Pundits tend to sketch castles in the air</a>.</li>
<li>Vendors tend to confuse part of the story &#8212; generally the part they happen to offer <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  &#8212; with the whole.</li>
<li>Different use cases give rise to different kinds of issues.</li>
</ul>
<p>At least three of the generic arguments for agility apply to predictive analytics:</p>
<ul>
<li>Doing the correct thing soon is usually better than doing the same correct thing later.</li>
<li>If it doesn&#8217;t take much time to do something, hopefully it doesn&#8217;t cost that much (in labor and so on) either.</li>
<li>It&#8217;s hard to get new stuff completely right on the first try. Often, the best strategy is to come close fast, then fix what&#8217;s still not ideal.</li>
</ul>
<p>But the reasons to want agile predictive analytics don&#8217;t stop there.</p>
<p><span id="more-5747"></span>Not only is it hard to get stuff right on the first try for a given information set, but the available information can also change quickly. For example:</p>
<ul>
<li>If you&#8217;re a consumer marketer, consumer tastes can change quickly, due to news (of many different kinds), seasonal trends, and so on. The most recent data you have contain information unavailable in your historical data sets. Also &#8230;</li>
<li>&#8230; if you change your offers, prices, ad placement, ad text, ad appearance, call center scripts, or anything else, you immediately gain new information that isn&#8217;t well-reflected in your previous models.</li>
<li>If you&#8217;re in capital markets, and you figure something out, probably so will rival investors. So whatever you knew three weeks ago may already be partially obsolete.</li>
</ul>
<p>What&#8217;s more, often you deliberately don&#8217;t want to test, model, or tune all your variables at once. First you determine whether the ad text should be &#8220;Would you be so kind as to allow us to supply you with our wares?&#8221; or &#8220;Buy it, dude!&#8221;; only afterwards do you decide whether the color scheme should rely on red or green.</p>
<p>With that as backdrop, how can you make your predictive analytics more agile? Let&#8217;s start by breaking predictive analytics into four pieces:</p>
<ul>
<li><a href="http://www.dbms2.com/2011/11/28/terminology-data-mustering/">Data mustering</a> for the analysts.</li>
<li>Actual analysis.</li>
<li>Data mustering for deployment.</li>
<li>Deployment.</li>
</ul>
<p><strong>Only the second of those has much excuse for being an agility bottleneck;</strong> the other three are well addressed by technology you can buy (or straightforwardly build) today.</p>
<p>The deployment part of the story can be pretty simple, at least technically &#8212; spit out some PMML (Predictive Modeling Markup Language), and if you&#8217;re deploying to a DBMS with good enough PMML support, you&#8217;re good to go. Any vendor who doesn&#8217;t offer that degree of simplicity had better be working toward it fast. That said, your applications that are infused with predictive analytics need to be modular enough to accommodate model changes; if not, some refactoring lies ahead. And the same can be said for the work processes that surround them.</p>
<p>The data mustering parts should be pretty straightforward too. Setting up a relational data mart tuned for <a href="http://www.dbms2.com/2011/03/03/investigative-analytics/">investigative analytics</a> isn&#8217;t all that hard or costly (perhaps unless your database is enormously large), and the same actually goes for a Hadoop cluster. Beyond that, if you can model and deploy from the same database, that&#8217;s great; if not, you have an ETL (Extract/Transform/Load) need. I guess you could have data quality/MDM (Master Data Management) issues as well, but offhand I&#8217;m not seeing why you wouldn&#8217;t push their solutions back to analysis time. And any decent analytic technology stack can give sub-hour latency; <a href="http://www.dbms2.com/2009/09/10/analytic-speed-latency/">while that may not suffice from all standpoints</a>, it&#8217;s plenty fast enough for analysis-time agility.</p>
<p>With those preliminaries out of the way, now let&#8217;s turn to <a href="http://www.dbms2.com/2011/11/28/agile-predictive-analytics-the-heart-of-the-matter/">the heart of the agile predictive analytics challenge</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/28/agile-predictive-analytics-the-easy-parts/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>Terminology: Data mustering</title>
		<link>http://www.dbms2.com/2011/11/28/terminology-data-mustering/</link>
		<comments>http://www.dbms2.com/2011/11/28/terminology-data-mustering/#comments</comments>
		<pubDate>Mon, 28 Nov 2011 19:10:11 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Complex event processing (CEP)]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Sybase]]></category>
		<category><![CDATA[Teradata]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5736</guid>
		<description><![CDATA[I find myself in need of a word or phrase that means bring data together from various sources so that it&#8217;s ready to be used, where the use can be analysis or operations. The first words I thought of were &#8220;aggregation&#8221; and &#8220;collection,&#8221; but they both have other meanings in IT. Even &#8220;data marshalling&#8221; has [...]]]></description>
			<content:encoded><![CDATA[<p>I find myself in need of a word or phrase that means <strong>bring data together from various sources so that it&#8217;s ready to be used,</strong> where the use can be analysis or operations. The first words I thought of were &#8220;aggregation&#8221; and &#8220;collection,&#8221; but they both have other meanings in IT. Even &#8220;data marshalling&#8221; has a specific meaning different from what I want. So instead, I&#8217;ll go with <strong>data mustering.</strong></p>
<p>I mean for the term &#8220;data mustering&#8221; to encompass at least three scenarios:</p>
<ul>
<li>Integrated (relational) data warehouse.</li>
<li>Big bit bucket.</li>
<li>Big bit stream.</li>
</ul>
<p>Let me explain what I mean by each.  <span id="more-5736"></span></p>
<p><strong>&#8220;Integrated data warehouse&#8221;</strong> is a phrase Teradata has started using for enterprise data warehouses that, <a href="../../../../../2010/04/12/enterprise-data-warehouse-edw-myt/">like approximately every other EDW in the entire history of data warehousing</a>, aren&#8217;t truly enterprise-wide. In other words, it means &#8220;not just a data mart&#8221;. <a href="http://www.strategicmessaging.com/no-market-categorization-is-ever-precise/2011/03/01/">No category name is perfect</a>, but I think that one works reasonably well.</p>
<p>I previously described the <strong><a href="../../../../../2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a></strong> use case as</p>
<blockquote><p>Users take a whole lot of data, often <a href="../../../../../2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a> in logs of different kinds, and dump it into one place, managed by Hadoop, at open-source pricing.</p></blockquote>
<p>and quickly added</p>
<blockquote><p>Of course, there are various outfits who’d like to sell you not-so-cheap bit buckets. Contending technologies include <a href="../../../../../2011/06/02/why-you-would-want-an-appliance-and-when-you-wouldnt/">Hadoop appliances</a> (which I don’t believe in), <a href="../../../../../2009/10/18/technical-introduction-to-splunk/">Splunk</a> (which in many use cases I do), and <a href="../../../../../2010/11/29/marklogic-and-its-document-dbms/">MarkLogic</a> (ditto, but often the cases are different from Splunk’s). Cloudera and IBM, among other vendors, would also like to sell you some proprietary software to go with your standard Apache Hadoop code.</p></blockquote>
<p>I think I&#8217;ll stand pat on that explanation. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>By analogy, a <strong>big bit stream</strong> is various streams of data, assembled in the custody of a streaming engine. Sybase told me Wednesday that this scenario appears in both of the traditional markets for CEP/streaming: national intelligence, where it is a major use of streaming, and capital markets, in some use cases. That&#8217;s consistent with what I&#8217;ve heard from other CEP/streaming vendors as well.</p>
<p>As for where I got the word &#8220;mustering&#8221; &#8212; it&#8217;s a military term, for when you assemble your troops and their gear either for inspection or for actual use. The main modern usage I know of the word is as part of the phrase &#8220;pass muster&#8221;, which originally referred to the concept that the person being paid to put a regiment together should from time to time demonstrate that the regiment physically existed in the form that regimental records seemed to show.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/28/terminology-data-mustering/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Some big-vendor execution questions, and why they matter</title>
		<link>http://www.dbms2.com/2011/11/21/big-vendor-execution-analytics/</link>
		<comments>http://www.dbms2.com/2011/11/21/big-vendor-execution-analytics/#comments</comments>
		<pubDate>Mon, 21 Nov 2011 11:01:20 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Cognos]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[HP and Neoview]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[IBM and DB2]]></category>
		<category><![CDATA[In-memory DBMS]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Memory-centric data management]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[SAP AG]]></category>
		<category><![CDATA[Vertica Systems]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5704</guid>
		<description><![CDATA[When I drafted a list of key analytics-sector issues in honor of look-ahead season, the first item was &#8220;execution of various big vendors&#8217; ambitious initiatives&#8221;.  By &#8220;execute&#8221; I mean mainly: &#8220;Deliver products that really meet customers&#8217; desires and needs.&#8221; &#8220;Successfully convince them that you&#8217;re doing so &#8230;&#8221; &#8220;&#8230; at an attractive overall cost.&#8221; Vendors mentioned [...]]]></description>
			<content:encoded><![CDATA[<p>When I drafted a list of key analytics-sector issues in honor of <a href="http://www.dbms2.com/2011/11/21/analytic-trends-in-2012-qa/">look-ahead season</a>, the first item was &#8220;execution of various big vendors&#8217; ambitious initiatives&#8221;.  By &#8220;execute&#8221; I mean mainly:</p>
<ul>
<li>&#8220;Deliver products that really meet customers&#8217; desires and needs.&#8221;</li>
<li>&#8220;Successfully convince them that you&#8217;re doing so &#8230;&#8221;</li>
<li>&#8220;&#8230; at an attractive overall cost.&#8221;</li>
</ul>
<p>Vendors mentioned here are Oracle, SAP, HP, and IBM. Anybody smaller got left out due to the length of this post. Among the bigger omissions were:</p>
<ul>
<li>salesforce.com (multiple subjects).</li>
<li><a href="../../../../../2011/04/21/sas-hpa-does-make-sense-after-all/">SAS HPA</a>.</li>
<li><a href="../../../../../2011/08/21/hadoop-evolution/">The evolution of Hadoop</a>.</li>
</ul>
<p><span id="more-5704"></span><strong>A (lingering) issue for SAP and Oracle alike</strong></p>
<p>As I noted in January of this year, <a href="../../../../../2011/01/03/the-six-useful-things-you-can-do-with-analytic-technology/">integration of business intelligence into operational apps is making very slow progress</a>. Even so, it&#8217;s a huge part of the apparent strategy at SAP and Oracle alike, as well it should be. Much of the benefit from automating routine desk work has already happened. The areas ripest for exploitation are the ones where analytics are part of the equation.</p>
<p>Given the lack of tangible progress, why do I think this is a genuine area of Oracle and SAP emphasis? Three reasons of many are:</p>
<ul>
<li>Why else did SAP buy Business Objects?</li>
<li>If they&#8217;re not trying to <a href="../../../../../2011/03/30/short-request-and-analytic-processing/">integrate operational apps and analytics</a>, why else does SAP&#8217;s emphasis on HANA make sense?</li>
<li>Without business intelligence in the picture, how does Oracle&#8217;s integrated-stack story promise any direct user benefits?*</li>
</ul>
<p><em>*As opposed to IT concerns &#8212; integration, administration, TCO (Total Cost of Ownership), etc.</em></p>
<p>After so many years of disappointment, I&#8217;m not going to forecast 2012 as a pivotal year for <strong>the integration of business intelligence into operational applications.</strong> But if one of SAP or Oracle ever does get a significant BI/operational app integration lead over the other, it could be a major competitive advantage in those application market segments that are still up for grabs. It is also an opportunity for both vendors to gain BI market share in their respective application customer bases.</p>
<p><strong>A more urgent issue for SAP</strong></p>
<p>SAP has put huge amounts of credibility on the line for HANA, the integration of two different and not particularly mature in-memory database technologies. So far, it is difficult to find evidence that HANA is robust enough for widespread adoption. Whether or not SAP can fix that is a huge open question, which could have significant impact on the course of several technology areas: applications, business intelligence, in-memory DBMS, and maybe even hardware.</p>
<p>Based on current information, which is admittedly partial, I&#8217;m a short-term pessimist on HANA. Longer-term, I&#8217;m on record as saying that <a href="../../../../../2011/05/23/databases-ram/">traditional databases will eventually wind up in RAM</a>. SAP will surely get that technology right some day, whether or not the way it does so has anything to do with present-day HANA code.</p>
<p><strong>Four more issues for Oracle </strong></p>
<p>Oracle&#8217;s ambitions are near-endless, and so also therefore is its list of execution challenges. Four in the analytics area that I find particularly interesting are:</p>
<ul>
<li><strong>True hybrid columnar DBMS.</strong> <a href="../../../../../2011/09/22/teradata-columnar-compression/">I was guessing that Oracle, like Teradata, would announce true hybrid columnar the week of Oracle OpenWorld</a>. I was wrong. But if Oracle can&#8217;t bring out true hybrid columnar DBMS functionality relatively soon, Exadata will lose credibility as a competitor to more specialized analytic DBMS.</li>
<li><strong>Oracle Exalytics.</strong> With Exalytics in the mix, Oracle&#8217;s technology stack has HANA-like potential. But will Exalytics even ship in 2012? (I think so.) Will it be good for much in the first release? (I&#8217;m skeptical.)</li>
<li><strong>Oracle&#8217;s Big Data Appliance</strong>. I&#8217;m skeptical both about <a href="../../../../../2011/10/20/more-notes-on-oracle-nosql/">Oracle&#8217;s NoSQL product</a> &#8212; <a href="http://www.infoworld.com/d/data-explosion/first-look-oracle-nosql-database-179107">a favorable InfoWorld review</a> notwithstanding &#8212; and <a href="../../../../../2011/09/23/hadoop-appliances/">Hadoop appliances</a>. But if I&#8217;m wrong, and Oracle can successfully embrace/extend the new non-relational paradigms, then it really might regain control over the evolution of data management.</li>
<li><strong><a href="../../../../../2011/10/18/oracle-is-buying-endeca/">Oracle&#8217;s Endeca acquisition</a></strong> &#8212; will Oracle prove me wrong and integrate Endeca effectively into its overall analytic product line? If it does, we might finally see effective text (and eventually speech) navigation of enterprise software. (But as with all Oracle issues cited here, this is something that probably won&#8217;t amount to much in 2012 even if it does later go well.)</li>
</ul>
<p><strong>Three issues for IBM</strong></p>
<p>Like Oracle, IBM is a huge company with many ambitions and hence many execution challenges. The biggest of those is surely: <strong>How effective can IBM be at selling outside its existing customer base?</strong> I don&#8217;t hear as much competitively about IBM DataStage, IBM SPSS or now IBM Netezza as I did when their vendors were independent companies. Even Cognos may not be much of an exception to the rule, although it has its own large customer base outside of IBM&#8217;s traditional one. (To lesser extents, the same is of course true of Netezza and numerous other IBM acquisitions.)</p>
<p>Another general issue for IBM is <strong>substantively integrating its various product lines,</strong> at least to the extent that makes sense. DB2/Netezza integration sounds good, but even that is more a matter of product marketing (the admirable part of that discipline) than of actual technology. Other integrations (e.g. Cognos/DB2 in various bundles) have tended toward the dubious side.*</p>
<p><em>*I&#8217;m still waiting for IBM to get back to me with examples of how Cognos/DB2 joint tuning amounts to anything. It&#8217;s been more than a year, so I&#8217;m glad I didn&#8217;t hold my breath.</em></p>
<p>In a somewhat narrower vein, I wonder: <strong><a href="../../../../../2011/11/10/cep-streaming-catchup/">Will IBM be able to gain traction for InfoSphere Streams</a>? </strong>And if so, when and where will the traction be?</p>
<p><strong>Will HP screw up Vertica?</strong></p>
<p>Vertica has a very attractive product offering. It&#8217;s perhaps <a href="../../../../../2011/06/20/columnar-dbms-vendor-customer-metrics/">the most scalable analytic DBMS outside of Teradata</a>, running on the hardware of your reasonable choice.  It&#8217;s also the one I recommend most often to clients in the 1-50 terabyte range.</p>
<p>So far HP doesn&#8217;t seem to have done much to leadfoot Vertica. (About all I&#8217;ve heard from competitors is that Vertica seems to have faded somewhat in the financial services market, and there could be multiple explanations if that is indeed true.) But if HP Vertica does somehow manage to botch things, opportunities will open up for a range of columnar analytic DBMS competitors.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/21/big-vendor-execution-analytics/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>StreamBase catchup</title>
		<link>http://www.dbms2.com/2011/11/10/streambase-catchup/</link>
		<comments>http://www.dbms2.com/2011/11/10/streambase-catchup/#comments</comments>
		<pubDate>Fri, 11 Nov 2011 03:31:44 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Complex event processing (CEP)]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[StreamBase]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5630</guid>
		<description><![CDATA[While I was cryptic in my general CEP/streaming catchup, I&#8217;ll say a bit more regarding StreamBase in particular. At the highest level, non-technically: StreamBase once planned to conquer the world. However, StreamBase really only sold effectively in the financial trading and intelligence markets. StreamBase retrenched, focusing almost exclusively on the financial trading market. With StreamBase [...]]]></description>
			<content:encoded><![CDATA[<p>While I was cryptic in my general <a href="http://www.dbms2.com/2011/11/10/cep-streaming-catchup/">CEP/streaming catchup</a>, I&#8217;ll say a bit more regarding StreamBase in particular. At the highest level, non-technically:</p>
<ul>
<li>StreamBase once planned to conquer the world.</li>
<li>However, StreamBase really only sold effectively in the financial trading and intelligence markets.</li>
<li>StreamBase retrenched, focusing almost exclusively on the financial trading market.</li>
<li>With <a href="http://www.dbms2.com/2011/11/10/streambase-liveview-push-based-real-time-bi/">StreamBase LiveView</a>, StreamBase is expanding from embedded <a href="../../../../../2011/11/08/terminology-operational-analytics/">operational analytics</a> to do (also operational) business intelligence as well.</li>
<li>StreamBase is hopeful that, perhaps starting with Version 2 or so, LiveView will be successful outside the financial trading market.</li>
</ul>
<p><span id="more-5630"></span><em>Not coincidental to these shifts in focus, StreamBase was our client, then stopped being one for a while, and now is a client again.</em></p>
<p>StreamBase (the product set) consists primarily of three things (LiveView aside):</p>
<ul>
<li>A development environment, whose output is in &#8230;</li>
<li>&#8230; a visual programming language called EventFlow &#8230;</li>
<li>&#8230; which is compiled and executed by StreamBase&#8217;s execution layers.</li>
</ul>
<p>One important set of ancillary products are StreamBase&#8217;s connectors to various data sources &#8212; StreamBase offers about 125 of its own, a number that approaches 200 when <a href="../../../../../2010/02/16/quick-thoughts-on-the-streambase-component-exchange/">community contributions</a> are included.</p>
<p>StreamBase has a second programming language called StreamSQL, but that&#8217;s rarely used except for embedding in or connecting to third-party software. EventFlow and StreamSQL compile to nearly identical byte code. (The main difference seems to be that as a practical matter you&#8217;ll name things a bit differently in the two languages, focusing on verbs in EventFlow and nouns in StreamSQL.)</p>
<p>StreamBase says that in the financial trading market, great performance out of the box equates to better time-to-value, since you are spared time you&#8217;d otherwise have to spend tuning the system. Implicit in that is a claim &#8212; which competitors might dispute <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  &#8212; that StreamBase has great <a href="../../../../../2009/05/21/notes-on-cep-performance/">performance</a>. StreamBase fondly thinks that having a domain-specific language gives it a leg up in achieving great compiler optimization. (The same would presumably apply to StreamBase&#8217;s competitors, but only if they have optimizing compilers themselves.)</p>
<p>One point that&#8217;s a little unusual for me these days is that StreamBase favors big SMP (Symmetric MultiProcessing) boxes over blade-based scale-out. 16+ cores and 256 gigabytes of RAM are not uncommon. Clusters commonly include 4-8 machines, but rarely more; the largest StreamBase cluster evidently contains 36 machines.</p>
<p>And with that I&#8217;ll turn to StreamBase&#8217;s newest offering, <a href="http://www.dbms2.com/2011/11/10/streambase-liveview-push-based-real-time-bi/">LiveView</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/10/streambase-catchup/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>The cool aspects of Odiago WibiData</title>
		<link>http://www.dbms2.com/2011/11/02/5576/</link>
		<comments>http://www.dbms2.com/2011/11/02/5576/#comments</comments>
		<pubDate>Wed, 02 Nov 2011 15:05:01 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Odiago and WibiData]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5576</guid>
		<description><![CDATA[Christophe Bisciglia and Aaron Kimball have a new company. It&#8217;s called Odiago, and is one of my gratifyingly more numerous tiny clients. Odiago&#8217;s product line is called WibiData, after the justly popular We Be Sushi restaurants. We&#8217;ve agreed on a split exclusive de-stealthing launch. You can read about the company/founder/investor stuff on TechCrunch. But this [...]]]></description>
			<content:encoded><![CDATA[<p>Christophe Bisciglia and Aaron Kimball have a new company.</p>
<ul>
<li>It&#8217;s called Odiago, and is one of my gratifyingly more numerous tiny clients.</li>
<li>Odiago&#8217;s product line is called <a href="http://www.wibidata.com/">WibiData</a>, after the justly popular We Be Sushi restaurants.</li>
<li>We&#8217;ve agreed on a split exclusive de-stealthing launch. You can read about the company/founder/investor stuff on <a href="http://techcrunch.com/2011/11/02/cloudera-founder-debuts-big-data-management-and-analysis-platform-wibidata-with-backing-from-eric-schmidt/">TechCrunch</a>. But this is the place for &#8212; well, for the tech crunch.</li>
</ul>
<p><strong>WibiData is designed for management of, <a href="../../../../../2011/03/03/investigative-analytics/">investigative analytics</a> on, and operational analytics on consumer internet data,</strong> the main examples of which are web site traffic and personalization and their analogues for games and/or mobile devices. The core WibiData technology, built on HBase and Hadoop,* is <strong>a data management and analytic execution layer.</strong> That&#8217;s where the secret sauce resides. Also included are:</p>
<ul>
<li>REST APIs for interactive access.</li>
<li>Import/export tools, including JDBC access.</li>
<li>Management tools.</li>
<li>Analytic libraries &#8212; data mining, predictive analytics, machine learning, and so on.</li>
</ul>
<p>The whole thing is in beta, with about three (paying) beta customers.</p>
<p><em>*And Avro and so on.</em></p>
<p>The core ideas of WibiData include:</p>
<ul>
<li><strong>ALL data pertaining to a single user</strong> (or mobile device) <strong>is kept in a single,</strong> possibly very long, <strong>HBase row.</strong></li>
<li>There are two primary operators in WibiData, <strong>Produce </strong>and <strong>Gather.</strong>
<ul>
<li>Produce operates on single rows. It can operate on one row at HBase speed (milliseconds) if you need to inform an interactive user response. Or it can operate on the whole database in batch via Hadoop MapReduce.</li>
<li>It is reasonable to think of Produce as mainly doing two things. One is the aforementioned serving of data out of WibiData into interactive applications. The other is scoring, classifying, recommending, etc. on individual users (i.e. rows), in line with an analytic model.</li>
<li>Gather typically operates on all your rows at once, and emits suitable input for a MapReduce Reduce step. It is reasonable to think of Gather as being a key cog in the training of analytic models.</li>
</ul>
</li>
<li>HBase <strong>schema management is done at the WibiData system level,</strong> not directly in applications. There&#8217;s a WibiData HBase data dictionary, powered by a set of system tables, that specifies cell data types/record types and, in effect, primitive schemas.</li>
</ul>
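<p>To make the Produce/Gather split concrete, here is a minimal Python sketch. It is not WibiData&#8217;s API, which this post doesn&#8217;t show; plain dictionaries stand in for HBase rows, and every name in it (<code>rows</code>, <code>produce_score</code>, <code>gather_page_counts</code>, <code>reduce_counts</code>) is hypothetical. The only point is the shape of the work: Produce looks at one row at a time, whether interactively or in batch, while Gather emits key/value pairs from each row for a later reduce step.</p>
<pre><code class="language-python">from collections import Counter

# Hypothetical stand-in for a WibiData table: one dict per user "row",
# keyed by user id. Real WibiData rows live in HBase; this is only a sketch.
rows = {
    "user1": {"page_views": ["home", "pricing", "home"]},
    "user2": {"page_views": ["home"]},
    "user3": {"page_views": ["pricing", "docs", "docs", "docs"]},
}


def produce_score(row):
    """Produce-style step: look at ONE row and derive a value from it.

    The "model" is a toy: the fraction of views that hit the pricing page.
    """
    views = row["page_views"]
    return views.count("pricing") / len(views)


def gather_page_counts(row):
    """Gather-style step: emit (key, value) pairs from one row,
    shaped as input for a later reduce step."""
    return [(page, 1) for page in row["page_views"]]


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = Counter()
    for page, n in pairs:
        totals[page] += n
    return dict(totals)


# Interactive use: score a single row on demand.
single = produce_score(rows["user1"])

# Batch use: the same function applied to every row.
scores = {user: produce_score(row) for user, row in rows.items()}

# Gather over all rows, then reduce.
emitted = [pair for row in rows.values() for pair in gather_page_counts(row)]
page_totals = reduce_counts(emitted)
</code></pre>
<p>In this toy version the same <code>produce_score</code> function serves both the one-row interactive case and the whole-table batch case, which is the property the Produce description above emphasizes.</p>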
<p><span id="more-5576"></span>WibiData-enhanced HBase differs from relational DBMS in most of the ways you would imagine, both good and bad. In particular:</p>
<ul>
<li>Depending on how you look at it, WibiData-enhanced HBase either has no DML (Data Manipulation Language) at all, or else has one that&#8217;s a lot less rich than SQL.</li>
<li>WibiData-enhanced HBase schemas are much more <a href="../../../../../2011/07/31/dynamic-fixed-schema-databases/">dynamic</a> than SQL schemas.</li>
<li>WibiData-enhanced HBase schemas can have nested or recursive data structures, such as array-valued cells.</li>
</ul>
<p>To expand on each of those points in turn:</p>
<p>WibiData&#8217;s underlying one-giant-table philosophy notwithstanding, there are times you manage multiple tables with it. (For example, you ingest data into WibiData however you can, and then run transformations &#8212; typically batch &#8212; until the data is in the preferred structure.) While WibiData does have ways to simulate joins, foreign keys, and so on, there&#8217;s nothing resembling referential integrity or foreign key constraints.</p>
<p><strong>WibiData takes single-table schema flexibility to an extreme.</strong> Not only can different rows in the same table have different associated columns &#8212; something that relational systems can in effect also do via NULL values &#8212; but schemas can even change over the life of a column. If you have an array-valued cell storing the results of a marketing campaign, and you start recording more data partway through the campaign, then different rows in the table will, in the same column, hold different-sized arrays.</p>
<p>That nesting can also get pretty serious; <strong>where you’d have a single value in a relational table, you might have the equivalent of a whole relational table (or at least selection/view) in WibiData-enhanced HBase. </strong>For example, if a user visits the same web page ten times, and each time 50 attributes are recorded (including a timestamp), all 500 data – to use the word “data” in its original “plural of <em>datum</em>” sense – would likely be stored in the same WibiData cell.</p>
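<p>The arithmetic in that example can be sketched in a few lines of Python. This is an illustration only: the row layout, the <code>visits:/pricing</code> column name, and the <code>make_visit</code> helper are all assumptions of mine, not anything Odiago has disclosed. One cell in one user&#8217;s row holds a list of visit records, and each record carries 50 attributes.</p>
<pre><code class="language-python"># Hypothetical layout: one row per user, one "cell" per page, and each cell
# holds a list of visit records. None of these names come from WibiData.
NUM_VISITS = 10
ATTRS_PER_VISIT = 50  # includes the timestamp


def make_visit(visit_number):
    """Build one visit record with ATTRS_PER_VISIT attributes."""
    record = {"timestamp": 1320000000 + visit_number}
    for i in range(ATTRS_PER_VISIT - 1):
        record["attr_%d" % i] = visit_number * i
    return record


user_row = {
    "visits:/pricing": [make_visit(n) for n in range(NUM_VISITS)],
}

# Everything below lives in a single cell of a single row.
cell = user_row["visits:/pricing"]
total_data = sum(len(visit) for visit in cell)
</code></pre>
<p>Here <code>total_data</code> comes to 10 &#215; 50 = 500, all inside what a relational schema would treat as one value.</p>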
<p>That’s about all Odiago is disclosing about WibiData right now. Christophe will also be talking at Hadoop World next week, and presumably can be hit up with any burning questions then.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/02/5576/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
	</channel>
</rss>

