<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; Web analytics</title>
	<atom:link href="http://www.dbms2.com/category/applications/web-analytics/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Wed, 08 Feb 2012 12:22:57 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>Couchbase update</title>
		<link>http://www.dbms2.com/2012/02/01/couchbase-update/</link>
		<comments>http://www.dbms2.com/2012/02/01/couchbase-update/#comments</comments>
		<pubDate>Thu, 02 Feb 2012 04:00:24 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Basho and Riak]]></category>
		<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[CouchDB]]></category>
		<category><![CDATA[Couchbase]]></category>
		<category><![CDATA[DataStax]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[MongoDB and 10gen]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[Zynga]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5877</guid>
		<description><![CDATA[I checked in with James Phillips for a Couchbase update, and I understand better what&#8217;s going on. In particular: Give or take minor tweaks, what I wrote in my August, 2011 Couchbase updates still applies. Couchbase now and for the foreseeable future has one product line, called Couchbase. Couchbase 2.0, the first version of Couchbase [...]]]></description>
			<content:encoded><![CDATA[<p>I checked in with James Phillips for a Couchbase update, and I understand better what&#8217;s going on. In particular:</p>
<ul>
<li>Give or take minor tweaks, what I wrote in my <a href="../../../../../2011/08/13/couchbase-business-update/">August, 2011 Couchbase updates</a> still applies.</li>
<li>Couchbase now and for the foreseeable future has one product line, called Couchbase.</li>
<li>Couchbase 2.0, the first version of Couchbase (the product) to use CouchDB for persistence, has slipped &#8230;</li>
<li>&#8230; because more parts of CouchDB had to be rewritten for performance than Couchbase (the company) had hoped.</li>
<li>Think mid-year or so for the release of Couchbase 2.0, hopefully sooner.</li>
<li>In connection with the need to rewrite parts of CouchDB, Couchbase has:
<ul>
<li><a href="../../../../../2012/01/18/notes-from-the-couch-blogs/">Gotten out of the single-server CouchDB business</a>.</li>
<li>Donated its proprietary single-server CouchDB intellectual property to the Apache Foundation.</li>
</ul>
</li>
<li>The 150ish new customers in 2011 Couchbase brags about are real, subscription customers.</li>
<li>Couchbase has 60ish people, headed to &gt;100 over the next few months.</li>
</ul>
<p><span id="more-5877"></span><em>If you previously heard the brand names Couchbase Single or Couchbase Mobile, pay no further attention to them. Couchbase Single was CouchDB; Couchbase Mobile is part of Couchbase&#8217;s feature set.</em></p>
<p>The current product is Couchbase 1.8, which is a whole lot like what previously was called Membase. New features in Couchbase 1.8 (versus prior versions of Membase) were concentrated in client libraries/SDK (Software Development Kit). Not coincidentally, Couchbase has hired developer evangelists who are in charge of making Couchbase play nicely with various specific languages (e.g. C/C++).</p>
<p>Drilling down further into the CouchDB part of the story:</p>
<ul>
<li>Couchbase 2.0 will replace Couchbase 1.8/Membase&#8217;s SQLite back-end with CouchDB.</li>
<li>Parts of CouchDB that do things like read, write, or compact data have been rewritten from Erlang to C.</li>
<li>Couchbase still uses other Erlang parts of Apache CouchDB, and would be delighted if the community were to usefully enhance them.</li>
<li>Couchbase&#8217;s heavy contributions to development of open source CouchDB will, for the most part, continue.</li>
<li>CouchDB stuff donated to the Apache Foundation includes:
<ul>
<li>Documentation</li>
<li>Packaging</li>
<li>Performance enhancements</li>
</ul>
</li>
</ul>
<p>There&#8217;s at least one Couchbase user with &gt;1000 nodes (at a guess, <a href="../../../../../2011/09/05/zynga-linkedin-data-warehous/">Zynga</a>). More typical might be 20 nodes or fewer. This led me to wonder how much data one puts on a Couchbase node anyway. The answer turns out to vary widely, in that you want your working set to be in RAM, and whether that&#8217;s your entire database or just a slice of it depends on the nature of the application.</p>
<p>James echoed a trend I&#8217;ve heard elsewhere as well, in which products one thinks of as being internet-specific are also sold in a few cases to conventional enterprises for &#8212; you guessed it! &#8212; their internet operations. I also asked him about competition, and he asserted:</p>
<ul>
<li>MongoDB is the big competition. He believes Couchbase has an excellent win rate vs. 10gen for actual paying accounts.</li>
<li>DataStax/Cassandra wins over Couchbase only when multi-data-center capability is important. Naturally, multi-data-center capability is planned for Couchbase. (Indeed, that&#8217;s one of the benefits of swapping in CouchDB at the back end.)</li>
<li>Redis has &#8220;dropped off the radar&#8221;, presumably because there&#8217;s no particular persistence strategy for it.</li>
<li>Riak doesn&#8217;t show up much.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/02/01/couchbase-update/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Splunk update</title>
		<link>http://www.dbms2.com/2012/01/10/splunk-update/</link>
		<comments>http://www.dbms2.com/2012/01/10/splunk-update/#comments</comments>
		<pubDate>Tue, 10 Jan 2012 05:55:08 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Structured documents]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5791</guid>
		<description><![CDATA[Splunk is announcing the Splunk 4.3 point release. Before discussing it, let&#8217;s recall a few things about Splunk, starting with: Splunk is first and foremost an analytic DBMS &#8230; &#8230; used to manage logs and similar multistructured data. Splunk&#8217;s DML (Data Manipulation Language) is based on text search, not on SQL. Splunk has extended its [...]]]></description>
			<content:encoded><![CDATA[<p>Splunk is announcing the Splunk 4.3 point release. Before discussing it, let&#8217;s recall a few things about Splunk, starting with:</p>
<ul>
<li>Splunk is first and foremost an analytic DBMS &#8230;</li>
<li>&#8230; used to manage logs and similar multistructured data.</li>
<li>Splunk&#8217;s DML (Data Manipulation Language) is based on text search, not on SQL.</li>
<li>Splunk has extended its DML in natural ways (e.g., you can use it to do calculations and even some statistics).</li>
<li>Splunk bundles some (very) basic, Splunk-specific business intelligence capabilities.</li>
<li>The paradigmatic use of Splunk is to monitor IT operations in real time. However:
<ul>
<li>There also are plenty of non-real-time uses for Splunk.</li>
<li>Splunk is proudest of its growth in non-IT quasi-real-time uses, such as the marketing side of web operations.</li>
</ul>
</li>
</ul>
<p>As in any release, a lot of Splunk 4.3 is about &#8220;Oh, you didn&#8217;t have that before?&#8221; features and <a href="../../../../../2009/08/21/bottleneck-whack-a-mole/">Bottleneck Whack-A-Mole</a> performance speed-up. One performance enhancement is Bloom filters, which are a very hot topic these days. More important is a switch from Flash to HTML5, so as to accommodate mobile devices with less server-side rendering. Splunk reports that its users &#8212; especially the non-IT ones &#8212; really want to get Splunk information on tablet devices. While this somewhat contradicts <a href="../../../../../2012/01/04/some-issues-in-business-intelligence/">what I wrote a few days ago pooh-poohing mobile BI</a>, let me hasten to point out:</p>
<ul>
<li>Splunk is used for a lot of (quasi) real-time monitoring.</li>
<li>Splunk&#8217;s desktop user interfaces are, by BI standards, quite primitive.</li>
</ul>
<p>That&#8217;s pretty much the ideal scenario for mobile BI: Timeliness matters and prettiness doesn&#8217;t.</p>
<p><span id="more-5791"></span><em>Hmm. Maybe <a href="../../../../../2011/11/10/streambase-liveview-push-based-real-time-bi/">StreamBase LiveView</a> needs a mobile option as well &#8230;</em></p>
<p>Splunk&#8217;s basic use is to take the text string that is a log and make sense of it. But Splunk now also supports JSON structures. It does this via something called spath, which as you might guess from the name has XPath similarities. That probably bore more discussion than we found the time to have.</p>
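<p>To make the XPath analogy a bit more concrete, here is a minimal Python sketch of path-style extraction from a JSON log event. It is not Splunk&#8217;s spath implementation. The function name <code>extract_path</code>, the dotted path syntax with <code>{}</code> meaning &#8220;every element of this array&#8221;, and the sample event are all assumptions made for illustration.</p>

```python
import json


def extract_path(node, path):
    """Return every value reached by following a dotted path through parsed JSON.

    A step ending in "{}" means "descend into each element of this array".
    """
    current = [node]
    for step in path.split("."):
        expand = step.endswith("{}")
        key = step[:-2] if expand else step
        found = []
        for item in current:
            if isinstance(item, dict) and key in item:
                value = item[key]
                if expand and isinstance(value, list):
                    found.extend(value)
                else:
                    found.append(value)
        current = found
    return current


# Hypothetical log event; the field names are invented for this example.
event = json.loads(
    '{"user": {"id": 42}, "items": [{"sku": "a1", "qty": 2}, {"sku": "b7", "qty": 1}]}'
)

user_ids = extract_path(event, "user.id")
skus = extract_path(event, "items{}.sku")
```

<p>The point of the sketch is only that one short path expression can pull a field out of every element of a nested array, which is roughly what XPath does for XML.</p>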
<p><em>By the way: If you&#8217;re interested in BI over XML, that&#8217;s what my former clients at Skytide were founded to do, before they pivoted a bit. I don&#8217;t think those capabilities have disappeared from the product</em>.</p>
<p><a href="http://www.monash.com/uploads/Splunk-4-3.pdf">Splunk has graciously allowed me to post a slide deck</a>. More stuff in there, including quotes from a customer &#8212; Expedia &#8212; that has 2700 Splunk users.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/01/10/splunk-update/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Agile predictive analytics &#8211; the heart of the matter</title>
		<link>http://www.dbms2.com/2011/11/28/agile-predictive-analytics-the-heart-of-the-matter/</link>
		<comments>http://www.dbms2.com/2011/11/28/agile-predictive-analytics-the-heart-of-the-matter/#comments</comments>
		<pubDate>Mon, 28 Nov 2011 19:40:26 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[SAS Institute]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5746</guid>
		<description><![CDATA[I&#8217;ve already suggested that several apparent issues in predictive analytic agility can be dismissed by straightforwardly applying best-of-breed technology, for example in analytic data management. At first blush, the same could be said about the actual analysis, which comprises: Data preparation, which is tedious unless you do a good job of automating it. Running the [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve already suggested that several apparent issues in <a href="http://www.dbms2.com/2011/11/28/agile-predictive-analytics-the-easy-parts/">predictive analytic agility</a> can be dismissed by straightforwardly applying best-of-breed technology, for example in analytic data management. At first blush, the same could be said about the actual analysis, which comprises:</p>
<ul>
<li>Data preparation, which is tedious unless you do a good job of automating it.</li>
<li>Running the actual algorithms.</li>
</ul>
<p>Numerous statistical software vendors (or open source projects) help you with the second part; some make strong claims in the first area as well (e.g., my clients at KXEN). Even so, large enterprises typically have statistical silos, commonly featuring expensive annual SAS licenses and seemingly slow-moving SAS programmers.</p>
<p>As I see it, the predictive analytics workflow goes something like this<span id="more-5746"></span>:</p>
<ul>
<li>Business-knowledgeable people develop a theory as to what kinds of information and segmentation could be valuable in making better business micro-decisions.</li>
<li>Statistics-knowledgeable people determine a structure for modeling that reflects this theory.</li>
<li>Statistics-knowledgeable people tweak the model over time, within a fixed general structure, as new data comes in.</li>
<li>(Optional) Somebody sees to acquiring whatever data is needed that the organization doesn&#8217;t already have (and won&#8217;t get in the ordinary course of ongoing business).</li>
</ul>
<p>The optional last part can be a purchase of third-party information (relatively fast and easy) or the development of a business process (and if necessary associated software) to capture the information (not always so easy). But even if that&#8217;s taken care of, or not present, we have at least two hand-offs where agility can be lost:</p>
<ul>
<li>Businesspeople may throw a request &#8220;over the wall&#8221; to the statisticians, who then work on it as their schedule permits.</li>
<li>Once created, a model may be so set in stone that even small changes are as hard as building a new model from scratch.</li>
</ul>
<p>The second problem can be solved by the statisticians themselves, without outside involvement. Model research and model refinement should be separate processes. You can recheck your clustering on one schedule, but recalibrate your regressions against each cluster more frequently. If that all sounds forbiddingly difficult, perhaps your model recalibration process needs another level of automation.</p>
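<p>A minimal Python sketch of that separation follows, under loud assumptions: a median split on a made-up <code>tenure</code> field stands in for real clustering, a hand-rolled least-squares line stands in for the regressions, and every name and number is hypothetical. What matters is that the structure is chosen rarely while the coefficients are refit often.</p>

```python
from collections import defaultdict
from statistics import median


def research_segments(rows):
    """Slow, occasional step: choose the model structure (here, one split point)."""
    return median(r["tenure"] for r in rows)


def assign(row, split):
    """Place a row into a segment using the previously chosen structure."""
    return "established" if row["tenure"] >= split else "new"


def recalibrate(rows, split):
    """Fast, frequent step: refit a least-squares line per segment."""
    groups = defaultdict(list)
    for r in rows:
        groups[assign(r, split)].append((r["visits"], r["spend"]))
    models = {}
    for name, pts in groups.items():
        n = len(pts)
        mean_x = sum(x for x, _ in pts) / n
        mean_y = sum(y for _, y in pts) / n
        sxx = sum((x - mean_x) ** 2 for x, _ in pts)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
        slope = sxy / sxx if sxx else 0.0
        models[name] = (slope, mean_y - slope * mean_x)  # (slope, intercept)
    return models


# Invented sample data.
rows = [
    {"tenure": 1, "visits": 1, "spend": 10},
    {"tenure": 2, "visits": 3, "spend": 30},
    {"tenure": 10, "visits": 2, "spend": 50},
    {"tenure": 12, "visits": 4, "spend": 90},
]

split = research_segments(rows)   # run rarely
models = recalibrate(rows, split)  # run whenever new data arrives
```

<p>When fresh data comes in, only <code>recalibrate</code> needs to run again with the existing <code>split</code>; the research step keeps its own, slower schedule.</p>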
<p>So I&#8217;ve finally gotten to the point of saying what may have been obvious from the start: <strong>The only excusable impediment to predictive analytic agility is the hand-off from the people who know the business to the people who know the math.</strong> So let&#8217;s examine ways that difficulty can be resolved.</p>
<p>At big internet companies, the usual answer is something like</p>
<blockquote><p>Hey, it&#8217;s just data. From web logs. And network event logs. The data scientists know how to handle that.</p></blockquote>
<p>In financial trading firms, the answer is more</p>
<blockquote><p>The traders and analysts work closely together. Very closely. In fact, when the traders rip out their phones and throw them across the room, the analysts need to duck to avoid getting clobbered.</p></blockquote>
<p>In credit card or telecom marketing or insurance actuarial organizations, the answer may be</p>
<blockquote><p>Don&#8217;t worry; the stats geeks have been at this for a long time; they really do understand our business.</p></blockquote>
<p>All three approaches work.</p>
<p>But what about conventional enterprises, where line-of-business people may not be as math-savvy as internet developers or financial traders, and where the math experts may not have the business issues down cold? My flippant answer is that businesspeople should know some math too.* My more serious answer is that <strong>the &#8220;business analyst&#8221; role should be expanded </strong>beyond BI and planning<strong> to include lightweight predictive analytics as well.</strong></p>
<p><em>*I wasn&#8217;t being entirely flippant, of course. Statistics is even being taught in high school these days. And when I got a PhD in game theory, 2/3 of my thesis committee was at the Harvard Business School.</em></p>
<p>For example, at retailers:</p>
<ul>
<li>Market basket analysis is pretty simplistic (it only looks at small subsets of a basket at a time).</li>
<li>Seasonality is tricky. (Weather and so on can skew it.)</li>
<li>Each store or region can be its own universe.</li>
<li>Some of the results of analytics are rather coarse-grained &#8212; e.g., merchandise adjacencies &#8212; so precision in statistical analysis may not matter much anyway.</li>
</ul>
<p>And so truly rigorous statistical analysis may be both unfeasible and unnecessary; a lot of business-informed seat-of-the-pants reasoning needs to be mixed in. Consequently, there&#8217;s a lot to be said for pushing at least some retail predictive analytics pretty close to the merchandising department(s).</p>
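<p>For a sense of how simple pairwise market basket analysis can be, here is a small Python sketch that counts how often two items appear in the same basket. The baskets and the function name <code>pair_counts</code> are made up for illustration, and real retail tools do considerably more than this.</p>

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count co-occurrences of item pairs, looking at two items at a time."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Invented baskets.
baskets = [
    ["bread", "butter", "jam"],
    ["bread", "butter"],
    ["beer", "chips"],
    ["bread", "jam"],
    ["bread", "butter", "milk"],
]

counts = pair_counts(baskets)
top_pair, top_count = counts.most_common(1)[0]
```

<p>Nothing in this says whether bread and butter sell together because of season, store layout, or a promotion, which is where the business-informed reasoning has to come in.</p>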
<p>Similar stories could be told in many other industries and pursuits, including but emphatically not limited to:</p>
<ul>
<li>Event marketing.</li>
<li>College admissions.</li>
<li>Political campaigning.</li>
<li>Field maintenance at utility companies.</li>
<li>Price-setting (across many industries).</li>
</ul>
<p>In each case, it&#8217;s easy to see how statistical and predictive analytic techniques could add real value to the business. But it&#8217;s hard to imagine how the enterprise could support the kind of large, experienced, business-knowledgeable analytic operation one might find in hedge fund investing or telecom churn analysis. And absent that, it&#8217;s tough to see why the only people doing predictive analytics for the organization should sit in some silo of statistical expertise.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/28/agile-predictive-analytics-the-heart-of-the-matter/feed/</wfw:commentRss>
		<slash:comments>19</slash:comments>
		</item>
		<item>
		<title>The cool aspects of Odiago WibiData</title>
		<link>http://www.dbms2.com/2011/11/02/5576/</link>
		<comments>http://www.dbms2.com/2011/11/02/5576/#comments</comments>
		<pubDate>Wed, 02 Nov 2011 15:05:01 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Odiago and WibiData]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5576</guid>
		<description><![CDATA[Christophe Bisciglia and Aaron Kimball have a new company. It&#8217;s called Odiago, and is one of my gratifyingly more numerous tiny clients. Odiago&#8217;s product line is called WibiData, after the justly popular We Be Sushi restaurants. We&#8217;ve agreed on a split exclusive de-stealthing launch. You can read about the company/founder/investor stuff on TechCrunch. But this [...]]]></description>
			<content:encoded><![CDATA[<p>Christophe Bisciglia and Aaron Kimball have a new company.</p>
<ul>
<li>It&#8217;s called Odiago, and is one of my gratifyingly more numerous tiny clients.</li>
<li>Odiago&#8217;s product line is called <a href="http://www.wibidata.com/">WibiData</a>, after the justly popular We Be Sushi restaurants.</li>
<li>We&#8217;ve agreed on a split exclusive de-stealthing launch. You can read about the company/founder/investor stuff on <a href="http://techcrunch.com/2011/11/02/cloudera-founder-debuts-big-data-management-and-analysis-platform-wibidata-with-backing-from-eric-schmidt/">TechCrunch</a>. But this is the place for &#8212; well, for the tech crunch.</li>
</ul>
<p><strong>WibiData is designed for management of, <a href="../../../../../2011/03/03/investigative-analytics/">investigative analytics</a> on, and operational analytics on consumer internet data,</strong> the main examples of which are web site traffic and personalization and their analogues for games and/or mobile devices. The core WibiData technology, built on HBase and Hadoop,* is <strong>a data management and analytic execution layer.</strong> That&#8217;s where the secret sauce resides. Also included are:</p>
<ul>
<li>REST APIs for interactive access.</li>
<li>Import/export tools, including JDBC access.</li>
<li>Management tools.</li>
<li>Analytic libraries &#8212; data mining, predictive analytics, machine learning, and so on.</li>
</ul>
<p>The whole thing is in beta, with about three (paying) beta customers.</p>
<p><em>*And Avro and so on.</em></p>
<p>The core ideas of WibiData include:</p>
<ul>
<li><strong>ALL data pertaining to a single user </strong>(or mobile device) <strong>is kept in a single, </strong>possibly very long,<strong> HBase row.</strong></li>
<li>There are two primary operators in WibiData, <strong>Produce </strong>and <strong>Gather.</strong>
<ul>
<li>Produce operates on single rows. It can operate on one row at HBase speed (milliseconds) if you need to inform an interactive user response. Or it can operate on the whole database in batch via Hadoop MapReduce.</li>
<li>It is reasonable to think of Produce as mainly doing two things. One is the aforementioned serving of data out of WibiData into interactive applications. The other is scoring, classifying, recommending, etc. on individual users (i.e. rows), in line with an analytic model.</li>
<li>Gather typically operates on all your rows at once, and emits suitable input for a MapReduce Reduce step. It is reasonable to think of Gather as being a key cog in the training of analytic models.</li>
</ul>
</li>
<li>HBase <strong>schema management is done at the WibiData system level,</strong> not directly in applications. There&#8217;s a WibiData HBase data dictionary, powered by a set of system tables, that specifies cell data types/record types and, in effect, primitive schemas.</li>
</ul>
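<p>The Produce/Gather split described above can be sketched in a few lines of Python. This is not WibiData&#8217;s API, which hasn&#8217;t been disclosed in that kind of detail. The function names, the in-memory <code>table</code> standing in for HBase, and the choice to have Produce write its result back into the row are all assumptions for illustration.</p>

```python
from collections import defaultdict

# Hypothetical table: one row per user, with list-valued cells.
table = {
    "user1": {"page_views": ["/home", "/cart", "/home"]},
    "user2": {"page_views": ["/home"]},
}


def produce(row, fn, out_column):
    """Apply fn to a single row and store the result in that same row."""
    row[out_column] = fn(row)
    return row[out_column]


def gather(rows, fn):
    """Run fn over every row and emit (key, value) pairs for a reduce step."""
    for row in rows.values():
        yield from fn(row)


def reduce_sum(pairs):
    """Stand-in for a MapReduce Reduce step: sum the values per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def count_views(row):
    return len(row["page_views"])


def emit_page_hits(row):
    return [(page, 1) for page in row["page_views"]]


# Interactive use: one row at a time.
single = produce(table["user1"], count_views, "view_count")

# Batch use: every row (MapReduce would parallelize this loop).
for row in table.values():
    produce(row, count_views, "view_count")

# Gather feeds a reduce, as one might when training a model.
page_totals = reduce_sum(gather(table, emit_page_hits))
```

<p>The same <code>produce</code> function serves both the millisecond single-row case and the batch case; only the driver around it changes.</p>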
<p><span id="more-5576"></span>WibiData-enhanced HBase differs from relational DBMS in most of the ways you would imagine, both good and bad. In particular:</p>
<ul>
<li>Depending on how you look at it, WibiData-enhanced HBase either has no DML (Data Manipulation Language) at all, or else has one that&#8217;s a lot less rich than SQL.</li>
<li>WibiData-enhanced HBase schemas are much more <a href="../../../../../2011/07/31/dynamic-fixed-schema-databases/">dynamic</a> than SQL schemas.</li>
<li>WibiData-enhanced HBase schemas can have nested or recursive data structures, such as array-valued cells.</li>
</ul>
<p>To expand on each of those points in turn:</p>
<p>WibiData&#8217;s underlying one-giant-table philosophy notwithstanding, there are times you manage multiple tables with it. (For example, you ingest data into WibiData however you can, and then run transformations &#8212; typically batch &#8212; until the data is in the preferred structure.) While WibiData does have ways to simulate joins, foreign keys, and so on, there&#8217;s nothing resembling referential integrity or foreign key constraints.</p>
<p><strong>WibiData takes single-table schema flexibility to an extreme.</strong> Not only can different rows in the same table have different associated columns &#8212; something that relational systems can in effect also do via NULL values &#8212; but schemas can even change over the life of a column. If you have an array-valued cell storing the results of a marketing campaign, and you start recording more data partway through the campaign, then different rows in the table will, in the same column, hold different-sized arrays.</p>
<p>That nesting can also get pretty serious; <strong>where you’d have a single value in a relational table, you might have the equivalent of a whole relational table (or at least selection/view) in WibiData-enhanced HBase. </strong>For example, if a user visits the same web page ten times, and each time 50 attributes are recorded (including a timestamp), all 500 data – to use the word “data” in its original “plural of <em>datum</em>” sense – would likely be stored in the same WibiData cell.</p>
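<p>As a rough Python illustration of that arithmetic and of schema drift within one cell, consider the sketch below. The row layout, the column name <code>visits:/home</code>, and the attribute names are invented; nothing here reflects WibiData&#8217;s actual storage format.</p>

```python
def record_visit(timestamp, extra_attrs=0):
    """Build one visit record: a timestamp plus 49 made-up attributes (50 total)."""
    visit = {"timestamp": timestamp}
    visit.update({f"attr_{j}": j for j in range(49 + extra_attrs)})
    return visit


# One row for one user; the cell for a page holds every visit to that page.
row = {
    "user_id": "u1",
    "visits:/home": [record_visit(1320246000 + 60 * i) for i in range(10)],
}

# Ten visits times 50 attributes each.
data_count = sum(len(v) for v in row["visits:/home"])

# Partway through, start recording five more attributes per visit.
row["visits:/home"].append(record_visit(1320250000, extra_attrs=5))
sizes = sorted({len(v) for v in row["visits:/home"]})
```

<p>After the change, records of two different sizes sit side by side in the same cell, which is the kind of flexibility a fixed relational column would not allow.</p>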
<p>That’s about all Odiago is disclosing about WibiData right now. Christophe will also be talking at Hadoop World next week, and presumably can be hit up with any burning questions then.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/02/5576/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>More notes on Oracle NoSQL</title>
		<link>http://www.dbms2.com/2011/10/20/more-notes-on-oracle-nosql/</link>
		<comments>http://www.dbms2.com/2011/10/20/more-notes-on-oracle-nosql/#comments</comments>
		<pubDate>Thu, 20 Oct 2011 15:49:31 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5515</guid>
		<description><![CDATA[A reporter asked me for some thoughts on Oracle&#8217;s new NoSQL product. For the most part, I stand by my previous comments on Oracle NoSQL. Still, NoSQL in general deserves a place in Oracle shops, so it makes sense for Oracle to try to coopt it. Oracle&#8217;s core DBMS is not well suited to track [...]]]></description>
			<content:encoded><![CDATA[<p>A reporter asked me for some thoughts on Oracle&#8217;s new NoSQL product. For the most part, I stand by <a href="http://www.dbms2.com/2011/09/30/oracle-nosql/">my previous comments on Oracle NoSQL</a>. Still, NoSQL in general deserves a place in Oracle shops, so it makes sense for Oracle to try to coopt it.</p>
<p>Oracle&#8217;s core DBMS is not well suited to track interactions (e.g. web clicks), even in cases where it&#8217;s the choice for transactions; it&#8217;s unnecessarily heavyweight. What&#8217;s worse, <a href="http://www.dbms2.com/2010/09/16/chase-authentication-database-outage/">using the same database to store actions and interactions can lead to serious reliability problems.</a> If a better architecture is to dump the clicks into some NoSQL store, massage the information, and eventually put some derived data into a relational DBMS, then Oracle will naturally try to own each step of the data pipeline.</p>
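<p>That pipeline can be sketched end to end in Python, with heavy simplification: a plain list stands in for the NoSQL click store, and SQLite (via the standard <code>sqlite3</code> module) stands in for the relational DBMS. The table and field names are made up, and nothing here is specific to any Oracle product.</p>

```python
import sqlite3
from collections import Counter

# Raw interactions, as they might be dumped into a NoSQL store.
clicks = [
    {"user": "u1", "page": "/home"},
    {"user": "u1", "page": "/cart"},
    {"user": "u2", "page": "/home"},
]

# "Massage" step: derive a small summary from the raw clicks.
clicks_per_user = Counter(click["user"] for click in clicks)

# Load only the derived data into the relational side.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_activity (user_id TEXT PRIMARY KEY, clicks INTEGER)")
conn.executemany(
    "INSERT INTO user_activity VALUES (?, ?)", sorted(clicks_per_user.items())
)
conn.commit()

summary = conn.execute(
    "SELECT user_id, clicks FROM user_activity ORDER BY user_id"
).fetchall()
```

<p>The relational database never sees the individual clicks, only the much smaller derived table, which keeps the interaction load away from the transactional system.</p>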
<p><a href="http://www.dbms2.com/2011/07/31/dynamic-fixed-schema-databases/">Dynamic schemas</a> are another area of Oracle weakness, leading in some cases to outright <a href="http://www.dbms2.com/2011/07/27/mongodb-users-and-use-cases/">Oracle replacements</a>. However, pure key-value stores go too far to the opposite extreme; you should at least be able to index and retrieve data one field at a time. Based on what I&#8217;ve seen of Oracle&#8217;s marketing literature, that feature will be missing from the first release of Oracle&#8217;s NoSQL.* Until it&#8217;s in there, and until it works well, I don&#8217;t see why anybody should use Oracle&#8217;s NoSQL product.</p>
<p><em>*Frankly, that choice makes no sense to me on any level. Yet it&#8217;s the way Oracle seems to have elected to go &#8212; or, if it isn&#8217;t, then there&#8217;s somebody writing Oracle marketing collateral who&#8217;s clearly in the wrong line of work.</em></p>
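<p>To show what indexing and retrieving data one field at a time adds to a pure key-value store, here is a toy Python sketch. The class <code>KeyValueStore</code>, its methods, and the sample records are hypothetical, and they say nothing about how Oracle&#8217;s product or any other NoSQL store is built.</p>

```python
from collections import defaultdict


class KeyValueStore:
    """Toy key-value store with optional per-field secondary indexes."""

    def __init__(self, indexed_fields=()):
        self._data = {}
        self._indexes = {field: defaultdict(set) for field in indexed_fields}

    def put(self, key, record):
        old = self._data.get(key)
        if old is not None:
            # Remove stale index entries before overwriting the record.
            for field, index in self._indexes.items():
                if field in old:
                    index[old[field]].discard(key)
        self._data[key] = record
        for field, index in self._indexes.items():
            if field in record:
                index[record[field]].add(key)

    def get(self, key):
        """Pure key-value access: the only lookup a bare store offers."""
        return self._data.get(key)

    def find(self, field, value):
        """Return sorted keys whose record has field == value."""
        if field in self._indexes:
            return sorted(self._indexes[field].get(value, ()))
        # No index on this field: fall back to scanning every record.
        return sorted(k for k, r in self._data.items() if r.get(field) == value)


store = KeyValueStore(indexed_fields=["country"])
store.put("c1", {"name": "Ann", "country": "DE"})
store.put("c2", {"name": "Bob", "country": "US"})
store.put("c3", {"name": "Cid", "country": "DE"})

german_keys = store.find("country", "DE")
store.put("c1", {"name": "Ann", "country": "FR"})
german_keys_after_update = store.find("country", "DE")
```

<p>Without the index, every field-based question turns into a scan of the whole store, which is the limitation at issue.</p>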
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/10/20/more-notes-on-oracle-nosql/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>What those nested data structures are about</title>
		<link>http://www.dbms2.com/2011/10/19/nested-data-structures/</link>
		<comments>http://www.dbms2.com/2011/10/19/nested-data-structures/#comments</comments>
		<pubDate>Wed, 19 Oct 2011 17:29:59 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5506</guid>
		<description><![CDATA[As I&#8217;ve noted before, the very big web companies have an issue with nested data structures. The subject came up in XLDB talks yesterday too, so my big goal for lunch was to finally understand what was being talked about. Sitting at a table full of eBay and LinkedIn folks turned out to be a [...]]]></description>
			<content:encoded><![CDATA[<p>As I&#8217;ve noted before, <a href="http://www.dbms2.com/2010/07/31/nested-data-structures-keep-coming-up-especially-for-log-files/">the very big web companies have an issue with nested data structures</a>. The subject came up in XLDB talks yesterday too, so my big goal for lunch was to finally understand what was being talked about. Sitting at a table full of eBay and LinkedIn folks turned out to be a good tactic.</p>
<p>The explanation was led by Oliver Ratzesberger, late of eBay* and progenitor of <a href="http://www.dbms2.com/2010/10/06/ebay-followup-greenplum-out-teradata-10-petabytes-hadoop-has-some-value-and-more/">eBay&#8217;s Singularity project</a>. In simplest terms, <strong>one event can spawn a lot of event attribute information, perhaps in the form of name-value pairs,</strong> which it then makes sense to store together in some way. The example Oliver dwelled on was that, on any given web page, there can be 100+ pieces of information to record, including:</p>
<ul>
<li>All 50 search results you were shown, and their positions in the search rankings.</li>
<li>Every ad, image, or graphical element.</li>
<li>An ID as to which test you were participating in (every page you see on eBay has some element being tested).</li>
</ul>
<p><em>*Oliver is leaving eBay for a still-secret large company. I would conjecture that Michael McIntire is on the move too, either to replace Oliver or to go with him, but Oliver did a very good job of not commenting on the matter.</em></p>
<p>There are several reasons why one might wish to store this information in ways that grieve relational purists. First, reconstructing all this information via joins would be brutally expensive. What&#8217;s more, reconstructing all this information via joins could be impractical. Some comes from third party ad servers, which might not reproduce the same ads upon demand. Other information is in the form of rankings, which can&#8217;t always be reliably reproduced from one query to the next. (That&#8217;s just one of several reasons <a href="http://www.dbms2.com/2005/12/09/relational-dbms-versus-text-data/">text search and relational DBMS are an awkward fit</a>.)</p>
<p>Also, there&#8217;s a strong <a href="http://www.dbms2.com/2011/07/31/dynamic-fixed-schema-databases/">dynamic schema</a> flavor to these databases. The list of attributes for one web click might be very different in kind from the list for the next page. Forcing that kind of variability into a fixed relational schema, while theoretically possible, doesn&#8217;t necessarily make a lot of sense.</p>
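<p>A small Python sketch may help show the shape of such data. The two events below are invented (this is not eBay&#8217;s format), and the <code>flatten</code> helper is a hypothetical way of showing how many name-value rows one nested event turns into if it is forced into a flat layout.</p>

```python
# Two invented page-view events with very different attribute lists.
events = [
    {
        "event_id": 1,
        "page": "search",
        "test_id": "T17",
        "results": [{"item": 100 + i, "position": i + 1} for i in range(50)],
    },
    {
        "event_id": 2,
        "page": "item",
        "test_id": "T17",
        "ads": [{"slot": "top", "ad_id": "x9"}],
    },
]


def flatten(event):
    """Turn one nested event into (event_id, name, value) rows, one per leaf."""
    rows = []

    def walk(prefix, value):
        if isinstance(value, dict):
            for k, v in value.items():
                walk(f"{prefix}.{k}" if prefix else k, v)
        elif isinstance(value, list):
            for i, v in enumerate(value):
                walk(f"{prefix}[{i}]", v)
        else:
            rows.append((event["event_id"], prefix, value))

    walk("", {k: v for k, v in event.items() if k != "event_id"})
    return rows


search_rows = flatten(events[0])
item_rows = flatten(events[1])
```

<p>The search event alone yields over a hundred rows, and the second event shares almost none of its attribute names, so storing each event as one nested value avoids both the joins and the sparse fixed schema.</p>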
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/10/19/nested-data-structures/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Hadoop notes</title>
		<link>http://www.dbms2.com/2011/09/12/hadoop-notes/</link>
		<comments>http://www.dbms2.com/2011/09/12/hadoop-notes/#comments</comments>
		<pubDate>Mon, 12 Sep 2011 09:03:52 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Health care]]></category>
		<category><![CDATA[Hortonworks]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapR]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5218</guid>
		<description><![CDATA[I visited California recently, and chatted with numerous companies involved in Hadoop &#8212; Cloudera, Hortonworks, MapR, DataStax, Datameer, and more. I&#8217;ll defer further Hadoop technical discussions for now &#8212; my target to restart them is later this month &#8212; but that still leaves some other issues to discuss, namely adoption and partnering. The total number [...]]]></description>
			<content:encoded><![CDATA[<p>I visited California recently, and chatted with numerous companies involved in Hadoop &#8212; Cloudera, Hortonworks, MapR, DataStax, Datameer, and more. I&#8217;ll defer further <a href="../../../../../2011/08/21/hadoop-evolution/">Hadoop technical discussions</a> for now &#8212; my target to restart them is later this month &#8212; but that still leaves some other issues to discuss, namely adoption and partnering.</p>
<p>The total number of enterprises in the world paying subscription and license fees that they would regard as being for &#8220;Hadoop or something Hadoop-related&#8221; probably is not much over 100 right now, but I&#8217;d expect to see pretty rapid growth. Beyond that, let&#8217;s divide customers into three groups:</p>
<ul>
<li>Internet businesses.</li>
<li>Traditional enterprises&#8217; internet operations.</li>
<li>Traditional enterprises&#8217; other operations.</li>
</ul>
<p>Hadoop vendors, in different mixes, claim to be doing well in all three segments. Even so, almost all use cases involve some kind of <a href="../../../../../2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a>, with one exception being a credit card vendor crunching a large database of transaction details. Multiple kinds of machine-generated data come into play &#8212; web/network/mobile device logs, financial trade data, scientific/experimental data, and more. In particular, pharmaceutical research got some mentions, which makes sense, in that it&#8217;s one area of scientific research that actually enjoys fat for-profit research budgets.</p>
<p><span id="more-5218"></span>On the partnering side, I heard things about a Hortonworks conference call that do not seem to have been contradicted by my visit to Hortonworks. Namely, Hortonworks promised prospective partners, such as analytic DBMS vendors, hardware vendors, or large system integrators, that it wouldn&#8217;t compete with them, in that Hortonworks pledges not to introduce its own products for at least two years. This is presumably targeted most directly at <a href="../../../../../2010/10/10/partnering-with-cloudera/">Cloudera</a>, which has lots of partners, but also some <a href="../../../../../2010/06/30/cloudera-enterprise-hadoop-evolution/">proprietary code</a> of its own. MapR, I&#8217;d think, would be the #2 target, but that&#8217;s just speculation.</p>
<p>The other big part of <a href="../../../../../2011/07/10/cloudera-and-hortonworks/">Hortonworks&#8217; story</a> is the claim that it holds the axe in Apache Hadoop development. Nobody doubts that a large fraction of the work on Hadoop&#8217;s core projects was done by Yahoo employees. Many of those indeed moved to Hortonworks; others left Yahoo earlier; Hadoop creator Doug Cutting is actually at Cloudera. So just how dominant Hortonworks really is in core Hadoop development is a bit unclear. Meanwhile, Cloudera people seem to be leading a number of Hadoop companion or sub-projects, including the first two I can think of that relate to Hadoop integration or connectivity, namely Sqoop and Flume. So I&#8217;m not persuaded that the &#8220;we know this stuff better&#8221; part of the Hortonworks partnering story really holds up.</p>
<p>What I am persuaded of is that the Hadoop platform competition is a good thing. Whichever vendors and projects win will be healthier from having had to outcompete worthy alternatives.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/09/12/hadoop-notes/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Aster Data business trends</title>
		<link>http://www.dbms2.com/2011/09/08/aster-data-business-trends/</link>
		<comments>http://www.dbms2.com/2011/09/08/aster-data-business-trends/#comments</comments>
		<pubDate>Thu, 08 Sep 2011 05:33:56 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Application areas]]></category>
		<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[DataStax]]></category>
		<category><![CDATA[Liberty and privacy]]></category>
		<category><![CDATA[RDF and graphs]]></category>
		<category><![CDATA[Teradata]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5204</guid>
		<description><![CDATA[Last month, I reviewed with the Aster Data folks which markets they were targeting and selling into, subsequent to acquisition by their new orange overlords. The answers aren&#8217;t what they used to be. Aster no longer focuses much on what it used to call frontline (i.e., low-latency, operational) applications; those are of course a key [...]]]></description>
			<content:encoded><![CDATA[<p>Last month, I reviewed with the Aster Data folks which markets they were targeting and selling into, subsequent to <a href="../../../../../2011/03/04/teradata-aster-data-ncluster/">acquisition</a> by their new orange overlords. The answers aren&#8217;t what they used to be. Aster no longer focuses much on what it used to call <a href="../../../../../2008/10/22/aster-data-systems-ncluster/">frontline</a> (i.e., low-latency, operational) applications; those are of course a key strength for Teradata. Rather, Aster focuses on <a href="../../../../../2011/03/03/investigative-analytics/">investigative analytics</a> &#8212; they&#8217;ve long <a href="../../../../../2011/02/12/upcoming-webinar-on-investigative-analytics/">endorsed</a> my use of the term &#8212; and on the batch run/scoring kinds of applications that inform operational systems.</p>
<p><span id="more-5204"></span>Also, Aster no longer focuses much on the general internet industry where it got its earliest sales, its <a href="../../../../../2011/09/05/zynga-linkedin-data-warehous/">continued success at LinkedIn</a> and a recent win at <span style="text-decoration: line-through;">an (NDA) fairly-big-name internet new account</span> <em>Razorfish</em> notwithstanding. That said, the first target market Aster did share with me was &#8220;digital marketing optimization,&#8221; which includes &#8220;marketing optimization&#8221; (duh), search engine optimization (SEO), clickstream analysis, and the like. Also, Aster is going after &#8220;data scientists&#8221; in general, and that&#8217;s a term I&#8217;m still seeing used most frequently in the internet area.</p>
<p><em>I&#8217;m seeing ever more granularity as companies break down internet-related market segments. DataStax showed me a chart last week of 15 different market segments it had sold into, and at least 14 were in some way internet-related.</em></p>
<p>Rather, if Aster is to name three industries in which it has pleasingly strong sales traction, it would say manufacturing (which in Teradata lingo includes resource extraction), financial services (including insurance), and retail. A cynic might note that that breakdown, like many similar ones, adds up to fairly large swaths of the economy and the computer market, but never mind that part. (Other firms might have thrown in telecommunications and health care as well, to get even more coverage.)</p>
<p>Two of Aster&#8217;s other favorite application areas are social network analysis/influencer identification and &#8212; which is analytically very similar &#8212; fraud detection/prevention. Taken together, that&#8217;s a whole lot of graph analysis. And I note with interest that the influencer identification stuff does NOT seem to be concentrated in telecom, which is the traditional sector one would imagine it being used in; all those call records are a lovely source of graph edges. Rather, the influencers seem to be identified from sources such as social media and credit card data.</p>
<p><em>Once again, this kind of thing gives me privacy jitters.</em></p>
<p>The match between Aster&#8217;s favorite industries and application areas is pretty much as you might expect &#8212; fraud in financial services, influencer analysis in retailing (and probably consumer financial services too), and digital marketing in both. As for manufacturing, the opportunities there seem to be focused on machine-generated data. That would be at least in high-tech manufacturing (I bet especially in flow-oriented stuff such as semiconductor fab) and oil/gas. Smart grid opportunities don&#8217;t seem to have arisen yet for Aster the way they have for a couple of other vendors.</p>
<p>As for general Aster business trends, I think they&#8217;re good, while Aster would perhaps want to portray them as very good. Aster named a couple of impressive joint Teradata/Aster wins under NDA, but only a couple. Ramping up sales headcount is proving challenging, and some sales leadership turnover probably hasn&#8217;t helped. I do believe Aster&#8217;s spin that this is a matter of somebody being promoted quickly to a bigger job, and am optimistic about the current team &#8212; still, such moves tend to have at least short-term cost.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/09/08/aster-data-business-trends/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Couchbase business update</title>
		<link>http://www.dbms2.com/2011/08/13/couchbase-business-update/</link>
		<comments>http://www.dbms2.com/2011/08/13/couchbase-business-update/#comments</comments>
		<pubDate>Sun, 14 Aug 2011 04:02:42 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Basho and Riak]]></category>
		<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[CouchDB]]></category>
		<category><![CDATA[Couchbase]]></category>
		<category><![CDATA[Games and virtual worlds]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Mid-range]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[Structured documents]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[memcached]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5080</guid>
		<description><![CDATA[I decided I needed some Couchbase drilldown, on business and technology alike, so I had solid chats with both CEO Bob Wiederhold and Chief Architect Dustin Sallings. Pretty much everything I wrote at the time Membase and CouchOne merged to form Couchbase (the company) still holds up. But I have more detail now. Context for [...]]]></description>
			<content:encoded><![CDATA[<p>I decided I needed some Couchbase drilldown, on business and technology alike, so I had solid chats with both CEO Bob Wiederhold and Chief Architect Dustin Sallings. Pretty much everything I wrote at the time <a href="../../../../../2011/02/08/couchbase-membase-couchone-couchdb/">Membase and CouchOne merged to form Couchbase</a> (the company) still holds up. But I have more detail now. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<p>Context for any comments on customer traction includes:</p>
<ul>
<li>Membase went into limited production release in October, and full release in January. Similar things are true of CouchDB.</li>
<li>Hence, most sales of Couchbase&#8217;s products have been made over the past 6 months.</li>
<li>Couchbase (the merged product) is at this point only in a pre-production developer&#8217;s release.</li>
<li>Couchbase has both a direct sales force and a classic open-source &#8220;funnel&#8221;-based online selling model. Naturally, Couchbase&#8217;s understanding of what its customers are doing is more solid with respect to the direct sales base.</li>
<li>Most of Couchbase&#8217;s revenue to date seems to have come from a limited number of big-ticket &#8220;lighthouse&#8221; accounts (as opposed to, say, the larger number of smaller deals that come in through the online funnel).</li>
</ul>
<p>That said,</p>
<ul>
<li>Most Membase purchases are for new applications, as opposed to memcached migrations. However, customers are the kinds of companies that probably also are using memcached elsewhere.</li>
<li>Most other Membase purchases are replacements for the memcached/MySQL combination. Bob says those are easy sales with short sales cycles.</li>
<li>Pure memcached support is a small but non-zero business for Couchbase, and a fine source of upsell opportunities.</li>
<li>In the pipeline but not so much yet in the customer base are SaaS vendors and the like who use and may want to replace traditional DBMS such as Oracle. Other than among those, Couchbase doesn&#8217;t compete much yet with Oracle et al.</li>
<li>Pure CouchDB isn&#8217;t all that much of a business, at least relative to community size, as CouchDB is a single-server product commonly used by people who are content not to pay for support.</li>
</ul>
<p>Membase sales are concentrated in five kinds of internet-centric companies, which in declining order are: <span id="more-5080"></span></p>
<ul>
<li>Social gaming</li>
<li>Ad platforms</li>
<li>Online retail</li>
<li>Online business, including B2B SaaS</li>
<li>Social networking</li>
</ul>
<p>Bob said that Couchbase often sees MongoDB competitively, but never Riak, HBase, or Redis. I got the impression Couchbase sees at least a little Cassandra. That would, of course, all pertain only to direct sales, rather than download/community kinds of usage.</p>
<p>Couchbase is also excited about the potential for the CouchDB-based Couchbase Mobile occasionally-connected offering. The hottest use cases, interestingly, seem to be non-consumer; Bob rattled off military, farming, and health care, and surely could have named more besides. However, the Couchbase Mobile sales effort still seems to be in early days, as is evidenced by the fact that Couchbase has not yet competitively encountered <a href="../../../../../2010/07/17/sybase-sql-anywhere/">Sybase SQL Anywhere</a>.</p>
<p>With all that said, I&#8217;ll go now to a separate post for a <a href="http://www.dbms2.com/2011/08/13/couchbase-technical-update/">Couchbase technical update</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/08/13/couchbase-business-update/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>MongoDB users and use cases</title>
		<link>http://www.dbms2.com/2011/07/27/mongodb-users-and-use-cases/</link>
		<comments>http://www.dbms2.com/2011/07/27/mongodb-users-and-use-cases/#comments</comments>
		<pubDate>Wed, 27 Jul 2011 18:14:36 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Games and virtual worlds]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MongoDB and 10gen]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5031</guid>
		<description><![CDATA[I spoke with Eliot Horowitz and Max Schierson of 10gen last month about MongoDB users and use cases. The biggest clusters they came up with weren&#8217;t much over 100 nodes, but clusters an order of magnitude bigger were under development. The 100 node one we talked the most about had 33 replica sets, each with [...]]]></description>
			<content:encoded><![CDATA[<p>I spoke with Eliot Horowitz and Max Schierson of 10gen last month about MongoDB users and use cases. The biggest clusters they came up with weren&#8217;t much over 100 nodes, but clusters an order of magnitude bigger were under development. The 100-node one we talked the most about had 33 replica sets, each with about 100 gigabytes of data, so that&#8217;s in the 3-4 terabyte range total. In general, the largest MongoDB databases are 20-30 TB; I&#8217;d guess those really do use the bulk of available disk space. <span id="more-5031"></span></p>
<p>10gen recommends solid-state storage in many cases. In some cases solid-state lets you get away with fewer total nodes. 10gen also likes Flashcache (Facebook-developed technology to put a flash cache in front of hard disks). But the 100-node example mentioned above uses spinning disk.</p>
<p>Use cases 10gen is proud of include:</p>
<ul>
<li>Lots of user profile maintenance, including at online ad companies. This includes full user ad impression data. (I&#8217;ve argued for a while that <a href="../../../../../2010/09/17/jp-morgan-chase-oracle-database-outage/">user profile information belongs in something like a NoSQL database</a>.)</li>
<li>A big-name web company that wants to inspect every packet that enters their network, and replaced Splunk with MongoDB for performance reasons.</li>
<li>A big-name photo/video site whose metadata is all in MongoDB. (That&#8217;s the kind of thing that often makes for good <a href="../../../../../2011/05/30/another-category-of-derived-data/">MarkLogic</a> use cases.)</li>
</ul>
<p>But actually, the reason we had the call was to review cases where MongoDB&#8217;s <strong>schemaless</strong> nature was significant. Examples of those included:</p>
<ul>
<li>A couple of top examples were of the kind &#8220;A bunch of apps, similar but not the same.&#8221; For MTV, it&#8217;s a single content management system for a bunch of websites. For Disney Playdom, it&#8217;s different schemas for every game.</li>
<li>For a wireless telco, the issue was a product catalog in which devices and service plans called for very different schemas, and which the telco felt had thus become unmanageable in Oracle.</li>
<li>For Craigslist, the issue wasn&#8217;t programming so much as performance &#8212; <a href="http://blog.zawodny.com/2010/04/27/i-want-a-new-data-store/">ALTER TABLE operations took months in MySQL</a>, and that&#8217;s not a typo, although I&#8217;ll confess to not understanding why this was the case.</li>
</ul>
<p>The 10gen guys went on to claim that schemalessness is helpful for incremental development in general, the point being that you don&#8217;t have a database-modification step. To some extent, changes can even be rolled back more easily than if you actually changed your schemas.</p>
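<p>As a rough illustration of that incremental-development claim, the sketch below uses plain Python dictionaries in place of schemaless documents. It does not use any MongoDB driver, and the field names and records are hypothetical. A newer version of an application starts writing an extra field; readers simply tolerate its absence in older documents, and nothing like an ALTER TABLE step is needed.</p>

```python
# Plain dictionaries stand in for schemaless documents; no MongoDB driver is used.
# All field names and values are hypothetical.
games = [
    {"_id": 1, "title": "Game A", "score": 10},                     # written by app v1
    {"_id": 2, "title": "Game B", "score": 7, "badges": ["gold"]},  # written by app v2
]

def badges_of(doc):
    """v2 reader: tolerate documents that predate the 'badges' field."""
    return doc.get("badges", [])

def title_and_score(doc):
    """v1 reader: uses only the original fields and ignores anything newer."""
    return (doc["title"], doc["score"])

all_badges = [badges_of(doc) for doc in games]
v1_view = [title_and_score(doc) for doc in games]
```

<p>Rolling back in this sketch means redeploying the v1 reader, which still works against every document because it ignores the extra field. The cost is that the tolerance logic lives in application code rather than in the database.</p>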
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/27/mongodb-users-and-use-cases/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
	</channel>
</rss>

