<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; Data models and architecture</title>
	<atom:link href="http://www.dbms2.com/category/database-theory-practice/data-models/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Thu, 09 Feb 2012 09:21:51 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>WibiData, derived data, and analytic schema flexibility</title>
		<link>http://www.dbms2.com/2012/02/06/wibidata-derived-data-and-analytic-schema-flexibility/</link>
		<comments>http://www.dbms2.com/2012/02/06/wibidata-derived-data-and-analytic-schema-flexibility/#comments</comments>
		<pubDate>Tue, 07 Feb 2012 03:18:25 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Odiago and WibiData]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5907</guid>
		<description><![CDATA[My clients at Odiago, vendors of WibiData, have changed their company name simply to WibiData. Even better, they blogged with more detail as to how WibiData works, in what is essentially a follow-on to my original WibiData post last October. Among other virtues, WibiData turns out to be a poster child for my views on [...]]]></description>
			<content:encoded><![CDATA[<p>My clients at Odiago, vendors of WibiData, have changed their company name simply to WibiData. Even better, they blogged with more detail as to <a href="http://www.wibidata.com/2012/02/07/how-wibidata-works/">how WibiData works</a>, in what is essentially a follow-on to <a href="../../../../../2011/11/02/5576/">my original WibiData post</a> last October. Among other virtues, WibiData turns out to be a poster child for my views on <a href="../../../../../2011/09/06/derived-data-progressive-enhancement-and-schema-evolution/">derived data and the corresponding schema evolution</a>.</p>
<p>Interesting quotes include:</p>
<blockquote><p>WibiData is designed to store &#8230; transactional data side-by-side with profile and other derived data attributes.</p></blockquote>
<blockquote><p>&#8230; the ability to add new ad-hoc columns to a table enables more flexible analysis: output data that is the result of one analytic pipeline is stored adjacent to its input data, meaning that you can easily use this as input to second- or third-order derived data as well.</p></blockquote>
<blockquote><p>schemas can vary over time; you can easily add a field to a record, or delete a field. &#8230; But even though you start collecting that new data, your existing analysis pipelines can treat records like they always did; programs that don’t yet know about the new cookie are still compatible with both the old records already collected, and the new records with the additional field. New programs fill in default values for old data recorded before a field was added, applying the new schema at read time.</p></blockquote>
<blockquote><p>schemas for every column are stored in a data dictionary that matches column names with their schemas, as well as human-readable descriptions of the data.</p></blockquote>
<p>Interesting aspects of the post that don&#8217;t lend themselves as well to being excerpted include:</p>
<ul>
<li>How the Produce-Gather &#8220;analysis calculus&#8221; &#8212; i.e. framework &#8212; works.</li>
<li>How this all ties into Apache projects (and sub-projects) such as Hadoop, HBase, and Avro.</li>
</ul>
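<p>The read-time schema idea in the quotes above can be sketched in a few lines. This is only an illustration under stated assumptions, not WibiData&#8217;s or Avro&#8217;s actual API: the schema dictionaries, field names, and the <code>read_with_schema</code> helper are all hypothetical.</p>

```python
# Hypothetical sketch of applying a schema at read time; not WibiData or Avro code.
from typing import Any

# A reader schema maps each field name to the default used when a record lacks it.
READER_SCHEMA_V1: dict[str, Any] = {"user_id": None, "page": ""}
READER_SCHEMA_V2: dict[str, Any] = {
    "user_id": None,
    "page": "",
    "new_cookie": "unknown",  # field added after some records were already written
}


def read_with_schema(stored: dict[str, Any], schema: dict[str, Any]) -> dict[str, Any]:
    """Project a stored record onto a reader schema, filling defaults for missing fields."""
    return {field: stored.get(field, default) for field, default in schema.items()}


old_record = {"user_id": 17, "page": "/home"}  # written before new_cookie existed
new_record = {"user_id": 18, "page": "/cart", "new_cookie": "abc123"}

# A new program reads an old record and gets the default for the new field.
print(read_with_schema(old_record, READER_SCHEMA_V2))
# An old program reads a new record and simply never sees the extra field.
print(read_with_schema(new_record, READER_SCHEMA_V1))
```

<p>The design point is that nothing stored is rewritten when a field is added; each program brings its own schema to the data it reads.</p>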
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/02/06/wibidata-derived-data-and-analytic-schema-flexibility/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Splunk update</title>
		<link>http://www.dbms2.com/2012/01/10/splunk-update/</link>
		<comments>http://www.dbms2.com/2012/01/10/splunk-update/#comments</comments>
		<pubDate>Tue, 10 Jan 2012 05:55:08 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Structured documents]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5791</guid>
		<description><![CDATA[Splunk is announcing the Splunk 4.3 point release. Before discussing it, let&#8217;s recall a few things about Splunk, starting with: Splunk is first and foremost an analytic DBMS &#8230; &#8230; used to manage logs and similar multistructured data. Splunk&#8217;s DML (Data Manipulation Language) is based on text search, not on SQL. Splunk has extended its [...]]]></description>
			<content:encoded><![CDATA[<p>Splunk is announcing the Splunk 4.3 point release. Before discussing it, let&#8217;s recall a few things about Splunk, starting with:</p>
<ul>
<li>Splunk is first and foremost an analytic DBMS &#8230;</li>
<li>&#8230; used to manage logs and similar multistructured data.</li>
<li>Splunk&#8217;s DML (Data Manipulation Language) is based on text search, not on SQL.</li>
<li>Splunk has extended its DML in natural ways (e.g., you can use it to do calculations and even some statistics).</li>
<li>Splunk bundles some (very) basic, Splunk-specific business intelligence capabilities.</li>
<li>The paradigmatic use of Splunk is to monitor IT operations in real time. However:
<ul>
<li>There also are plenty of non-real-time uses for Splunk.</li>
<li>Splunk is proudest of its growth in non-IT quasi-real-time uses, such as the marketing side of web operations.</li>
</ul>
</li>
</ul>
<p>As in any release, a lot of Splunk 4.3 is about &#8220;Oh, you didn&#8217;t have that before?&#8221; features and <a href="../../../../../2009/08/21/bottleneck-whack-a-mole/">Bottleneck Whack-A-Mole</a> performance speed-up. One performance enhancement is Bloom filters, which are a very hot topic these days. More important is a switch from Flash to HTML5, so as to accommodate mobile devices with less server-side rendering. Splunk reports that its users &#8212; especially the non-IT ones &#8212; really want to get Splunk information on tablet devices. While this somewhat contradicts <a href="../../../../../2012/01/04/some-issues-in-business-intelligence/">what I wrote a few days ago pooh-poohing mobile BI</a>, let me hasten to point out:</p>
<ul>
<li>Splunk is used for a lot of (quasi) real-time monitoring.</li>
<li>Splunk&#8217;s desktop user interfaces are, by BI standards, quite primitive.</li>
</ul>
<p>That&#8217;s pretty much the ideal scenario for mobile BI: Timeliness matters and prettiness doesn&#8217;t.</p>
<p><span id="more-5791"></span><em>Hmm. Maybe <a href="../../../../../2011/11/10/streambase-liveview-push-based-real-time-bi/">StreamBase LiveView</a> needs a mobile option as well &#8230;</em></p>
<p>Splunk&#8217;s basic use is to take the text string that is a log and make sense of it. But Splunk now also supports JSON structures. It does this via something called spath, which as you might guess from the name has XPath similarities. That probably bore more discussion than we found the time to have.</p>
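<p>To give a flavor of what path-based extraction from JSON looks like in general, here is a minimal sketch. It is not Splunk&#8217;s spath syntax or implementation; the dotted-path convention, the <code>extract_path</code> helper, and the sample log line are assumptions made purely for illustration.</p>

```python
# Hypothetical path extraction over parsed JSON; the dotted syntax is an assumption.
import json
from typing import Any


def extract_path(document: Any, path: str) -> Any:
    """Walk a parsed JSON value along a dotted path; numeric steps index into lists."""
    current = document
    for step in path.split("."):
        if isinstance(current, list):
            current = current[int(step)]
        else:
            current = current[step]
    return current


log_line = '{"request": {"user": "alice", "items": [{"sku": "A1"}, {"sku": "B2"}]}}'
event = json.loads(log_line)

print(extract_path(event, "request.user"))
print(extract_path(event, "request.items.1.sku"))
```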
<p><em>By the way: If you&#8217;re interested in BI over XML, that&#8217;s what my former clients at Skytide were founded to do, before they pivoted a bit. I don&#8217;t think those capabilities have disappeared from the product</em>.</p>
<p><a href="http://www.monash.com/uploads/Splunk-4-3.pdf">Splunk has graciously allowed me to post a slide deck</a>. More stuff in there, including quotes from a customer &#8212; Expedia &#8212; that has 2700 Splunk users.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/01/10/splunk-update/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Big data terminology and positioning</title>
		<link>http://www.dbms2.com/2012/01/08/big-data-terminology-and-positioning/</link>
		<comments>http://www.dbms2.com/2012/01/08/big-data-terminology-and-positioning/#comments</comments>
		<pubDate>Mon, 09 Jan 2012 01:35:57 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5768</guid>
		<description><![CDATA[Recently, I observed that Big Data terminology is seriously broken. It is reasonable to reduce the subject to two quasi-dimensions: Bigness &#8212; Volume, Velocity, size Structure &#8212; Variety, Variability, Complexity given that High-velocity &#8220;big data&#8221; problems are usually high-volume as well.* Variety, variability, and complexity all relate to the simply-structured/poly-structured distinction. But the conflation should [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, I observed that <a href="../../../../../2011/09/11/big-data-has-jumped-the-shark/">Big Data terminology is seriously broken</a>. It is reasonable to reduce the subject to two quasi-dimensions:</p>
<ul>
<li><strong>Bigness</strong> &#8212; Volume, Velocity, size</li>
<li><strong>Structure</strong> &#8212; Variety, Variability, Complexity</li>
</ul>
<p>given that</p>
<ul>
<li>High-velocity &#8220;big data&#8221; problems are usually high-volume as well.*</li>
<li>Variety, variability, and complexity all relate to the <a href="../../../../../2011/05/17/poly-structured-database/">simply-structured/poly-structured</a> distinction.</li>
</ul>
<p>But the conflation should stop there.</p>
<p><em>*Low-volume/high-velocity problems are commonly referred to as <a href="../2011/08/25/renaming-cep-or-not/">&#8220;event processing&#8221; and/or &#8220;streaming&#8221;</a>.</em></p>
<p>When people claim that bigness and structure are the same issue, they oversimplify into mush. So I think we need four pieces of terminology, reflective of a 2&#215;2 matrix of possibilities. For want of better alternatives, my suggestions are:</p>
<ul>
<li><strong>Relational big data</strong> is data of high volume that fits well into a relational DBMS.</li>
<li><strong>Multi-structured big data</strong> is data of high volume that doesn&#8217;t fit well into a relational DBMS. <em>Alternative: Poly-structured big data.</em></li>
<li><strong>Conventional relational data</strong> is data of not-so-high volume that fits well into a relational DBMS. <em>Alternatives: Ordinary/normal/smaller relational data.</em></li>
<li><strong>Smaller poly-structured data</strong> is data for which <a href="../../../../../2011/07/31/dynamic-fixed-schema-databases/">dynamic schema</a> capabilities are important, but which doesn&#8217;t rise to &#8220;big data&#8221; volume.</li>
</ul>
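<p>The 2&#215;2 matrix can be written down as a tiny lookup. The function name and its two boolean inputs are hypothetical; deciding what counts as &#8220;high volume&#8221; or &#8220;fits well&#8221; remains a judgment call that this sketch does not attempt to make.</p>

```python
# Hypothetical encoding of the four suggested terms as a function of two judgments.
def classify(high_volume: bool, fits_relational: bool) -> str:
    """Return the suggested term for one cell of the bigness-by-structure matrix."""
    if high_volume:
        return "relational big data" if fits_relational else "multi-structured big data"
    if fits_relational:
        return "conventional relational data"
    return "smaller poly-structured data"


for volume in (True, False):
    for relational in (True, False):
        print(volume, relational, classify(volume, relational))
```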
<p><span id="more-5768"></span>Notes on all this include:</p>
<ul>
<li>&#8220;Relational big data&#8221; is commonly what you need a scalable analytic relational DBMS for. But there are non-analytic use cases as well.</li>
<li>The paradigmatic example of &#8220;multi-structured big data&#8221; is log files. Thus, multi-structured big data is commonly what you need a <a href="../../../../../2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a> for.</li>
<li>One might want to equate non-analytic relational big data technology to &#8220;NewSQL&#8221;. However, I&#8217;m struggling to think of a database size range in which the entire NewSQL industry can match Oracle&#8217;s market share alone.</li>
<li>One might want to equate non-analytic multi-structured big data technology to &#8220;NoSQL&#8221;. However:
<ul>
<li>&#8220;NoSQL&#8221; is also used to encompass not-so-big-data use cases, such as prototyping in MongoDB.</li>
<li><a href="../../../../../2011/10/02/defining-nosql/">&#8220;NoSQL&#8221; has non-ACID/low(er)-data-integrity connotations</a> that aren&#8217;t appropriate for all non-relational systems.</li>
</ul>
</li>
<li>Up to a point, you can analyze relational big data in a conventional relational DBMS, but an analytic RDBMS will usually win on TCO (Total Cost of Ownership). In particular, reasonable thresholds for moving an analytic database off Oracle might be:
<ul>
<li>1-2 terabytes if you&#8217;ve never bought anything past Oracle Standard Edition.</li>
<li>5-10 terabytes if you&#8217;re already paying for Oracle Enterprise Edition.</li>
<li>A lot higher than that if you actually find Oracle Exadata to be cost-effective.</li>
</ul>
</li>
<li>Depending on how big one acknowledges as &#8220;big&#8221;, the market share leader in &#8220;big bit bucket&#8221; use cases is either Splunk or Hadoop.</li>
<li>If we look at multi-structured big data management overall, MarkLogic joins the list of market share contenders, as do various NoSQL alternatives.</li>
<li>It is wrong to say that the large web companies invented &#8220;big data&#8221; technology. But it is more reasonable to say they invented much of &#8220;multi-structured big data&#8221; management. In particular (and this is just a partial list), Google, Amazon, Yahoo, Facebook, et al. can reasonably be credited with Hadoop, Cassandra, HBase and various predecessors to same.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/01/08/big-data-terminology-and-positioning/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>The cool aspects of Odiago WibiData</title>
		<link>http://www.dbms2.com/2011/11/02/5576/</link>
		<comments>http://www.dbms2.com/2011/11/02/5576/#comments</comments>
		<pubDate>Wed, 02 Nov 2011 15:05:01 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Odiago and WibiData]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5576</guid>
		<description><![CDATA[Christophe Bisciglia and Aaron Kimball have a new company. It&#8217;s called Odiago, and is one of my gratifyingly more numerous tiny clients. Odiago&#8217;s product line is called WibiData, after the justly popular We Be Sushi restaurants. We&#8217;ve agreed on a split exclusive de-stealthing launch. You can read about the company/founder/investor stuff on TechCrunch. But this [...]]]></description>
			<content:encoded><![CDATA[<p>Christophe Bisciglia and Aaron Kimball have a new company.</p>
<ul>
<li>It&#8217;s called Odiago, and is one of my gratifyingly more numerous tiny clients.</li>
<li>Odiago&#8217;s product line is called <a href="http://www.wibidata.com/">WibiData</a>, after the justly popular We Be Sushi restaurants.</li>
<li>We&#8217;ve agreed on a split exclusive de-stealthing launch. You can read about the company/founder/investor stuff on <a href="http://techcrunch.com/2011/11/02/cloudera-founder-debuts-big-data-management-and-analysis-platform-wibidata-with-backing-from-eric-schmidt/">TechCrunch</a>. But this is the place for &#8212; well, for the tech crunch.</li>
</ul>
<p><strong>WibiData is designed for management of, <a href="../../../../../2011/03/03/investigative-analytics/">investigative analytics</a> on, and operational analytics on consumer internet data,</strong> the main examples of which are web site traffic and personalization and their analogues for games and/or mobile devices. The core WibiData technology, built on HBase and Hadoop,* is <strong>a data management and analytic execution layer.</strong> That&#8217;s where the secret sauce resides. Also included are:</p>
<ul>
<li>REST APIs for interactive access.</li>
<li>Import/export tools, including JDBC access.</li>
<li>Management tools.</li>
<li>Analytic libraries &#8212; data mining, predictive analytics, machine learning, and so on.</li>
</ul>
<p>The whole thing is in beta, with about three (paying) beta customers.</p>
<p><em>*And Avro and so on.</em></p>
<p>The core ideas of WibiData include:</p>
<ul>
<li><strong>ALL data pertaining to a single user </strong>(or mobile device) <strong>is kept in a single, </strong>possibly very long,<strong> HBase row.</strong></li>
<li>There are two primary operators in WibiData, <strong>Produce </strong>and <strong>Gather.</strong>
<ul>
<li>Produce operates on single rows. It can operate on one row at HBase speed (milliseconds) if you need to inform an interactive user response. Or it can operate on the whole database in batch via Hadoop MapReduce.</li>
<li>It is reasonable to think of Produce as mainly doing two things. One is the aforementioned serving of data out of WibiData into interactive applications. The other is scoring, classifying, recommending, etc. on individual users (i.e. rows), in line with an analytic model.</li>
<li>Gather typically operates on all your rows at once, and emits suitable input for a MapReduce Reduce step. It is reasonable to think of Gather as being a key cog in the training of analytic models.</li>
</ul>
</li>
<li>HBase <strong>schema management is done at the WibiData system level,</strong> not directly in applications. There&#8217;s a WibiData HBase data dictionary, powered by a set of system tables, that specifies cell data types/record types and, in effect, primitive schemas.</li>
</ul>
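<p>A rough way to picture the Produce/Gather split is the sketch below. It is plain Python over dictionaries, not WibiData, HBase, or Hadoop code; the function names, column names, and sample rows are all assumptions, and the scoring rule is a placeholder for a real analytic model.</p>

```python
# Hypothetical sketch of a per-row operator (produce) and an all-rows operator (gather).
from collections import defaultdict
from typing import Iterable, Iterator

Row = dict  # stands in for one user's single, possibly very long, HBase row


def produce(row: Row) -> Row:
    """Per-row operator: compute a derived value and store it beside its inputs."""
    clicks = row.get("clicks", [])
    purchases = row.get("purchases", [])
    scored = dict(row)
    scored["derived:purchase_rate"] = len(purchases) / len(clicks) if clicks else 0.0
    return scored


def gather(row: Row) -> Iterator[tuple[str, int]]:
    """Emit (key, value) pairs from one row, suitable as input to a reduce step."""
    yield (row.get("country", "unknown"), len(row.get("clicks", [])))


def reduce_sum(pairs: Iterable[tuple[str, int]]) -> dict[str, int]:
    """Toy reduce step: sum the values emitted for each key."""
    totals: dict[str, int] = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


table = {
    "user1": {"country": "US", "clicks": ["a", "b", "c", "d"], "purchases": ["a"]},
    "user2": {"country": "US", "clicks": ["x"], "purchases": []},
    "user3": {"country": "DE", "clicks": ["y", "z"], "purchases": ["y", "z"]},
}

# Interactive use: run produce on the one row an application is asking about.
single = produce(table["user1"])
print(single["derived:purchase_rate"])

# Batch use: run produce over every row, then gather and reduce across all rows.
scored_table = {key: produce(row) for key, row in table.items()}
totals = reduce_sum(pair for row in scored_table.values() for pair in gather(row))
print(totals)
```

<p>The same per-row function serves both the interactive and the batch case, which is the property the bullets above describe for Produce.</p>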
<p><span id="more-5576"></span>WibiData-enhanced HBase differs from relational DBMS in most of the ways you would imagine, both good and bad. In particular:</p>
<ul>
<li>Depending on how you look at it, WibiData-enhanced HBase either has no DML (Data Manipulation Language) at all, or else has one that&#8217;s a lot less rich than SQL.</li>
<li>WibiData-enhanced HBase schemas are much more <a href="../../../../../2011/07/31/dynamic-fixed-schema-databases/">dynamic</a> than SQL schemas.</li>
<li>WibiData-enhanced HBase schemas can have nested or recursive data structures, such as array-valued cells.</li>
</ul>
<p>To expand on each of those points in turn:</p>
<p>WibiData&#8217;s underlying one-giant-table philosophy notwithstanding, there are times you manage multiple tables with it. (For example, you ingest data into WibiData however you can, and then run transformations &#8212; typically batch &#8212; until the data is in the preferred structure.) While WibiData does have ways to simulate joins, foreign keys, and so on, there&#8217;s nothing resembling referential integrity or foreign key constraints.</p>
<p><strong>WibiData takes single-table schema flexibility to an extreme.</strong> Not only can different rows in the same table have different associated columns &#8212; something that relational systems can in effect also do via NULL values &#8212; but schemas can even change over the life of a column. If you have an array-valued cell storing the results of a marketing campaign, and you start recording more data partway through the campaign, then different rows in the table will, in the same column, hold different-sized arrays.</p>
<p>That nesting can also get pretty serious; <strong>where you’d have a single value in a relational table, you might have the equivalent of a whole relational table (or at least selection/view) in WibiData-enhanced HBase. </strong>For example, if a user visits the same web page ten times, and each time 50 attributes are recorded (including a timestamp), all 500 data – to use the word “data” in its original “plural of <em>datum</em>” sense – would likely be stored in the same WibiData cell.</p>
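<p>The ten-visits-by-50-attributes example can be made concrete as follows. This is a sketch in plain Python, not WibiData or HBase code; the column name, attribute names, and start time are invented for illustration.</p>

```python
# Hypothetical picture of one cell holding the equivalent of a small relational table.
from datetime import datetime, timedelta

ATTRIBUTES_PER_VISIT = 50  # includes the timestamp
start = datetime(2011, 11, 2, 15, 0, 0)


def make_visit(n: int) -> dict:
    """Build one visit record: a timestamp plus 49 other made-up attributes."""
    visit = {"timestamp": (start + timedelta(minutes=n)).isoformat()}
    for i in range(ATTRIBUTES_PER_VISIT - 1):
        visit[f"attr_{i}"] = n * 100 + i
    return visit


# One user's row; a single column holds every visit to the same page.
row = {"page_visits:/home": [make_visit(n) for n in range(10)]}

cell = row["page_visits:/home"]
total_data = sum(len(visit) for visit in cell)
print(len(cell), "visits,", total_data, "data in one cell")
```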
<p>That’s about all Odiago is disclosing about WibiData right now. Christophe will also be talking at Hadoop World next week, and presumably can be hit up with any burning questions then.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/02/5576/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>What those nested data structures are about</title>
		<link>http://www.dbms2.com/2011/10/19/nested-data-structures/</link>
		<comments>http://www.dbms2.com/2011/10/19/nested-data-structures/#comments</comments>
		<pubDate>Wed, 19 Oct 2011 17:29:59 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[eBay]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5506</guid>
		<description><![CDATA[As I&#8217;ve noted before, the very big web companies have an issue with nested data structures. The subject came up in XLDB talks yesterday too, so my big goal for lunch was to finally understand what was being talked about. Sitting at a table full of eBay and LinkedIn folks turned out to be a [...]]]></description>
			<content:encoded><![CDATA[<p>As I&#8217;ve noted before, <a href="http://www.dbms2.com/2010/07/31/nested-data-structures-keep-coming-up-especially-for-log-files/">the very big web companies have an issue with nested data structures</a>. The subject came up in XLDB talks yesterday too, so my big goal for lunch was to finally understand what was being talked about. Sitting at a table full of eBay and LinkedIn folks turned out to be a good tactic.</p>
<p>The explanation was led by Oliver Ratzesberger, late of eBay* and progenitor of <a href="http://www.dbms2.com/2010/10/06/ebay-followup-greenplum-out-teradata-10-petabytes-hadoop-has-some-value-and-more/">eBay&#8217;s Singularity project</a>. In simplest terms, <strong>one event can spawn a lot of event attribute information, perhaps in the form of name-value pairs,</strong> which it then makes sense to store together in some way. The example Oliver dwelled on was that, on any given web page, there can be 100+ pieces of information to record, including:</p>
<ul>
<li>All 50 search results you were shown, and their positions in the search rankings.</li>
<li>Every ad, image, or graphical element.</li>
<li>An ID as to which test you were participating in (every page you see on eBay has some element being tested).</li>
</ul>
<p><em>*Oliver is leaving eBay for a still-secret large company. I would conjecture that Michael McIntire is on the move too, either to replace Oliver or to go with him, but Oliver did a very good job of not commenting on the matter.</em></p>
<p>There are several reasons why one might wish to store this information in ways that grieve relational purists. First, reconstructing all this information via joins would be brutally expensive. What&#8217;s more, reconstructing all this information via joins could be impractical. Some of it comes from third-party ad servers, which might not reproduce the same ads upon demand. Other information is in the form of rankings, which can&#8217;t always be reliably reproduced from one query to the next. (That&#8217;s just one of several reasons <a href="http://www.dbms2.com/2005/12/09/relational-dbms-versus-text-data/">text search and relational DBMS are an awkward fit</a>.)</p>
<p>Also, there&#8217;s a strong <a href="http://www.dbms2.com/2011/07/31/dynamic-fixed-schema-databases/">dynamic schema</a> flavor to these databases. The list of attributes for one web click might be very different in kind from the list for the next page. Forcing that kind of variability into a fixed relational schema, while theoretically possible, doesn&#8217;t necessarily make a lot of sense.</p>
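<p>Here is a small sketch of the trade-off, using invented events rather than anything from eBay. The field names and the <code>flatten</code> helper are hypothetical; the point is only to show how many name-value rows one nested event turns into, and how the attribute list changes from one page to the next.</p>

```python
# Hypothetical nested events and a flattener that produces (event_id, name, value) rows.
search_page_event = {
    "event_id": 1,
    "page_type": "search",
    "test_id": "T-42",
    "search_results": [{"position": p, "item_id": 1000 + p} for p in range(1, 51)],
}
item_page_event = {
    "event_id": 2,
    "page_type": "item",
    "test_id": "T-7",
    "ads": [{"slot": "top", "ad_id": "ad-9"}],
}


def flatten(event: dict) -> list:
    """Turn one nested event into the name-value rows a normalized layout would hold."""
    rows = []

    def walk(prefix: str, value) -> None:
        if isinstance(value, dict):
            for key, inner in value.items():
                walk(f"{prefix}.{key}" if prefix else key, inner)
        elif isinstance(value, list):
            for index, inner in enumerate(value):
                walk(f"{prefix}[{index}]", inner)
        else:
            rows.append((event["event_id"], prefix, value))

    walk("", {key: value for key, value in event.items() if key != "event_id"})
    return rows


print(len(flatten(search_page_event)), "rows for the search page event")
print(len(flatten(item_page_event)), "rows for the item page event")
print(flatten(item_page_event))
```

<p>Stored nested, each event is one record; stored flat, the search page alone becomes over a hundred rows that would have to be joined back together.</p>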
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/10/19/nested-data-structures/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>The database architecture of salesforce.com, force.com, and database.com</title>
		<link>http://www.dbms2.com/2011/09/15/database-architecture-salesforce-com-force-com-and-database/</link>
		<comments>http://www.dbms2.com/2011/09/15/database-architecture-salesforce-com-force-com-and-database/#comments</comments>
		<pubDate>Thu, 15 Sep 2011 16:09:32 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Memory-centric data management]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Object]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[salesforce.com]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5237</guid>
		<description><![CDATA[salesforce.com, force.com, and database.com use exactly the same database infrastructure and architecture. That&#8217;s the good news. The bad news is that salesforce.com is somewhat obscure about technical details, for reasons such as: A long-ago marketing decision to not give infrastructure details, so as to convey a &#8220;Don&#8217;t worry; we&#8217;ll take care of everything&#8221; message. Even [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.dbms2.com/2011/09/15/salesforce-force-database-data-heroku/">salesforce.com, force.com, and database.com use exactly the same database infrastructure and architecture</a>. That&#8217;s the good news. The bad news is that salesforce.com is somewhat obscure about technical details, for reasons such as:</p>
<ul>
<li>A long-ago marketing decision to not give infrastructure details, so as to convey a &#8220;Don&#8217;t worry; we&#8217;ll take care of everything&#8221; message.</li>
<li>Even so, a long-ago and perhaps now-regretted marketing decision to disclose and even exaggerate salesforce.com&#8217;s reliance on Oracle, as part of an early-days attempt to prove salesforce was using enterprise-class technology.</li>
<li>A desire to hide the recipe for salesforce.com&#8217;s secret sauce.</li>
<li>Force of habit &#8212; I&#8217;m not sure salesforce even knows how to tell its technical story with any clarity.</li>
</ul>
<p>Actually, salesforce.com has moved some kinds of data out of Oracle that used to be stored there. Besides Oracle, salesforce uses at least a file system and a RAM-based data store about which I have no details. Even so, much of salesforce.com&#8217;s data is stored in Oracle &#8212; a single instance of Oracle, which it believes may be the largest instance of Oracle in the world.</p>
<p><span id="more-5237"></span>Salesforce did spell out some of its database story in <a href="http://www.salesforce.com/au/assets/pdf/Force.com_Multitenancy_WP_101508.pdf">a 2008 force.com white paper</a>, which is good stuff, but potentially misleading in one important way. The paper tells of a level of abstraction, whereby what the application sees as logical &#8220;columns&#8221; are stored in a very different schema than one might assume. However, it doesn&#8217;t spell out a second level of abstraction, whereby that logical schema also isn&#8217;t how the database is actually laid out.</p>
<p><em>Another flaw in the paper is that it spins &#8220;We had to do this, to support multitenancy, so we did.&#8221; issues as &#8220;Because we&#8217;re multitenant, we can do this, while single-tenant systems can&#8217;t.&#8221; One example is the query optimization step around &#8220;user visibility&#8221; in Figure 11. Welcome to marketing.</em></p>
<p>At the first level of abstraction, data seems to be kept mainly in a single wide table, with hundreds of columns. What&#8217;s more, many of those are &#8220;flex columns&#8221;; a flex column can hold data of many different kinds and even datatypes. Notwithstanding the second level of abstraction, I imagine the idea of stuffing different kinds of thing into the same column has something to do with the fact that <a href="../../../../../2011/03/13/so-how-many-columns-can-a-single-table-have-anyway/">Oracle&#8217;s physical limit on columns</a> falls far short of the number of logical columns salesforce wants to use.</p>
<p>If we imagine that the different kinds of data in a flex column were each in their own column instead, the whole thing might sound like BigTable/Cassandra/HBase-style column-group NoSQL. Thus, much as <a href="../../../../../2010/08/22/workday-technology-stack/">Workday uses MySQL to simulate a key-value store</a>, salesforce.com can be said to use Oracle to simulate a different kind of NoSQL. In both cases, what&#8217;s going on seems to be a kind of object/relational mapping, but with the relational aspect strongly deemphasized. Or, if you take a more relational view, we could say that salesforce.com&#8217;s tables are a lot wider than any one user organization&#8217;s, because each user sees only its own custom columns (plus the standard ones common to all users).</p>
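<p>One way to picture flex columns is the sketch below. It is not salesforce.com&#8217;s actual schema: the table layout, the per-organization column map, the organization names, and the choice to store every flex value as a string are all assumptions made for illustration.</p>

```python
# Hypothetical wide table with flex slots, plus per-organization column metadata.

# Each organization maps its own logical columns onto shared flex slots, with a type.
COLUMN_MAP = {
    "acme": {"region": ("flex1", str), "seats": ("flex2", int)},
    "globex": {"renewal_date": ("flex1", str), "discount": ("flex2", float)},
}

# One shared physical table; the same flex slot holds different kinds of data per row.
WIDE_TABLE = [
    {"org": "acme", "name": "Deal A", "flex1": "EMEA", "flex2": "25"},
    {"org": "globex", "name": "Deal B", "flex1": "2012-03-01", "flex2": "0.15"},
]


def logical_rows(org: str) -> list:
    """Rebuild the table one organization sees: standard columns plus its custom ones."""
    result = []
    for physical in WIDE_TABLE:
        if physical["org"] != org:
            continue
        row = {"name": physical["name"]}
        for logical_name, (slot, cast) in COLUMN_MAP[org].items():
            row[logical_name] = cast(physical[slot])
        result.append(row)
    return result


print(logical_rows("acme"))
print(logical_rows("globex"))
```

<p>Each organization sees only its own custom columns, while the physical table stays at a fixed width no matter how many logical columns exist across all organizations.</p>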
<p>The second layer of abstraction has a lot to do with multitenancy. If you want to stick data for many different user organizations into the same huge table, then you have to label it in some way to show who is permitted to see or update each part. Logically, this leads to a join, between one table carrying data plus a simple key showing which users/roles are entitled to see it, and a second table showing who actually is that kind of user/has that kind of role. But that join makes a lot of sense to store in a denormalized way, all the more because data is partitioned across the computer cluster in line with which user organization it actually belongs to.</p>
<p><em>Multitenant security isn&#8217;t the only reason for this denormalization, but it appears to be the biggest one.</em></p>
<p>The whole thing is doing 550 million or so transactions per day. salesforce.com thinks that fact should be regarded as evidence that it works. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/09/15/database-architecture-salesforce-com-force-com-and-database/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>&#8220;Big data&#8221; has jumped the shark</title>
		<link>http://www.dbms2.com/2011/09/11/big-data-has-jumped-the-shark/</link>
		<comments>http://www.dbms2.com/2011/09/11/big-data-has-jumped-the-shark/#comments</comments>
		<pubDate>Sun, 11 Sep 2011 13:23:51 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5212</guid>
		<description><![CDATA[I frequently observe that no market categorization is ever precise and, in particular, that bad jargon drives out good. But when it comes to &#8220;big data&#8221; or &#8220;big data analytics&#8221;, matters are worse yet. The definitive shark-jumping moment may be Forrester Research&#8217;s Brian Hopkins&#8217; claim that: &#8230; typical data warehouse appliances, even if they are [...]]]></description>
			<content:encoded><![CDATA[<p>I frequently observe that <a href="http://www.strategicmessaging.com/no-market-categorization-is-ever-precise/2011/03/01/">no market categorization is ever precise</a> and, in particular, that <a href="http://www.strategicmessaging.com/monashs-first-law-of-commercial-semantics-explained/2009/01/09/">bad jargon drives out good</a>. But when it comes to <strong>&#8220;big data&#8221;</strong> or <strong>&#8220;big data analytics&#8221;,</strong> matters are worse yet. The definitive shark-jumping moment may be <a href="http://blogs.forrester.com/brian_hopkins/11-08-29-big_data_brewer_and_a_couple_of_webinars">Forrester Research&#8217;s Brian Hopkins&#8217; claim</a> that:</p>
<blockquote><p>&#8230; typical data warehouse appliances, even if they are petascale and parallel, [are] NOT big data solutions.</p></blockquote>
<p>Nonsense almost as bad can be found in other venues.</p>
<p>Forrester seems to claim that &#8220;big data&#8221; is characterized by Volume, Velocity, Variety, and Variability. Others, less alliteratively inclined, might put Complexity in the mix. So far, so good; after all, much of what people call &#8220;big data&#8221; is collections of disparate data streams, all collected somewhere in a <a href="../../../../../2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a>. But when people start <strong>defining</strong> &#8220;big data&#8221; to include Variety and/or Variability, they&#8217;ve gone too far.</p>
<p><span id="more-5212"></span><em>Up to that point, Hopkins &#8212; while wrong &#8212; is far from alone. The less common part of his error is to further claim that for data to be &#8220;big&#8221;, it must be stored in a way that violates the C in the CAP Theorem. Yes, the bigger the data set, the more likely that each datum has low individual value, with immediate consistency not being strictly necessary. But there are plenty of big data use cases in which data accuracy turns out to be a good idea.</em></p>
<p>It actually is reasonable to say that Volume and Velocity of data go together. If you&#8217;re storing 5 terabytes of data per day, you have a &#8220;big data&#8221; kind of problem, whether you then keep it for 30 days or 3000. It also is reasonable to say that Variety and Variability go together; indeed, I&#8217;d guess that what Forrester means by those terms corresponds to <a href="../../../../../2011/05/17/poly-structured-database/">multi-structured and poly-structured</a> respectively, and using one of those terms is generally plenty.</p>
<p>But while we can whittle four concepts down to two, the reduction should stop there. I say this because any of four combinations is possible (and not just in edge cases):</p>
<ul>
<li><em>Data can be both <strong>big</strong> and <strong>poly-structured.</strong></em> For example, consider the classic Hadoop log-collection use case, or the bigger of MarkLogic&#8217;s databases, or of Splunk&#8217;s, or even the dynamic-schema parts of relational data warehouses built by <a href="../../../../../2011/09/05/zynga-linkedin-data-warehous/">Zynga</a> and <a href="../../../../../2010/10/06/ebay-followup-greenplum-out-teradata-10-petabytes-hadoop-has-some-value-and-more/">eBay</a>. And yes, also consider some of the NoSQL-based <a href="../../../../../2011/03/30/short-request-and-analytic-processing/">short-request</a> systems Hopkins was surely thinking of as well.</li>
<li><em>Data can be both <strong>big</strong> and <strong>simply-structured.</strong></em> I think most of Teradata&#8217;s and <a href="../../../../../2011/06/20/columnar-dbms-vendor-customer-metrics/">Vertica&#8217;s</a> petabyte-scale installations would fit that description, the partial counterexamples at eBay and Zynga notwithstanding.</li>
<li><em>Data can be <strong>not-so-big</strong> and <strong>poly-structured.</strong></em> Consider, for example, a typical user of <a href="../../../../../2010/01/15/intersystems-cache-highlights/">InterSystems Cach&#233;</a>.</li>
<li><em>Data can be <strong>not-so-big</strong> and <strong>simply-structured.</strong></em> Consider, for example, most of the traditional RDBMS world.</li>
</ul>
<p>To pretend that those four possibilities are only two &#8212; &#8220;big data&#8221; and otherwise &#8212; is a travesty.</p>
<p>If the term &#8220;big data&#8221; has become useless, then what? Gartner may have switched over to <strong>&#8220;extreme data,&#8221; </strong><a href="http://www.sand.com/extreme-data/">as reported by my clients at SAND</a>, in honor of the multi-V stuff. That would be an improvement. Better yet would be to stop pretending that a matrix with two dimensions has only one. If what you mean is &#8220;huge, poly-structured databases&#8221;, then that&#8217;s what you should say, or something like it.</p>
<p>If things are bad for &#8220;big data&#8221;, they&#8217;re even worse for &#8220;big data analytics&#8221;, a term that starts out by inheriting all of big data&#8217;s problems and adds more of its own. &#8220;Big data analytics&#8221; surely means &#8220;analytics done on big data&#8221; &#8212; but nobody&#8217;s quite sure what &#8220;analytics&#8221; are. For example:</p>
<ul>
<li>I&#8217;m OK with &#8220;analytic processing&#8221; incorporating all of what might be called business intelligence, visualization (which sometimes now is just the new term for BI), data mining, machine learning, predictive analytics (which for some years has been the term for data mining and machine learning), planning, and yet more. However, &#8230;</li>
<li>&#8230; others don&#8217;t agree, and contrast &#8220;analytics&#8221; to &#8220;OLAP&#8221; and/or to &#8220;visualization&#8221;, and seem to equate &#8220;analytics&#8221; to &#8220;predictive analytics&#8221; or something similar.</li>
<li>The latter is what most people have in mind when they say &#8220;big data analytics&#8221;, but &#8230;</li>
<li>&#8230; vendors who can only lay claim to the &#8220;analytics&#8221; term in its most expansive sense claim to be doing &#8220;big data analytics&#8221; as well.</li>
</ul>
<p><a href="http://soa.sys-con.com/node/1968472">Nonsense even worse than Forrester&#8217;s</a> ensues.</p>
<p>So here&#8217;s what I propose.</p>
<ul>
<li>Nobody should ever again say that &#8220;big data&#8221; doesn&#8217;t include big relational data warehouses.</li>
<li>If your definition of &#8220;big data&#8221; goes beyond Volume and perhaps Velocity to include Variety, Variability, or Complexity &#8212; please call it something else instead. &#8220;Extreme data&#8221; sounds like a snowboarding competition or something, but at least it&#8217;s not as totally erroneous as &#8220;big&#8221;.</li>
<li>Never, ever use the phrase &#8220;big data analytics&#8221; unless you have modifiers near it, to show what kind of big data analytics you&#8217;re talking about, or at least to describe the special value you think you bring to the big data analytics process.</li>
</ul>
<p><em>Edit: <a href="http://twitter.com/#!/merv/status/113078204364890112">Merv Adrian of Gartner Group</a> has a more reasonable &#8212; and wittier! &#8212; take than Forrester&#8217;s:</em></p>
<blockquote><p><em>You won&#8217;t see us telling people &#8220;That&#8217;s not <a title="#bigdata" rel="nofollow" href="http://twitter.com/#%21/search?q=%23bigdata">#<strong>bigdata</strong></a>. This is big data.&#8221; That&#8217;s Crocodile Dundee&#8217;s job.</em></p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/09/11/big-data-has-jumped-the-shark/feed/</wfw:commentRss>
		<slash:comments>30</slash:comments>
		</item>
		<item>
		<title>Derived data, progressive enhancement, and schema evolution</title>
		<link>http://www.dbms2.com/2011/09/06/derived-data-progressive-enhancement-and-schema-evolution/</link>
		<comments>http://www.dbms2.com/2011/09/06/derived-data-progressive-enhancement-and-schema-evolution/#comments</comments>
		<pubDate>Tue, 06 Sep 2011 08:10:23 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[Text]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5177</guid>
		<description><![CDATA[The emphasis I&#8217;m putting on derived data is leading to a variety of questions, especially about how to tease apart several related concepts: Derived data. Many-step processes to produce derived data. Schema evolution. Temporary data constructs. So let&#8217;s dive in.  When I started my discussion of derived data, I focused on five kinds: Aggregates, when [...]]]></description>
			<content:encoded><![CDATA[<p>The emphasis I&#8217;m putting on derived data is leading to a variety of questions, especially about how to tease apart several related concepts:</p>
<ul>
<li>Derived data.</li>
<li>Many-step processes to produce derived data.</li>
<li>Schema evolution.</li>
<li>Temporary data constructs.</li>
</ul>
<p>So let&#8217;s dive in.  <span id="more-5177"></span></p>
<p>When I started <a href="../../../../../2010/11/29/data-that-is-derived-augmented-enhanced-adjusted-or-cooked/">my discussion of derived data</a>, I focused on five kinds:</p>
<blockquote>
<ul>
<li>Aggregates, when they are maintained, generally for reasons of performance or response time.</li>
<li>Calculated scores, commonly based on data mining/predictive analytics.</li>
<li>Text analytics.</li>
<li>The kinds of ETL (Extract/Transform/Load) Hadoop and other forms of MapReduce are commonly used for.</li>
<li>Adjusted data, especially in scientific contexts.</li>
</ul>
</blockquote>
<p>Later I added a sixth kind:</p>
<ul>
<li><a href="../2011/05/30/another-category-of-derived-data/">Derived metadata</a>, commonly for polystructured data sets (logs, text, images, video, whatever).</li>
</ul>
<p>More kinds may yet follow.</p>
<p>In all cases, I was (and am) talking about data that is actually persisted into the database. Temporary tables &#8212; for example the kind frequently created by MicroStrategy &#8212; are also important in data processing, as is <a href="../../../../../2010/08/16/vertica-flash-temp-space/">temp space managed solely for the convenience of the DBMS</a>. But neither is what I mean when I talk about &#8220;derived data.&#8221;</p>
<p>As I noted back in June, <a href="../../../../../2011/06/19/investigative-analytics-derived-data/">derived data naturally leads to schema evolution</a>. You load data into an analytic database. You do some analysis and get some interesting results &#8212; interesting enough for you to want to keep them persistently. So you extend the schema to include them. You do more research; you discover something else interesting; you extend the schema again. Repeat as needed.</p>
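<p>The load/analyze/extend loop just described might look something like this in miniature. It is only a sketch, using Python and SQLite as a stand-in for a real analytic DBMS; the table, the columns, and the &#8220;return rate&#8221; score are all invented for illustration.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Step 1: load raw data into the analytic database (hypothetical table).
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, num_orders INTEGER, num_returns INTEGER)")
cur.executemany("INSERT INTO customers (num_orders, num_returns) VALUES (?, ?)",
                [(10, 1), (4, 2), (20, 0)])

def column_names():
    """Return the current column names of the customers table, in order."""
    return [row[1] for row in cur.execute("PRAGMA table_info(customers)")]

# Step 2: an analysis produces something worth keeping, so the schema is
# extended and the derived value is persisted.
cur.execute("ALTER TABLE customers ADD COLUMN return_rate REAL")
cur.execute("UPDATE customers SET return_rate = 1.0 * num_returns / num_orders")

# Step 3: more research yields a second derived field, derived from the
# first one; the schema is extended again.
cur.execute("ALTER TABLE customers ADD COLUMN risky INTEGER")
cur.execute("UPDATE customers SET risky = "
            "CASE WHEN return_rate BETWEEN 0.25 AND 1.0 THEN 1 ELSE 0 END")
```

<p>Calling <code>column_names()</code> before and after each step shows the schema growing by one column at a time, which is the evolution the paragraph above is talking about.</p>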
<p>However, in no way is derived data the only source of analytic schema evolution. Duh. Sometimes you just have new kinds of information coming in. Of course, once it&#8217;s there, you may want to derive something from it. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  In <a href="../../../../../2010/06/08/profile-of-revealed-preferences/">marketing</a> contexts, both parts of that might be true in spades.</p>
<p>When I mentioned all this to my clients at MarkLogic &#8212; which was my inspiration for the polystructured/metadata example &#8212; they perked up and said &#8220;Oh! Progressive enhancement.&#8221; Indeed, it&#8217;s long been the case that a simple text processing pipeline could have &gt;15 steps of extraction; <a href="http://www.texttechnologies.com/2005/10/19/linkage-among-different-text-technologies/">I learned about the &#8220;tokenization chain&#8221; in 1997</a>. If all the &#8220;progression&#8221; in the data enhancement occurs in a single processing run, that wouldn&#8217;t necessarily spawn much schema evolution. On the other hand, if you think of additional steps to add every now and then, your schema might indeed evolve over time.</p>
<p>Somewhat similarly, <a href="../../../../../2009/10/10/enterprises-using-hadoo/">Hadoop can be used to run &#8220;aggregation pipelines&#8221; of many tens of steps</a>. The output of the whole thing might be a relatively small number of fields. But again, if the number or nature of the fields changes over time, schemas will need to evolve accordingly.</p>
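<p>As a rough sketch of why extra pipeline steps show up as schema changes, consider a toy enhancement pipeline in Python. The step functions and field names are hypothetical, and a real text or Hadoop pipeline would of course have far more stages.</p>

```python
def tokenize(doc):
    """Derive a token list from the raw text."""
    return {**doc, "tokens": doc["text"].lower().split()}

def count_tokens(doc):
    """Derive a simple aggregate from the tokens."""
    return {**doc, "token_count": len(doc["tokens"])}

def flag_mentions(doc):
    """Hypothetical later addition: a step thought of after the first runs."""
    return {**doc, "mentions_hadoop": "hadoop" in doc["tokens"]}

def run(pipeline, doc):
    """Apply each step in order; every step adds derived fields to the record."""
    for step in pipeline:
        doc = step(doc)
    return doc

original = [tokenize, count_tokens]
enhanced = original + [flag_mentions]

doc = {"text": "Hadoop runs aggregation pipelines"}
```

<p>Running <code>run(original, doc)</code> and <code>run(enhanced, doc)</code> gives records with different sets of fields. If those outputs are persisted, the target schema has to grow to match the longer pipeline.</p>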
<p>So to sum up:</p>
<ul>
<li>Derived data &#8212; of multiple kinds &#8212; is very important.</li>
<li>If you want to increase the value you get from derived data, you might need to evolve your schema accordingly.</li>
<li>Data derivation sometimes happens to have long processing pipelines; those pipelines might happen to offer clues as to how to do yet better at data derivation in the future; those improvements might happen to lead to schema evolution over time.</li>
</ul>
<p>&#8220;Just the raw facts&#8221; analytic databases are, for the most part, obsolete.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/09/06/derived-data-progressive-enhancement-and-schema-evolution/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Data management at Zynga and LinkedIn</title>
		<link>http://www.dbms2.com/2011/09/05/zynga-linkedin-data-warehous/</link>
		<comments>http://www.dbms2.com/2011/09/05/zynga-linkedin-data-warehous/#comments</comments>
		<pubDate>Mon, 05 Sep 2011 08:49:04 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Aster Data]]></category>
		<category><![CDATA[Couchbase]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Games and virtual worlds]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Zynga]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5159</guid>
		<description><![CDATA[Mike Driscoll and his Metamarkets colleagues organized a bit of a bash Thursday night. Among the many folks I chatted with were Ken Rudin of Zynga, Sam Shah of LinkedIn, and D. J. Patil, late of LinkedIn. I now know more about analytic data management at Zynga and LinkedIn, plus some bonus stuff on LinkedIn&#8217;s [...]]]></description>
			<content:encoded><![CDATA[<p>Mike Driscoll and his <a href="http://www.metamarketsgroup.com/">Metamarkets</a> colleagues organized a bit of a <a href="http://yfrog.com/h8msmkqj">bash</a> Thursday night. Among the many folks I chatted with were Ken Rudin of Zynga, Sam Shah of LinkedIn, and D. J. Patil, late of LinkedIn. I now know more about analytic data management at Zynga and LinkedIn, plus some bonus stuff on LinkedIn&#8217;s People You May Know application. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>It&#8217;s blindingly obvious that Zynga is one of <a href="../../../../../2011/06/20/columnar-dbms-vendor-customer-metrics/">Vertica&#8217;s petabyte-scale customers</a>, given that Zynga sends 5 TB/day of data into Vertica, and keeps that data for about a year. (Zynga may retain even more data going forward; in particular, Zynga regrets ever having thrown out the first month of data for any game it&#8217;s tried to launch.) This is game actions, for the most part, rather than log files; true logs generally go into Splunk.</p>
<p><em>I don&#8217;t know whether the missing data is completely thrown away, or just stashed on inaccessible tapes somewhere.</em></p>
<p>I found two aspects of the Zynga story particularly interesting. First, those 5 TB/day are going straight into Vertica (from, I presume, <a href="http://www.dbms2.com/2010/08/18/nosql-hvsp-adoption/">memcached/Membase/Couchbase</a>), as Zynga decided that sending the data to some kind of log first was more trouble than it&#8217;s worth. Second, there&#8217;s Zynga&#8217;s approach to analytic database design. Highlights of that include: <span id="more-5159"></span></p>
<ul>
<li>Data is divided into two parts. One part has a pretty ordinary schema; the other is just stored as a huge list of name-value pairs. (This is much like <a href="../../../../../2010/10/06/ebay-followup-greenplum-out-teradata-10-petabytes-hadoop-has-some-value-and-more/">eBay</a>&#8217;s approach with its Teradata-based Singularity, except that eBay puts the name-value pairs into long character strings.) About half the data is in each part, but I don&#8217;t think that&#8217;s by deliberate choice.</li>
<li>Zynga adds data into the real schema when it&#8217;s clear it will be needed for a while. This isn&#8217;t a matter of query volumes, for the most part; rather, it&#8217;s when Zynga&#8217;s tests (e.g. of new games?) have determined that the data will keep being collected and used for a while.</li>
<li>Zynga only adds columns to its analytic database; it never goes through the more complex process of deleting them.</li>
</ul>
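<p>The design in the list above can be sketched as a fixed-schema table sitting next to a name-value table, with attributes promoted into real columns once they have proven durable. This is an illustration in Python and SQLite, not Zynga&#8217;s Vertica schema; the table names, the <code>record</code> and <code>promote</code> helpers, and the choice to delete promoted pairs from the name-value side are all assumptions made for the example.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Part one: a pretty ordinary schema. Part two: a long list of name-value pairs.
cur.execute("CREATE TABLE events (event_id INTEGER PRIMARY KEY, player TEXT, action TEXT)")
cur.execute("CREATE TABLE event_attrs (event_id INTEGER, name TEXT, value TEXT)")

def record(player, action, **attrs):
    """Store the fixed fields in events and everything else as name-value pairs."""
    cur.execute("INSERT INTO events (player, action) VALUES (?, ?)", (player, action))
    event_id = cur.lastrowid
    cur.executemany("INSERT INTO event_attrs VALUES (?, ?, ?)",
                    [(event_id, name, str(value)) for name, value in attrs.items()])
    return event_id

def promote(name):
    """Move one attribute into the real schema. Columns are only ever added."""
    if not name.isidentifier():
        raise ValueError("unsafe column name")
    cur.execute(f"ALTER TABLE events ADD COLUMN {name} TEXT")
    cur.execute(f"UPDATE events SET {name} = (SELECT value FROM event_attrs a "
                "WHERE a.event_id = events.event_id AND a.name = ?)", (name,))
    # Assumption: promoted pairs are dropped from the name-value side.
    cur.execute("DELETE FROM event_attrs WHERE name = ?", (name,))

record("p1", "plant", crop="corn", plot=7)
record("p2", "harvest", crop="wheat")
promote("crop")
```

<p>After <code>promote("crop")</code>, <code>crop</code> is an ordinary column on <code>events</code>, while <code>plot</code> stays in the name-value list until somebody decides it will be needed for a while.</p>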
<p>Just as Zynga is one of Vertica&#8217;s flagship accounts, LinkedIn is one of Aster Data&#8217;s. Specifically, before leaving LinkedIn for Aster, Jonathan Goldman built LinkedIn&#8217;s People You May Know feature in Aster nCluster. This was long ago, and I&#8217;m not sure how sophisticated his use of <a href="../../../../../2009/03/07/three-greenplum-customers-applications-of-mapreduce/">SQL and MapReduce</a> would be in today&#8217;s terms; for example, I was told he didn&#8217;t use &#8220;nPath or anything like that.&#8221; <em>(Edit: See the comments below for clarifications from Jonathan.) </em>Anyhow, LinkedIn has replaced Aster for PYMK with Hadoop, and in my opinion is getting much better results.</p>
<p>That, from an Aster standpoint, is the bad news. The good news is that LinkedIn is happily using Aster nCluster for several other applications; LinkedIn folks don&#8217;t seem to regret throwing out* Greenplum for Aster; and they also seem to have a very high opinion of Jonathan and his work while he was there.</p>
<p><em>*And <a href="http://www.dbms2.com/2010/10/06/ebay-followup-greenplum-out-teradata-10-petabytes-hadoop-has-some-value-and-more/">this time</a> that is indeed the phrase that was used. <img src='http://www.dbms2.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </em></p>
<p>One thing that astonished me is that LinkedIn PYMK is based only on data innate to LinkedIn (as opposed to imported email addresses, the results of web crawls, and so on). Given that, I am at a loss to explain how it suggested a couple of old friends, to whom I have no discernible chain of connection. Yes, we were at Harvard at the same time, but if that&#8217;s all it was, there would be a huge number of false positives I&#8217;m not actually seeing.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/09/05/zynga-linkedin-data-warehous/feed/</wfw:commentRss>
		<slash:comments>24</slash:comments>
		</item>
		<item>
		<title>Terminology: Dynamic- vs. fixed-schema databases</title>
		<link>http://www.dbms2.com/2011/07/31/dynamic-fixed-schema-databases/</link>
		<comments>http://www.dbms2.com/2011/07/31/dynamic-fixed-schema-databases/#comments</comments>
		<pubDate>Sun, 31 Jul 2011 23:02:56 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Object]]></category>
		<category><![CDATA[Structured documents]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5045</guid>
		<description><![CDATA[E. F. &#8220;Ted&#8221; Codd taught the computing world that databases should have fixed logical schemas (which protect the user from having to know about physical database organization).  But he may not have been as universally correct as he thought. Cases I&#8217;ve noted in which fixed schemas may be problematic include: &#8220;A bunch of apps in [...]]]></description>
			<content:encoded><![CDATA[<p>E. F. &#8220;Ted&#8221; Codd taught the computing world that <a href="http://www.dbms2.com/2011/07/31/the-ted-codd-guarantee/">databases should have fixed logical schemas</a> (which protect the user from having to know about physical database organization).  But he may not have been as universally correct as he thought. Cases I&#8217;ve noted in which fixed schemas may be problematic include:</p>
<ul>
<li>&#8220;A bunch of apps in one, similar but not the same&#8221; (in <a href="../../../../../2011/07/27/mongodb-users-and-use-cases/">my recent post on MongoDB</a>).</li>
<li>Out-of-control product catalogs (ditto).</li>
<li><a href="../../../../../2011/06/19/investigative-analytics-derived-data/">Analytic use cases in which one keeps enhancing the database with derived data</a>.</li>
</ul>
<p>And <a href="../../../../../2010/06/08/profile-of-revealed-preferences/">if marketing profile analysis is ever done correctly</a>, that will be a huge example for the list.</p>
<p>So what do we call those DBMS &#8212; for example NoSQL, object-oriented, or XML-based systems &#8212; that bake the schema into the applications or the records themselves? In the MongoDB post I went with &#8220;schemaless,&#8221; but I wasn&#8217;t really comfortable with that, so I took the discussion to Twitter. Comments from <a href="http://twitter.com/#%21/vldid/status/96271464310898688">Vlad Didenko</a> (in particular), <a href="http://twitter.com/#%21/ryanprociuk/status/96289631234035712">Ryan Prociuk</a>, <a href="http://twitter.com/#!/merv/status/96283658951995392">Merv Adrian</a>, and <a href="http://twitter.com/#%21/rolandbouman/status/96297629369106432">Roland Bouman</a> favored the idea that <strong>schemas in such systems are changeable or late-bound, rather than entirely absent.</strong> I quickly agreed.</p>
<p><em><span id="more-5045"></span>The discussion wasn&#8217;t entirely serious; wise-ass comments were contributed by at least <a href="http://twitter.com/#%21/merv/status/96236381382254592">Merv</a>, <a href="http://twitter.com/#%21/NeilRaden/status/96233519637999617">Neil Raden</a>, <a href="http://twitter.com/#%21/hakmem/status/96229674849533952">Yiorgos Adamopoulos</a>, and <a href="http://twitter.com/#%21/CurtMonash/status/96287448434360320">myself</a>.</em></p>
<p>I like that approach for the same reason I favor saying that databases are <a href="../../../../../2011/05/17/poly-structured-database/">poly- or multi-structured</a> (rather than un- or semi-):  <strong>Every database has structure, the only question being <em>when</em> that structure is determined.</strong> I wouldn&#8217;t precisely equate &#8220;poly-structured&#8221; to &#8220;has a late-bound schema&#8221;; for example, I&#8217;d say that mucking with the DDL (Data Description Language) of a relational database shows that it&#8217;s a little bit poly-structured, even though it&#8217;s not at all late-bound. But the concepts are definitely related.</p>
<p>So what actual wording should we use here? The only alternative I see to <strong>fixed schema</strong> is &#8220;static&#8221;, and that feels like it has <em>too</em> much of a connotation of &#8220;unchangeable&#8221;. The simplest word I can think of for changeable/late-bound/whatever is <strong>dynamic schema; </strong>that choice also has the virtue of some traction, as per the Vlad Didenko tweet linked above. Casual googling is also supportive of &#8220;fixed&#8221; and &#8220;dynamic&#8221;, at least over whatever alternatives I came up with. So those are my choices.</p>
<p>For actual definitions, I&#8217;ll say:</p>
<ul>
<li><strong>A (logical) schema is fixed </strong>if it is <strong>defined before a program is written, </strong>but<strong> dynamic </strong>if it is <strong>defined by the program or data itself.</strong></li>
<li><strong>A database is fixed- or dynamic-schema</strong> depending on whether its schemas are fixed or dynamic respectively.</li>
<li><strong>A DBMS is fixed- or dynamic-schema</strong> depending on whether databases created in it tend to have fixed or dynamic schemas respectively.</li>
</ul>
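<p>To make the distinction concrete, here is a small Python sketch. It uses SQLite for the fixed-schema side and plain JSON records for the dynamic-schema side; the product fields and both helper functions are hypothetical, chosen only to illustrate the definitions above.</p>

```python
import json
import sqlite3

# Fixed schema: the structure is declared up front, before any program
# writes data, and can be read back from the catalog.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (sku TEXT, price REAL)")
conn.execute("INSERT INTO products VALUES (?, ?)", ("A-1", 9.5))

def fixed_schema():
    """Return the declared column names, independent of the rows stored."""
    return [row[1] for row in conn.execute("PRAGMA table_info(products)")]

# Dynamic schema: each record carries its own field names, so the structure
# is only known once the data itself has been read.
documents = [
    json.loads('{"sku": "A-1", "price": 9.5}'),
    json.loads('{"sku": "B-2", "price": 3.0, "color": "red"}'),
]

def dynamic_schema(docs):
    """Return the field names observed in the records, in first-seen order."""
    fields = []
    for doc in docs:
        for key in doc:
            if key not in fields:
                fields.append(key)
    return fields
```

<p>Note that <code>fixed_schema()</code> gives the same answer whatever rows are stored, whereas <code>dynamic_schema(documents)</code> changes as soon as a record with a new field shows up. In both cases there is a structure; what differs is when it gets determined.</p>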
<p>Suit yourself as to what you say about relational DBMS when they also have a bit of XML, text, or whatever support.</p>
<p>By these definitions:</p>
<ul>
<li>Relational databases are fixed-schema (within the caveat above about XML or text data).</li>
<li>MOLAP databases are fixed-schema.</li>
<li>Pre-relational network and hierarchical DBMS (e.g. IMS) are fixed-schema.</li>
<li>Most other DBMS are dynamic-schema.</li>
</ul>
<p><em>What do you think? Do these definitions work for you?</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/31/dynamic-fixed-schema-databases/feed/</wfw:commentRss>
		<slash:comments>16</slash:comments>
		</item>
	</channel>
</rss>

