<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; Facebook</title>
	<atom:link href="http://www.dbms2.com/category/users/facebook-cassandra/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Thu, 09 Feb 2012 01:51:16 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>Big data terminology and positioning</title>
		<link>http://www.dbms2.com/2012/01/08/big-data-terminology-and-positioning/</link>
		<comments>http://www.dbms2.com/2012/01/08/big-data-terminology-and-positioning/#comments</comments>
		<pubDate>Mon, 09 Jan 2012 01:35:57 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Exadata]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MarkLogic]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Oracle]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5768</guid>
		<description><![CDATA[Recently, I observed that Big Data terminology is seriously broken. It is reasonable to reduce the subject to two quasi-dimensions: Bigness &#8212; Volume, Velocity, size Structure &#8212; Variety, Variability, Complexity given that High-velocity &#8220;big data&#8221; problems are usually high-volume as well.* Variety, variability, and complexity all relate to the simply-structured/poly-structured distinction. But the conflation should [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, I observed that <a href="../../../../../2011/09/11/big-data-has-jumped-the-shark/">Big Data terminology is seriously broken</a>. It is reasonable to reduce the subject to two quasi-dimensions:</p>
<ul>
<li><strong>Bigness</strong> &#8212; Volume, Velocity, Size</li>
<li><strong>Structure</strong> &#8212; Variety, Variability, Complexity</li>
</ul>
<p>given that</p>
<ul>
<li>High-velocity &#8220;big data&#8221; problems are usually high-volume as well.*</li>
<li>Variety, variability, and complexity all relate to the <a href="../../../../../2011/05/17/poly-structured-database/">simply-structured/poly-structured</a> distinction.</li>
</ul>
<p>But the conflation should stop there.</p>
<p><em>*Low-volume/high-velocity problems are commonly referred to as <a href="../2011/08/25/renaming-cep-or-not/">&#8220;event processing&#8221; and/or &#8220;streaming&#8221;</a>.</em></p>
<p>When people claim that bigness and structure are the same issue, they oversimplify into mush. So I think we need four pieces of terminology, reflective of a 2&#215;2 matrix of possibilities. For want of better alternatives, my suggestions are:</p>
<ul>
<li><strong>Relational big data</strong> is data of high volume that fits well into a relational DBMS.</li>
<li><strong>Multi-structured big data</strong> is data of high volume that doesn&#8217;t fit well into a relational DBMS. <em>Alternative: Poly-structured big data.</em></li>
<li><strong>Conventional relational data</strong> is data of not-so-high volume that fits well into a relational DBMS. <em>Alternatives: Ordinary/normal/smaller relational data.</em></li>
<li><strong>Smaller poly-structured data</strong> is data for which <a href="../../../../../2011/07/31/dynamic-fixed-schema-databases/">dynamic schema</a> capabilities are important, but which doesn&#8217;t rise to &#8220;big data&#8221; volume.</li>
</ul>
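<p>As a rough illustration of the 2&#215;2 matrix, here is a minimal Python sketch. The <code>Dataset</code> fields, the <code>classify</code> function, and especially the 10-terabyte cutoff are assumptions made for the example; nothing above puts a number on &#8220;high volume&#8221;.</p>
<pre>
```python
from dataclasses import dataclass

# Hypothetical cutoff: the text gives no number for "high volume".
BIG_VOLUME_TB = 10.0


@dataclass
class Dataset:
    name: str
    volume_tb: float
    fits_relational: bool  # does the data fit well into a relational DBMS?


def classify(ds: Dataset, big_volume_tb: float = BIG_VOLUME_TB) -> str:
    """Place a dataset in one cell of the bigness/structure matrix."""
    is_big = ds.volume_tb >= big_volume_tb
    if is_big and ds.fits_relational:
        return "relational big data"
    if is_big:
        return "multi-structured big data"
    if ds.fits_relational:
        return "conventional relational data"
    return "smaller poly-structured data"


# Made-up datasets, one per cell of the matrix.
examples = [
    Dataset("sales fact table", 50.0, True),
    Dataset("web server logs", 200.0, False),
    Dataset("departmental CRM", 0.2, True),
    Dataset("prototype document store", 0.05, False),
]
for ds in examples:
    print(ds.name, "->", classify(ds))
```
</pre>
<p>The two tests are kept independent on purpose: volume and relational fit are separate questions, which is the whole point of using four terms rather than one.</p>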
<p><span id="more-5768"></span>Notes on all this include:</p>
<ul>
<li>&#8220;Relational big data&#8221; is commonly what you need a scalable analytic relational DBMS for. But there are non-analytic use cases as well.</li>
<li>The paradigmatic example of &#8220;multi-structured big data&#8221; is log files. Thus, multi-structured big data is commonly what you need a <a href="../../../../../2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a> for.</li>
<li>One might want to equate non-analytic relational big data technology to &#8220;NewSQL&#8221;. However, I&#8217;m struggling to think of a database size range in which the entire NewSQL industry can match Oracle&#8217;s market share alone.</li>
<li>One might want to equate non-analytic multi-structured big data technology to &#8220;NoSQL&#8221;. However:
<ul>
<li>&#8220;NoSQL&#8221; is also used to encompass not-so-big-data use cases, such as prototyping in MongoDB.</li>
<li><a href="../../../../../2011/10/02/defining-nosql/">&#8220;NoSQL&#8221; has non-ACID/low(er)-data-integrity connotations</a> that aren&#8217;t appropriate for all non-relational systems.</li>
</ul>
</li>
<li>Up to a point, you can analyze relational big data in a conventional relational DBMS, but an analytic RDBMS will usually win on TCO (Total Cost of Ownership). In particular, reasonable thresholds for moving an analytic database off Oracle might be:
<ul>
<li>1-2 terabytes if you&#8217;ve never bought anything past Oracle Standard Edition.</li>
<li>5-10 terabytes if you&#8217;re already paying for Oracle Enterprise Edition.</li>
<li>A lot higher than that if you actually find Oracle Exadata to be cost-effective.</li>
</ul>
</li>
<li>Depending on how big one acknowledges as &#8220;big&#8221;, the market share leader in &#8220;big bit bucket&#8221; use cases is either Splunk or Hadoop.</li>
<li>If we look at multi-structured big data management overall, MarkLogic joins the list of market share contenders, as do various NoSQL alternatives.</li>
<li>It is wrong to say that the large web companies invented &#8220;big data&#8221; technology. But it is more reasonable to say they invented much of &#8220;multi-structured big data&#8221; management. In particular (and this is just a partial list), Google, Amazon, Yahoo, Facebook, et al. can reasonably be credited with Hadoop, Cassandra, HBase and various predecessors to same.</li>
</ul>
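<p>The Oracle thresholds in the notes above can be written down as a small lookup. This is only a sketch: the edition keys, the <code>migration_signal</code> function, and the result labels are hypothetical, and the Exadata entry is left empty because no figure is given for it.</p>
<pre>
```python
from typing import Optional, Tuple

# Ranges in terabytes, taken from the notes above. No figure is given for
# Exadata ("a lot higher"), so that entry is deliberately left empty.
THRESHOLDS_TB = {
    "standard": (1.0, 2.0),
    "enterprise": (5.0, 10.0),
    "exadata": None,
}


def migration_signal(edition: str, size_tb: float) -> str:
    """Compare an analytic database size with the stated threshold range."""
    bounds: Optional[Tuple[float, float]] = THRESHOLDS_TB[edition]
    if bounds is None:
        return "no stated threshold"
    low, high = bounds
    if size_tb >= high:
        return "above range"
    if size_tb >= low:
        return "within range"
    return "below range"


print(migration_signal("standard", 3.0))
print(migration_signal("enterprise", 6.0))
print(migration_signal("exadata", 50.0))
```
</pre>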
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2012/01/08/big-data-terminology-and-positioning/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>DataStax pivots back to its original strategy</title>
		<link>http://www.dbms2.com/2011/09/22/datastax-pivots-back-to-its-original-strategy/</link>
		<comments>http://www.dbms2.com/2011/09/22/datastax-pivots-back-to-its-original-strategy/#comments</comments>
		<pubDate>Thu, 22 Sep 2011 23:23:12 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[DataStax]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Open source]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5331</guid>
		<description><![CDATA[The DataStax and Cassandra stories are somewhat confusing. Unfortunately, DataStax chose to clarify them in what has turned out to be a crazy news week. I&#8217;m going to use this post just to report on the status of the DataStax product line, without going into any analysis beyond that. Pro tip: If you choose to [...]]]></description>
			<content:encoded><![CDATA[<p>The DataStax and Cassandra stories are somewhat confusing. Unfortunately, DataStax chose to clarify them in what has turned out to be a crazy news week. I&#8217;m going to use this post just to report on the status of the DataStax product line, without going into any analysis beyond that.</p>
<p><span id="more-5331"></span><em>Pro tip: If you choose to announce at a conference where many other vendors will surely announce news also, you naturally run the risk of not garnering much attention.</em></p>
<p>For starters, it may help to realize or recall that:</p>
<ul>
<li><a href="http://www.dbms2.com/2008/07/21/project-cassandra-facebook-open-sourced-quasi-dbms/">Cassandra was originally developed and revealed at Facebook</a>, to much early NoSQL fanfare. Facebook later backed away from Cassandra use.</li>
<li>Rackspace guys in Texas became Cassandra&#8217;s biggest backers. They eventually founded a company called Riptano to commercialize Cassandra.</li>
<li>Texas company Riptano became the California company DataStax.</li>
<li>DataStax came out with a <a href="http://www.dbms2.com/2011/03/23/datastax-cassandrafs-hadoop-brisk/">Hadoop-on-Cassandra offering called Brisk</a>. For a while, it sounded as if Hadoop was as big a focus for DataStax as Cassandra is.</li>
<li>DataStax is now recommitted to being <strong>the Cassandra company,</strong> and has accordingly backed away from Hadoop and Brisk as a separate or coequal focus. However, it sees Hadoop capability as a nice, or even major, feature of its Cassandra-centric offering.</li>
<li>To finalize its open source obligations with respect to Brisk, DataStax is in essence:
<ul>
<li>Donating a Hive driver for Cassandra straight into the main Apache Cassandra project.</li>
<li>Releasing the rest of Brisk as a separate open source project.</li>
<li>Disclaiming interest in further advancing open source Brisk.</li>
</ul>
</li>
<li>There&#8217;s also something called Solandra &#8212; evidently SOLR-on-Cassandra &#8212; whose status is similar to Brisk&#8217;s.</li>
<li>There are three main ways that DataStax helps you to consume Cassandra.
<ul>
<li>DataStax is the principal sponsor of Apache Cassandra development, and presumably long will be.<strong> Apache Cassandra </strong>is both<strong> free-like-speech and free-like-beer.</strong></li>
<li>DataStax is also introducing a paid-subscription version of Cassandra called DataStax Enterprise, which features proprietary code, support, and so on. <strong>DataStax Enterprise </strong>is <strong>neither free-like-speech nor free-like-beer.</strong></li>
<li>There will also be something called DataStax Community Edition. <strong>DataStax Community Edition </strong>is<strong> free-like-beer, </strong>but<strong> not free-like-speech.</strong></li>
</ul>
</li>
</ul>
<p>Various posts on the <a href="http://www.datastax.com/blog">DataStax blog</a> give DataStax&#8217;s explanation of what it&#8217;s doing. Ben Werther, the ex-Greenplum guy who briefly worked at DataStax and was most associated with telling the Hadoop/Brisk story, has moved on to his own startup Platfora.</p>
<p>DataStax Enterprise has three main aspects:</p>
<ul>
<li><strong>DataStax Server,</strong> which is the actual database and analytics code. At this time, there is little closed-source code in DataStax Server, but DataStax reserves the right to widen that gap in the future.</li>
<li><strong>DataStax OpsCenter,</strong> which is management tools around DataStax Server. DataStax OpsCenter is entirely closed-source, even though DataStax gives a limited version away for free.</li>
<li><strong>Support.</strong></li>
</ul>
<p>To describe DataStax Community Edition, I&#8217;ll just quote the press release verbatim, which characterizes it as:</p>
<blockquote><p>&#8230; a free platform based on Apache Cassandra that bundles the open source database with smart installers, drivers and connectors for popular development languages, demo apps, documentation, and a free version of DataStax OpsCenter for Apache Cassandra.</p></blockquote>
<p>DataStax Community Edition is crippleware only in terms of feature set; there are no limitations on its database size, cluster size, or usage rights. A core mission of DataStax Community Edition is to create happy Cassandra users, who may then become customers for DataStax Enterprise.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/09/22/datastax-pivots-back-to-its-original-strategy/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Are there any remaining reasons to put new OLTP applications on disk?</title>
		<link>http://www.dbms2.com/2011/09/19/oltp-disk-solid-state/</link>
		<comments>http://www.dbms2.com/2011/09/19/oltp-disk-solid-state/#comments</comments>
		<pubDate>Mon, 19 Sep 2011 18:07:07 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Clustering]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Infobright]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Parallelization]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[dbShards and CodeFutures]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5257</guid>
		<description><![CDATA[Once again, I&#8217;m working with an OLTP SaaS vendor client on the architecture for their next-generation system. Parameters include: 100s of gigabytes of data at first, growing to &#62;1 terabyte over time. High peak loads. Public cloud portability (but they have private data centers they can use today). Simple database design &#8212; not a lot [...]]]></description>
			<content:encoded><![CDATA[<p>Once again, I&#8217;m working with an OLTP SaaS vendor client on the architecture for their next-generation system. Parameters include:</p>
<ul>
<li>100s of gigabytes of data at first, growing to &gt;1 terabyte over time.</li>
<li>High peak loads.</li>
<li>Public cloud portability (but they have <strong>private data centers they can use today).</strong></li>
<li>Simple database design &#8212; not a lot of tables, not a lot of columns, not a lot of joins, and everything can be distributed on the same customer_ID key.</li>
<li>Stream the data to a data warehouse that will grow to a few terabytes. (Keeping only one year of OLTP data online actually makes sense in this application, but of course everything should go into the DW.)</li>
</ul>
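<p>Because everything can be distributed on the same customer_ID key, shard routing can be very simple. The following Python sketch shows generic hash-based routing; it is not a description of how any particular product does it, and the shard count, the <code>shard_for</code> function, and the sample IDs are assumptions made for the example.</p>
<pre>
```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count


def shard_for(customer_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a customer_ID to a shard number with a stable hash."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Every row for one customer lands on the same shard, so single-customer
# transactions and joins never cross shard boundaries.
orders = [("cust-17", "order-1"), ("cust-42", "order-2"), ("cust-17", "order-3")]
placement = {}
for customer_id, order_id in orders:
    placement.setdefault(shard_for(customer_id), []).append(order_id)
print(placement)
```
</pre>
<p>A plain modulo scheme like this forces data movement whenever the shard count changes, which is one reason to let a sharding layer handle the mapping rather than hard-coding it in the application.</p>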
<p>So I&#8217;m leaning toward saying: <span id="more-5257"></span></p>
<ul>
<li>They should go with a scalable, MySQL-based solution.
<ul>
<li>Lots of third-party software works with MySQL, in case that&#8217;s helpful.</li>
<li>Yes, any one vendor is small and not yet firmly established, but there are numerous vendors around with interesting MySQL scaling stories.</li>
<li>In a vendor emergency, just going with Oracle&#8217;s MySQL stuff would probably work &#8230;</li>
<li>&#8230; especially because there are these lovely things in the world called <strong>solid-state drives.</strong></li>
<li>There&#8217;s also good escapability if one wants to move away from MySQL, because everybody knows how to handle MySQL data.</li>
</ul>
</li>
<li>The first product to look at is dbShards, because it meets all the topology needs:
<ul>
<li>Local scale-out (<a href="http://www.dbms2.com/2011/02/24/transparent-sharding/">transparent sharding</a>).</li>
<li><a href="http://www.dbms2.com/2011/02/09/clarification-on-dbshards-shard-replication/">Local high availability</a>.</li>
<li>Remote disaster recovery (details of that are underway).</li>
</ul>
</li>
<li>The first analytic DBMS to look at is Infobright.
<ul>
<li>Yes, I know Infobright is focused more on machine-generated data these days, but this client&#8217;s analytic needs are so straightforward Infobright should pass with flying colors.</li>
<li>The MySQL-to-MySQL aspect should make ETL dead simple.</li>
<li>Again, there&#8217;s escapability.</li>
</ul>
</li>
</ul>
<p>Mainly, this is all fine. But I&#8217;m getting pushback on the solid-state aspect, for fear that it will compromise public cloud portability.</p>
<p>Am I missing something here? As far as I&#8217;m concerned, <strong>if you&#8217;re planning an OLTP system with a many-year lifespan today, </strong>of course <strong>you should assume solid-state storage.</strong> Maybe you scale out just as far as you would with disk, striping indexes or entire databases across the RAM of multiple servers. In that case, having solid-state backing reduces the risk of bottlenecks. Maybe you don&#8217;t scale out as far as you would with disk. In that case, solid-state backing saves you money.</p>
<p><strong>As for public-cloud support for solid-state storage, that&#8217;s coming fast, right? </strong>(Actually, I have data points in support of that theory, but they&#8217;re a bit tenuous.) A large fraction of web businesses with private data centers seem to be using solid-state storage &#8212; from Facebook on down &#8212; or so the NoSQL/NewSQL/<a href="http://www.dbms2.com/2011/03/02/short-request-processing/">short-request</a> DBMS guys tell me. Surely a number of public cloud vendors are close behind.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/09/19/oltp-disk-solid-state/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>HBase is not broken</title>
		<link>http://www.dbms2.com/2011/07/18/hbase-is-not-broken/</link>
		<comments>http://www.dbms2.com/2011/07/18/hbase-is-not-broken/#comments</comments>
		<pubDate>Mon, 18 Jul 2011 05:25:27 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4990</guid>
		<description><![CDATA[It turns out that my impression that HBase is broken was unfounded, in at least two ways. The smaller is that something wrong with the HBase/Hadoop interface or Hadoop&#8217;s HBase support cannot necessarily be said to be wrong with HBase (especially since HBase is no longer a Hadoop subproject). The bigger reason is that, according [...]]]></description>
			<content:encoded><![CDATA[<p>It turns out that my impression that <a href="http://www.dbms2.com/2011/07/10/hadoop-futures-and-enhancements/">HBase is broken</a> was unfounded, in at least two ways. The smaller is that something wrong with the HBase/Hadoop interface or Hadoop&#8217;s HBase support cannot necessarily be said to be wrong with HBase (especially since HBase is no longer a Hadoop subproject). The bigger reason is that, according to consensus, <strong>HBase has worked pretty well since the 0.90 release</strong> in January of this year.</p>
<p>After Michael Stack of StumbleUpon beat me up for a while,* Omer Trajman of Cloudera was kind enough to walk me through HBase usage. He is informed largely by 18 Cloudera customers using HBase, plus a handful of other well-known HBase users such as Facebook, StumbleUpon, and Yahoo. Of the 18 Cloudera customers Omer was thinking of, 15 are in HBase production, one is in HBase &#8220;early production&#8221;, one is still doing R&amp;D in the area of HBase, and one is a classified government customer not providing such details.<span id="more-4990"></span></p>
<p><em>*Just kidding &#8212; he was actually extremely gentle.</em></p>
<p>In the use cases that Omer offered, what&#8217;s stored in HBase is almost always <strong>records of web or network activity. </strong>Specific examples included clickstream information (at 5 different ad companies), crash reports (at Mozilla), and messages (at Facebook). Sometimes the data gets into Hadoop twice &#8212; once excerpted via HBase and once as part of a full log &#8212; and may even live in two different Hadoop clusters.</p>
<p>What&#8217;s served out from HBase in Omer&#8217;s examples is usually <a href="../../../../../2011/06/19/investigative-analytics-derived-data/">derived data</a>, such as a user profile, an ad selection, a text index, etc. That makes sense, not least because if you&#8217;re going to keep enhancing your data, schema-free programming &#8212; which HBase offers &#8212; looks ever more appealing. Omer further said that there are a growing number of cases in which HBase is being used to serve up reference data for batch MapReduce jobs, but he didn&#8217;t have specifics. A counterexample to the derived data emphasis would be, if I understood correctly, a case where HBase manages shopping carts.</p>
<p>I haven&#8217;t put much effort into unearthing open source or other third-party HBase-based projects, but two examples are OpenTSDB (Time Series DataBase) and Lily CMS (Content Management System). <em>(Edit: But see the comment about Lily below.)</em></p>
<p>Omer is perhaps my top go-to guy on <a href="../../../../../2011/07/06/petabyte-hadoop-clusters/">database and cluster sizes</a>, so of course I asked him for HBase metrics as well. He responded (approximately) that Cloudera HBase customer installations average 20-30 nodes, but that half a dozen are in the 100-200 node range.</p>
<p>Finally, there&#8217;s the matter of latency. As a general rule, the HBase users Omer sees are using HBase with at least several minutes of latency. (Again, that shopping cart case would seem to be a counterexample.) So, for example, the data recorded when you click on a page isn&#8217;t immediately applied toward tweaking your profile to determine which ad you&#8217;ll see next &#8212; but it might come into play after you spend a few minutes reading the page you&#8217;re on. Naturally, Omer knows of efforts to use HBase with lower latency yet, and I won&#8217;t be surprised if already-working examples of low-latency HBase show up in the comment thread to this post.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/18/hbase-is-not-broken/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Soundbites: the Facebook/MySQL/NoSQL/VoltDB/Stonebraker flap, continued</title>
		<link>http://www.dbms2.com/2011/07/15/facebook-mysql-nosql-voltdb-stonebraker/</link>
		<comments>http://www.dbms2.com/2011/07/15/facebook-mysql-nosql-voltdb-stonebraker/#comments</comments>
		<pubDate>Fri, 15 Jul 2011 08:27:18 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Akiban]]></category>
		<category><![CDATA[Cache]]></category>
		<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Clustrix]]></category>
		<category><![CDATA[Couchbase]]></category>
		<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[In-memory DBMS]]></category>
		<category><![CDATA[Michael Stonebraker]]></category>
		<category><![CDATA[MongoDB and 10gen]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[ScaleBase]]></category>
		<category><![CDATA[ScaleDB]]></category>
		<category><![CDATA[Schooner Information Technology]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[Tokutek]]></category>
		<category><![CDATA[VoltDB and H-Store]]></category>
		<category><![CDATA[dbShards and CodeFutures]]></category>
		<category><![CDATA[memcached]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4977</guid>
		<description><![CDATA[As a follow-up to the latest Stonebraker kerfuffle, Derrick Harris asked me a bunch of smart followup questions. My responses and afterthoughts include: Facebook et al. are in effect Software as a Service (SaaS) vendors, not enterprise technology users. In particular: They have the technical chops to rewrite their code as needed. Unlike packaged software [...]]]></description>
			<content:encoded><![CDATA[<p>As a follow-up to the latest <a href="http://www.dbms2.com/2011/07/14/an-odd-claim-attributed-to-mike-stonebraker/">Stonebraker kerfuffle</a>, Derrick Harris asked me a bunch of smart followup questions. My responses and afterthoughts include:</p>
<ul>
<li>Facebook et al. are in effect Software as a Service (SaaS) vendors, not enterprise technology users. In particular:
<ul>
<li>They have the technical chops to rewrite their code as needed.</li>
<li>Unlike packaged software vendors, they&#8217;re not answerable to anybody for keeping legacy code alive after a rewrite. That makes migration a lot easier.</li>
<li>If they want to write different parts of their system on different technical underpinnings, nobody can stop them. For example &#8230;</li>
<li>&#8230; <a href="http://www.dbms2.com/2008/07/21/project-cassandra-facebook-open-sourced-quasi-dbms/">Facebook originated Cassandra</a>, and is now heavily committed to HBase.</li>
</ul>
</li>
<li>It makes little sense to talk of Facebook&#8217;s use of &#8220;MySQL.&#8221; Better to talk of Facebook&#8217;s use of &#8220;MySQL + memcached + non-transparent sharding.&#8221; That said:
<ul>
<li>It&#8217;s hard to see why somebody today would use MySQL + memcached + non-transparent sharding for a new project. At least one of <a href="http://www.dbms2.com/2011/02/08/couchbase-membase-couchone-couchdb/">Couchbase</a> or <a href="http://www.dbms2.com/2011/02/24/transparent-sharding/">transparently-sharded</a> MySQL is very likely a superior alternative. Other alternatives might be better yet.</li>
<li>As noted above in the example of Facebook, the many major web businesses that are using MySQL + memcached + non-transparent sharding for existing projects can be presumed able to migrate away from that stack as the need arises.</li>
</ul>
</li>
</ul>
<p>Continuing with that discussion of DBMS alternatives:</p>
<ul>
<li>If you just want to write to the memcached API anyway, why not go with Couchbase?</li>
<li>If you want to go relational, why not go with MySQL? There are many alternatives for scaling or accelerating MySQL &#8212; dbShards, Schooner, Akiban, Tokutek, ScaleBase, ScaleDB, Clustrix, and Xeround come to mind quickly, so there&#8217;s a great chance that one or more will fit your use case. (And if you don&#8217;t get the choice of MySQL flavor right the first time, porting to another one shouldn&#8217;t be all THAT awful.)</li>
<li>If you really, really want to go in-memory, and don&#8217;t mind writing Java stored procedures, and don&#8217;t need to do the kinds of joins it isn&#8217;t good at, but do need to do the kinds of joins it is, VoltDB could indeed be a good alternative.</li>
</ul>
<p>And while we&#8217;re at it &#8212; going <strong>schema-free</strong> often makes a whole lot of sense. I need to write much more about the point, but for now let&#8217;s just say that I look favorably on the Big Four schema-free/NoSQL options of MongoDB, Couchbase, HBase, and Cassandra.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/15/facebook-mysql-nosql-voltdb-stonebraker/feed/</wfw:commentRss>
		<slash:comments>19</slash:comments>
		</item>
		<item>
		<title>Petabyte-scale Hadoop clusters (dozens of them)</title>
		<link>http://www.dbms2.com/2011/07/06/petabyte-hadoop-clusters/</link>
		<comments>http://www.dbms2.com/2011/07/06/petabyte-hadoop-clusters/#comments</comments>
		<pubDate>Wed, 06 Jul 2011 05:15:21 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cloudera]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[Web analytics]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4886</guid>
		<description><![CDATA[I recently learned that there are 7 Vertica clusters with a petabyte (or more) each of user data. So I asked around about other petabyte-scale clusters. It turns out that there are several dozen such clusters (at least) running Hadoop. Cloudera can identify 22 CDH (Cloudera Distribution [of] Hadoop) clusters holding one petabyte or more [...]]]></description>
			<content:encoded><![CDATA[<p>I recently learned that there are <a href="../../../../../2011/06/20/columnar-dbms-vendor-customer-metrics/">7 Vertica clusters with a petabyte</a> (or more) each of user data. So I asked around about other petabyte-scale clusters. It turns out that there are several dozen such clusters (at least) running Hadoop.</p>
<p>Cloudera can identify 22 CDH (Cloudera Distribution [of] Hadoop) clusters holding one petabyte or more of user data each, at 16 different organizations. This does not count Facebook or Yahoo, who are huge Hadoop users but not, I gather, running CDH. Meanwhile, Eric Baldeschwieler of Hortonworks tells me that Yahoo&#8217;s latest stated figures are:</p>
<ul>
<li>42,000 Hadoop nodes &#8230;</li>
<li>&#8230; holding 180-200 petabytes of data.</li>
</ul>
<p><span id="more-4886"></span>That works out near the low end of the range I came up with for Yahoo&#8217;s newest gear, namely <a href="http://www.dbms2.com/2011/07/06/hadoop-hardware-and-compression/">36-90 TB/node</a>. Yahoo&#8217;s biggest clusters are a little over 4,000 nodes (a limitation that&#8217;s getting worked on), and Yahoo has over 20 clusters in total.</p>
<p>Based on those numbers, it would seem that 10 or more of Yahoo&#8217;s Hadoop clusters are probably in the petabyte range. Facebook no doubt has a few petabyte-scale Hadoop clusters as well. So we&#8217;re probably over 3 dozen petabyte+ Hadoop clusters, just counting Yahoo, Facebook, and CDH users. There surely are others too, running Apache Hadoop without Cloudera&#8217;s help.</p>
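<p>The cluster-count arithmetic above can be checked in a few lines of Python. The inputs are the figures quoted in this post, with two simplifications that are assumptions of the sketch: &#8220;over 20 clusters&#8221; is taken as exactly 20, so the per-cluster averages are upper bounds, and &#8220;a few&#8221; Facebook clusters is read as 3.</p>
<pre>
```python
# Figures stated in the post.
yahoo_nodes = 42_000
yahoo_data_pb = (180, 200)  # stated range, in petabytes
yahoo_clusters = 20         # "over 20", so the averages below are upper bounds
cdh_pb_clusters = 22
yahoo_pb_clusters = 10      # "10 or more"

# Assumption, not from the post: read "a few" Facebook clusters as 3.
facebook_pb_clusters = 3

avg_pb_per_cluster = tuple(pb / yahoo_clusters for pb in yahoo_data_pb)
avg_nodes_per_cluster = yahoo_nodes / yahoo_clusters
total_pb_clusters = cdh_pb_clusters + yahoo_pb_clusters + facebook_pb_clusters

print(avg_pb_per_cluster)     # (9.0, 10.0)
print(avg_nodes_per_cluster)  # 2100.0
print(total_pb_clusters)      # 35
```
</pre>
<p>An average of roughly 9-10 petabytes per Yahoo cluster makes it plausible that at least half of them individually pass the petabyte mark, and a floor of 35 clusters is consistent with an estimate of about three dozen.</p>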
<p>We also have some more information about the scale of Hadoop usage, and the markets it is being used in, because Omer Trajman of Cloudera kindly wrote the following &#8212; lightly edited as usual &#8212; for quotation:</p>
<blockquote><p>The number of Petabyte+ Hadoop clusters expanded dramatically over the past year, with our recent count reaching 22 in production (in addition to the well-known clusters at Yahoo! and Facebook). Just as our poll back at Hadoop World 2010 showed the average cluster size at just over 60 nodes, today it tops 200. While mean is not the same as median (most clusters are under 30 nodes), there are some beefy ones pulling up that average. Outside of the well-known large clusters at Yahoo and Facebook, we count today 16 organizations running PB+ clusters running CDH across a diverse number of industries including online advertising, retail, government, financial services, online publishing, web analytics and academic research. We expect to see many more in the coming years, as Hadoop gets easier to use and more accessible to a wide variety of enterprise organizations.</p></blockquote>
<p>Omer went on to add:</p>
<blockquote><p>The biggest number of PB clusters are in the advertising space. I often tell people that every ad you see on the internet touched at least one Hadoop cluster (or the Google equivalent).</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/06/petabyte-hadoop-clusters/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>The essence of an application</title>
		<link>http://www.dbms2.com/2011/06/01/the-essence-of-an-application/</link>
		<comments>http://www.dbms2.com/2011/06/01/the-essence-of-an-application/#comments</comments>
		<pubDate>Wed, 01 Jun 2011 14:35:44 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Theory and architecture]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4590</guid>
		<description><![CDATA[Once upon a time, information technology was strictly about &#8212; well, information. And by &#8220;information&#8221; what was meant was &#8220;data&#8221;.* An application boiled down to a database design, plus a straightforward user interface, in whatever the best UI technology of the day happened to be. Things rarely worked quite as smoothly as the design-database/press-button/generate-UI propaganda [...]]]></description>
			<content:encoded><![CDATA[<p>Once upon a time, information technology was strictly about &#8212; well, information. And by &#8220;information&#8221; what was meant was &#8220;data&#8221;.* An application boiled down to a database design, plus a straightforward user interface, in whatever the best UI technology of the day happened to be. Things rarely worked quite as smoothly as the design-database/press-button/generate-UI propaganda would have one believe, but database design was clearly at the center of application invention.</p>
<p><em>*Not coincidentally, two of the oldest names for &#8220;IT&#8221; were</em> data processing <em>and</em> management information systems.</p>
<p>Eventually, there came to be <a href="http://www.monashreport.com/2006/04/06/microsoft-underscores-its-core-paradigm/">three views of the essence of IT</a>:</p>
<ul>
<li><strong>Data</strong> &#8211; i.e., the traditional view, still exemplified by IBM and Oracle.</li>
<li><strong>People empowerment</strong> &#8212; i.e., Microsoft-style emphasis on UI friendliness and efficiency.</li>
<li><strong>Operational workflow</strong> &#8212; i.e., SAP-style emphasis on actual business processes.</li>
</ul>
<p>Graphical user interfaces were a major enabling technology for that evolution. Equally important, relational databases made some difficult problems easy(ier), freeing application designers to pursue more advanced functionality.</p>
<p>Based on further technical evolution, specifically in analytic and consumer technologies, I think we should now take that list up to five. The new members I propose are:</p>
<ul>
<li><strong>Investigative analytics.</strong></li>
<li><strong>Emotional response.</strong></li>
</ul>
<p><em><span id="more-4590"></span>I may want to rename that last one someday, but &#8220;emotional response&#8221; will serve well enough for now.</em></p>
<p>At first blush, it might seem that <a href="../../../../../2011/03/03/investigative-analytics/">investigative analytics</a> could be regarded as straightforward database processing, perhaps based on some analytics-friendly data structure; but that&#8217;s not true today, and I can&#8217;t pinpoint a past era when I think it was true either. <strong>Defining and structuring the data is just a starting point, </strong>and is<strong> not tantamount to saying what will be done with it</strong>.* For example, exactly the same data mart might be used for:</p>
<ul>
<li>Conventional business intelligence &#8212;  reporting, query, charting and drill-down (&#8220;multi-dimensional&#8221; or otherwise).</li>
<li>Similar activities, but with a <a href="../../../../../2010/06/12/the-underlying-technology-of-qlikview/">QlikView</a> or <a href="../../../../../2011/04/18/endeca-topics/">Endeca</a> kind of spin.</li>
<li>Statistics and machine learning.</li>
<li>Graphical analysis.</li>
</ul>
<p>Perhaps the best example of an investigative-analytics-oriented company is Google, where nothing is acknowledged as true until it&#8217;s been statistically verified, and whose core service is the world&#8217;s most flexible research tool.</p>
<p><em>*True, in some cases the introduction of <a href="../../../../../2010/11/29/data-that-is-derived-augmented-enhanced-adjusted-or-cooked/">derived/cooked  data</a> could lead the data structure to evolve in an application-specific way. But I don&#8217;t think that seriously undermines the basic point.<br />
</em></p>
<p>Meanwhile, as Ray Wang likes to point out, <a href="http://blog.softwareinsider.org/2010/10/04/mondays-musings-how-the-five-consumer-tech-macro-pillars-influence-enterprise-software-innovation/">a large fraction of current IT innovation is consumer-centric</a>. I&#8217;m not sure I&#8217;d grant all his business-to-business sub-points to that claim, but it surely is true that:</p>
<ul>
<li>Consumer-oriented technology is much more sophisticated than it used to be.</li>
<li>A large fraction of what&#8217;s done in information technology is consumer- or at least customer-facing.</li>
</ul>
<p>The key defining trait of consumer/customer (as opposed to employee) computer use is that it is optional; your customers don&#8217;t have to do business with you, or use the applications and interfaces you offer. Consequently, it&#8217;s your responsibility to make sure that what you ask them to do is convenient and engaging (or fun, or at least not-annoying). And so, much more than before, you need to be concerned about users&#8217; emotional reactions to computing systems. Personalization fits in with both the investigative analytics and emotional-engagement themes. <strong>Investigative analytics is why you can personalize people&#8217;s experiences </strong>(web surfing, shopping, gaming, etc.); <strong>emotional response is why you must.</strong> Companies I&#8217;d put in the &#8220;emotional&#8221; camp include Facebook, whose central concept is &#8220;friend&#8221;; Zynga, whose customer-retention strategy seemingly boils down to addiction; and &#8212; back in its heyday &#8212; AOL.</p>
<p><strong><em>Related link</em></strong></p>
<ul>
<li><a href="../../../../../2010/03/23/software-innovation-patent/">Three kinds of software innovation</a><strong></strong></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/06/01/the-essence-of-an-application/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Notes from the Fusion-io S-1 filing</title>
		<link>http://www.dbms2.com/2011/05/24/notes-from-the-fusion-io-s-1-filing/</link>
		<comments>http://www.dbms2.com/2011/05/24/notes-from-the-fusion-io-s-1-filing/#comments</comments>
		<pubDate>Tue, 24 May 2011 08:53:23 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Storage]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4549</guid>
		<description><![CDATA[Fusion-io has filed for an initial public offering. With public offerings go S-1 filings which, along with 10-Ks, are the kinds of SEC filing that typically contain a few nuggets of business information. Notes from Fusion-io&#8217;s S-1 include: Fusion-io is growing very, very fast, doubling or better in revenue every 6 months. Fusion-io&#8217;s marketing message [...]]]></description>
			<content:encoded><![CDATA[<p>Fusion-io has filed for an initial public offering. With public offerings go S-1 filings which, along with 10-Ks, are the kinds of SEC filing that typically contain a few nuggets of business information. Notes from <a href="http://sec.gov/Archives/edgar/data/1383729/000095012311023375/f58285sv1.htm">Fusion-io&#8217;s S-1</a> include:</p>
<p>Fusion-io is growing very, very fast, <strong>doubling or better in revenue every 6 months.</strong></p>
<p>Fusion-io&#8217;s marketing message revolves around &#8220;data centralization&#8221;. <strong>Fusion-io is competing against storage-area networks and storage arrays.</strong></p>
<p>Fusion-io&#8217;s list of application types includes</p>
<blockquote><p>&#8230; systems dedicated to decision     support, high performance financial analysis, web search,     content delivery and enterprise resource planning.</p></blockquote>
<p>Fusion-io says it has shipped <strong>over 20 petabytes of storage.</strong></p>
<p>Fusion-io has a shifting array of big customers, including OEMs:  <span id="more-4549"></span></p>
<blockquote><p>Historically, large purchases by a relatively limited number of     customers have accounted for a substantial majority of our     revenue, and the composition of the group of our largest     customers changes from period to period. Many of our customers     make concentrated purchases to complete or upgrade specific     large-scale data storage installations. These concentrated     purchases are short-term in nature and are typically made on a     purchase order basis rather than pursuant to long-term     contracts. During fiscal 2010 and the six months ended     December 31, 2010, sales to the 10 largest customers in     each period, including the applicable OEMs, accounted for     approximately 75% and 92% of revenue, respectively. Facebook,     Inc. is currently our largest customer and accounted for a     substantial portion of revenue during the six months ended     December 31, 2010. We expect revenue from sales to Facebook     and one other end-user to account for a substantial portion of     revenue for the three months ending March 31, 2011, but     that revenue from sales to Facebook and the other end-user will     decline significantly for the three months ending June 30,     2011 as they complete their planned deployments.</p></blockquote>
<p>But Fusion-io invests enough in sales and marketing, including direct sales, that I&#8217;m guessing they&#8217;re out there persuading end-users to ask for product from Dell, HP, and IBM.</p>
<p>Fusion-io&#8217;s inventory growth of $23.3 million for the second half of 2010 is close to revenue of $26.0 million. Accounts receivable is a much smaller figure. I&#8217;m not sure what all that signifies, but I do find it ironic that Fusion-io&#8217;s marketing statements draw an analogy to &#8220;just-in-time&#8221; manufacturing.</p>
<p>As for what I think about Fusion-io, it starts:</p>
<ul>
<li>Fusion-io&#8217;s ideas are smart.</li>
<li>My skepticism about <a href="http://www.dbms2.com/2011/05/23/databases-ram/">specialized storage hardware for database applications</a> applies in part but not in whole to Fusion-io.</li>
<li>Right now, Fusion-io has won the market. Even if you don&#8217;t need Fusion-io hardware to optimize your use of solid-state memory, you&#8217;re apt to go with/partner with Fusion-io anyway.</li>
</ul>
<p>I don&#8217;t have strong opinions as to how long the last point will remain true.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/05/24/notes-from-the-fusion-io-s-1-filing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The technology of privacy threats</title>
		<link>http://www.dbms2.com/2011/01/11/the-technology-of-privacy-threats/</link>
		<comments>http://www.dbms2.com/2011/01/11/the-technology-of-privacy-threats/#comments</comments>
		<pubDate>Tue, 11 Jan 2011 15:15:21 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Liberty and privacy]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=3542</guid>
		<description><![CDATA[This post is the second of a series. The first one was an overview of privacy dangers, replete with specific examples of kinds of data that are stored for good reasons, but can also be repurposed for more questionable uses. More on this subject may be found in my August, 2010 post Big Data is [...]]]></description>
			<content:encoded><![CDATA[<p><em>This post is the second of a series. The first one was <a href="http://www.dbms2.com/2011/01/10/privacy-dangers-an-overview/">an overview of privacy dangers</a>, replete with specific examples of kinds of data that are stored for good reasons, but can also be repurposed for more questionable uses. More on this subject may be found in my August, 2010 post <a href="http://www.dbms2.com/2010/08/11/big-data-is-watching-you/">Big Data is Watching You!</a><br />
</em></p>
<p>There are two technology trends driving electronic privacy threats. Taken together, these trends raise scenarios such as the following:</p>
<ul>
<li>Your web surfing behavior indicates you&#8217;re a sports car buff, and you further like to look at pictures of scantily-clad young women. A number of your Facebook friends are single women. As a result, you&#8217;re deemed at risk of having a mid-life crisis and divorcing your wife, which increases the interest rate you have to pay when refinancing your house.</li>
<li>Your cell phone GPS      indicates that you drive everywhere, instead of walking. There is no      evidence of you pursuing fitness activities, but forum posting activity      suggests you&#8217;re highly interested in several TV series. Your credit card      bills show that your taste in restaurant food tends to the fatty. Your      online photos make you look fairly obese, and a couple have ashtrays in      them. As a result, you&#8217;re judged a high risk of heart attack, and your      medical insurance rates are jacked up accordingly.</li>
<li>You did actually have that      mid-life crisis and get divorced. At the child-custody hearing, your ex-spouse&#8217;s      lawyer quotes a study showing that football-loving upper income      Republicans are 27% more likely to beat their children than yoga-class-attending      moderate Democrats, and the probability goes up another 8% if they ever      bought a jersey featuring a defensive lineman. What&#8217;s more, several of the      more influential people in your network of friends also fit angry-male      patterns, taking the probability of abuse up another 13%. Because of the      sound statistics behind such analyses, the judge listens.</li>
</ul>
<p>Not all these stories are quite possible today, but they aren&#8217;t far off either.</p>
<p><span id="more-3542"></span>One of the supporting trends, pretty obvious, is that there <strong>is a lot more electronic information than there used to be.</strong> Indeed:</p>
<ul>
<li>Sufficient information exists to provide a <strong>very detailed picture of our activities.</strong></li>
<li>Much of it is recorded for <strong>very good and beneficial reasons.</strong> We wouldn&#8217;t want that  part to stop.</li>
<li>This information is <strong>inevitably available to government.</strong></li>
</ul>
<p>Here&#8217;s what I mean by the inevitability claim. Whether or not you think anti-terrorism concerns are overblown, as a practical matter your fellow voters* will allow a broad range of governmental information access. Besides, just the widely-available credit card and similar commercial data is enough to provide a fairly detailed picture of what you&#8217;re up to. In most countries, anti-pornography, anti-file-sharing, and/or general civilian law enforcement efforts serve to strengthen the point further.</p>
<p><em>*If you live in a country too unfree for voters to much matter, then it is surely also the case that governmental information has few practical limits.</em></p>
<p>Examples of information being tracked (more particulars were covered in <a href="http://www.dbms2.com/2011/01/10/privacy-dangers-an-overview/">the first post of this series</a>):</p>
<ul>
<li>Almost everything we buy is recorded, via credit card transactions, point-of-sale data, and/or website transaction records. This data is summarized in files covering 100s of millions of individuals, with 1000s of fields per person. Those files can be used for a broad variety of business or law enforcement purposes.</li>
<li>That data gives a great picture of what we eat, where we commute or travel, what we pay attention to, and so on.</li>
<li>All our other financial information also passes through computer systems, such as at banks.</li>
<li>Increasingly, our physical movements are tracked more directly, via cell phones (our own), police cameras, and the like.</li>
<li>Other than face-to-face conversations, almost all our communications are electronic. Even social media non-adopters rely heavily on telephones, email, and the like.</li>
<li>Increasingly, our reading and viewing entertainment choices are electronically recorded as well.</li>
</ul>
<p><strong>Most of that data is available to law enforcement departments. </strong>Much of it is available to<strong> commercial companies </strong>as well.</p>
<p>And these vast amounts of data will hardly go to waste. The second major technological trend in play is that <strong>the data can be much more effectively analyzed </strong>than before. New kinds of, and greater effectiveness in, <strong>analytic profiling create whole new levels of exposure</strong> (using the word &#8220;exposure&#8221; in its most literal sense), in at least three ways:</p>
<ul>
<li><strong>Relationship profiling.*</strong> <a href="../../../../../2009/08/21/social-network-analysis-aka-relationship-analytics/">Relationship analytics</a> technology has been around for a while. When it&#8217;s used to find bad guys (terrorists, fraudsters, etc.), that&#8217;s one thing. But some of the marketing uses are spookier. Marketing-like uses applied back to governmental surveillance could be spookier yet.<strong></strong></li>
<li><strong>Propensity profiling.* </strong>A huge fraction of what happens in big data analytics is figuring out what you&#8217;re likely to buy, vote for, look at, click on, react to, or think. Marketers getting that right can be a bit creepy. So can marketers getting it wrong. Governments doing the same thing could be much creepier yet.</li>
<li><strong>De-anonymization</strong>.* You may think you can be anonymous online, but you really can&#8217;t. Also, it&#8217;s getting ever harder to keep your roles or activities online separate from each other.</li>
</ul>
<p><em>* I just coined the terms &#8220;relationship profiling&#8221; and &#8220;propensity profiling.&#8221; &#8220;De-anonymization,&#8221; however, has been in use for a while.</em></p>
<p>Classical <strong>relationship profiling</strong> questions include assessing who has a close relationship with whom, who influences whom, who influences lots of people, etc. The most obvious data to infer this from is communication &#8212; who called whom, how long they talked, who they called next, what time of day this all happened, and so on. Anti-terrorist uses are obvious. A major marketing use is telcos &#8212; who of course have this data &#8212; deciding who to offer their best deals to, by trying to identify who influences the most other customers. These calculations of course involve comparing lots of data, mainly about people who are NOT targets of terrorist investigation or preferential telephone service pricing.</p>
<p>Much of Facebook&#8217;s $50 billion valuation hinges on the assumption it can do similar things based on the &#8220;social graph&#8221; it infers from informal communication among friends. To date that assumption has been <a href="http://www.dbms2.com/2010/06/08/profile-of-revealed-preferences/">questionable</a>, but we&#8217;re still in the very early days. Meanwhile, cruder methods of analyzing social influence are used. But the trend is clear &#8212; <strong>marketers want to use technology to identify social leaders, influence them however they can, and hope that the rest of us follow along baaing.</strong> Up to a point, that&#8217;s actually OK &#8212; learning things from our friends and acquaintances is an important and pleasant part of living in a society. And political campaigners have been doing it for generations, in the most low-tech of fashions. Still, it&#8217;s one thing for such targeting of leaders to be transparent; if done surreptitiously, it suddenly starts to feel a lot more sinister.</p>
<p>For years, <strong>propensity profiling</strong> has been an area of huge investment and technological progress. It&#8217;s the central application of <a href="http://www.dbms2.com/category/analytics-technologies/data-warehouse/">big data analytics</a>, and the heart of the business for many companies I write about, or that are my clients. Credit files, web logs, other marketing responses, census information, and other data are combined to infer:</p>
<ul>
<li>Your income, household      composition, age, race, education, and other basic demographics.</li>
<li>Your buying, voting,      reading, viewing, and other consuming interests in minute psychographic      detail.</li>
<li>Your feelings about      particular companies and brands, your propensity to become or stop being      their customer, and what kinds of advertisements or offers it would take      to influence you.</li>
<li>Your status as a credit      risk.</li>
<li>The chance you are      committing or will commit fraud.</li>
</ul>
<p>This has been going on since at least the 1990s, especially in service industries with &#8220;loyalty card&#8221; kinds of programs, such as retail or travel/leisure. In the credit case it&#8217;s been going on longer than that. But new data sources, processed by new analytic technologies, have brought the practice to a vastly greater height.</p>
<p>Finally &#8212; in case you care about being anonymous online, you&#8217;re running out of luck. <strong>De-anonymization </strong>analytics are getting too good. The <a href="https://www.eff.org/deeplinks/2009/09/what-information-personally-identifiable">Electronic Frontier Foundation&#8217;s de-anonymization overview</a> in 2009 was one of many articles pointing out that it often was possible to attach a specific name to online activities that in theory don&#8217;t track personally identifiable information. Meanwhile, at a talk I attended in May, 2010, <a href="../../../../../2010/04/18/washington-dc-may-2010-big-data-summi/">comScore</a> spoke of its successful efforts to tie various anonymous online activities, such as visits to different websites, to each other. And after I entered &#8220;usinger.com&#8221; into my browser address bar, I started seeing ads for Usinger sausages at a variety of prominent websites.</p>
<p>I&#8217;m not sure how much of a privacy threat de-anonymization technology is in and of itself, but it certainly provides support to both relationship and propensity profiling.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/01/11/the-technology-of-privacy-threats/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Privacy dangers &#8212; an overview</title>
		<link>http://www.dbms2.com/2011/01/10/privacy-dangers-an-overview/</link>
		<comments>http://www.dbms2.com/2011/01/10/privacy-dangers-an-overview/#comments</comments>
		<pubDate>Mon, 10 Jan 2011 16:14:53 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Facebook]]></category>
		<category><![CDATA[GIS and geospatial]]></category>
		<category><![CDATA[Health care]]></category>
		<category><![CDATA[Liberty and privacy]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=3511</guid>
		<description><![CDATA[This post is the first of a series. The second one delves into the technology behind the most serious electronic privacy threats. The privacy discussion has gotten more active, and more complicated as well. A year ago, I still struggled to get people to pay attention to privacy concerns at all, at least in the [...]]]></description>
			<content:encoded><![CDATA[<p>
<em>This post is the first of a series. The second one delves into <a href="http://www.dbms2.com/2011/01/11/the-technology-of-privacy-threats/">the technology behind the most serious electronic privacy threats</a>.</em> </p>
<p>The privacy discussion has gotten more active, and more complicated as well. A year ago, I still struggled to get people to pay attention to privacy concerns at all, at least in the United States, with <a href="../../../../../2010/01/31/data-based-snooping-threat-libert/">my first public breakthrough</a> coming at the end of January. But much has changed since then.</p>
<p>On the <strong>commercial</strong> side, Facebook modified its privacy policies, garnering great press attention and an intense user backlash, leading to a quick <a href="http://www.nytimes.com/2010/05/27/technology/27facebook.html">partial retreat</a>. The <em>Wall Street Journal</em> then launched a long series of articles &#8212; 13 so far &#8212; recounting multiple kinds of <a href="http://online.wsj.com/public/page/what-they-know-digital-privacy.html">privacy threats</a>. Other media joined in, from <em><a href="http://blogs.forbes.com/kashmirhill/">Forbes</a> </em>to <em><a href="http://news.cnet.com/privacy-inc/">CNet</a>.</em> Various forms of US government rule-making to <a href="http://www.ecommerce-guide.com/solutions/advertising/article.php/3906466/Ad-Groups-to-Rally-Against-Federal-Privacy-Rules.htm">inhibit advertising-related tracking</a> have been proposed as an apparent result.</p>
<p>In the US, the <strong>government</strong> had a lively year as well. The Transportation Security Administration (TSA) rolled out what have been dubbed &#8220;porn scanners,&#8221; and backed them up with &#8220;<a href="http://twitter.com/#%21/julie_craig/status/4336151733207040">enhanced patdowns</a>.&#8221; For somebody who is, for example, female, young, a sex abuse survivor, and/or a follower of certain religions, those can be highly unpleasant, if not traumatic. Meanwhile, the Wikileaks/Cablegate events have spawned a government reaction whose scope is only beginning to be seen. A couple of &#8220;highlights&#8221; so far are some very nasty <a href="http://www.salon.com/news/opinion/glenn_greenwald/2010/11/09/manning/index.html">laptop seizures</a>, and the recent demand for <a href="http://twitter.com/#%21/wikileaks/status/23939621570215936">information on over 600,000 Twitter accounts</a>. (<a href="http://paranoia.dubfire.net/2011/01/thoughts-on-doj-wikileakstwitter-court.html">Christopher Soghoian</a> provided a detailed, nuanced legal analysis of same.)</p>
<p>At this point, it&#8217;s fair to say there are at least<strong> six different kinds of legitimate privacy fear. </strong><span id="more-3511"></span>The first five I have in mind are:</p>
<ul>
<li><strong>Governmental force. </strong>Your web browsing history can be used against you in a court of law. Profiling &#8212; perhaps based on big data analytics &#8212; might make you a target for law enforcement or anti-terrorism investigation as well. And between financial transaction records, communication records, physical movement data, and more, <strong>the government&#8217;s ability to gather and process information about people is effectively unlimited.</strong></li>
<li><strong>Private sector discrimination. </strong>There are many ways private companies could use detailed profiles &#8212; or just incriminating photos &#8212; to your detriment. They could fire you, not hire you, <a href="http://yro.slashdot.org/firehose.pl?op=view&amp;type=story&amp;sid=10/11/23/0344243">deny you insurance or credit</a>, or simply not give you their best possible price.<strong></strong></li>
<li><strong>Identity theft. </strong>Phishing happens.<strong></strong></li>
<li><strong>Social pressure and stalking.</strong> What your friends, neighbors, and classmates know about you can become a serious problem, especially if they gang up and cyber-bully you.  That your violent ex-boyfriend can track you could be a bigger problem yet.<strong></strong></li>
<li><strong>Embarrassment and creep-out.</strong> Some people REALLY don&#8217;t like being viewed naked in busy public places. Some people are bothered by weirdly (in)appropriate advertisements. Maybe there&#8217;s no obvious tangible harm other than their uncomfortable feelings &#8212; but their discomfort is serious even so. <strong></strong></li>
</ul>
<p>The sixth is:</p>
<ul>
<li>Throwing out the baby with the bathwater in a<strong> backlash. </strong>I think medical privacy rules already cost lives, in <a href="../../../../../2010/10/10/xldb4-xldb/">research</a> and <a href="../../../../../2010/09/13/reconciling-medical-privacy-and-elder-care/">treatment</a> alike. <em>(Forbes</em> offers multiple examples of <a href="http://www.forbes.com/forbes/2009/0608/034-privacy-research-hidden-cost-of-privacy_2.html">life-saving research being stopped by HIPAA</a>.) I think there&#8217;s an inevitable tradeoff between intrusive physical security measures, such as the TSA&#8217;s various forms of security theater, and spooky behind-the-scenes profiling and surveillance. (Unfortunately, we can hardly live in a society without some kinds of security measures &#8212; and even if we could, voters would never allow it.) Proposals that would hamstring the internet advertising industry currently seem unlikely to pass &#8212; but how much uproar would it take before that changed?</li>
</ul>
<p>You probably knew most of that already. Even so, here are a number of examples and links.</p>
<p><a href="http://search.slashdot.org/story/11/01/04/2346201/Unwise-mdash-Search-History-of-Murder-Methods">Slashdot</a> has more on the Jensen case, in which a man was <strong>convicted of murder</strong> in no small part because his computer revealed that he had searched for information about murder methods, including one that duplicated the actual method of his wife&#8217;s death. The money quote in an <a href="http://www.wicourts.gov/ca/opinion/DisplayDocument.html?content=html&amp;seqNo=58315">appeals court decision</a> is in Section 37.1, which characterizes &#8220;computer evidence&#8221; as &#8220;probably the most incriminating other evidence.&#8221;</p>
<p>Obviously, that particular trial had the correct outcome. But would you want a court to hear about the research you did about minimizing your taxes? What about when you considered changing jobs, something that might be of interest both in employment and child custody litigation? How about your viewing of sexy images other than those of your spouse? And by the way &#8212; just what kinds of online viewing or writing should get you on a terrorist watch list?</p>
<p>It&#8217;s getting ever more practical to <strong>track our actions and movements.</strong> The privacy implications are potentially grave &#8212; would you want THAT kind of information used in court, or for decisions about your insurance? Accordingly, the Electronic Frontier Foundation coined the word &#8220;<a href="https://www.eff.org/deeplinks/2010/12/what-traitorware">traitorware</a>&#8221; to describe and call attention to consumer (mainly) devices that keep or even transmit records of your movements and doings. And mind-bogglingly, <em>Forbes</em> says that Sprint &#8220;<a href="http://www.forbes.com/forbes/2010/1206/technology-chris-soghoian-federal-trade-commission-agent-provocateur.html">turned customers&#8217; GPS information over to law enforcement 8 million times in a year</a>.&#8221;</p>
<p>But it&#8217;s not just your own devices. The <em>New York Times</em> recently wrote of uses for <a href="http://www.nytimes.com/2011/01/02/science/02see.html">smart video technology</a>. Enforcing good hand-washing procedure on doctors sounds great. But how about enforcing busy-ness on cubicle workers? Watching movie previewers&#8217; emotional reactions to specific scenes and characters seems kosher. But what if <a href="http://www.myce.com/news/going-to-the-movies-prepare-to-be-watched-while-you-watch-36138/">all movie-goers</a> are watched? Or what if this is done at supermarkets or <a href="http://news.discovery.com/tech/future-advertising-has-eye-on-you.html">vending machines</a>, perhaps even repeatedly over time? Could be creepy, huh? And by the way, while it could be a way to get you great discounts and service, it could also be a way to categorize you as unworthy of same.</p>
<p>Meanwhile, the <em>Washington Post</em> reminds us that <a href="http://www.washingtonpost.com/wp-dyn/content/article/2011/01/01/AR2011010102690_2.html">battlefield video surveillance technology</a> is getting ever better. Obviously, such technology could be used for domestic law enforcement as well. More immediately, the <em><a href="http://projects.washingtonpost.com/top-secret-america/articles/monitoring-america/">Washington Post</a></em> tells us of things like automatic recognition of license plates being used by police departments that repurpose anti-terrorism grants to general law enforcement. (Analogies to the 1980s habit of tying every grant proposal to the Strategic Defense Initiative &#8220;Star Wars&#8221; project are probably not misplaced.)</p>
<p>But the benefits forgone if we don&#8217;t use this technology are also scary. The biggest current examples are probably the ones I cited above &#8212; medical research, anti-terrorism, and the whole online advertising industry. But even more examples of clever electronic aids with privacy implications are coming down the pike. Most notably &#8212; as the population ages, and nursing homes continue to be miserable (and costly) places, revolutionizing <a href="../../../../../2010/04/20/big-brother-watching-our-parents/">elder</a> <a href="http://www.nytimes.com/2010/07/29/garden/29parents.html">care</a> becomes ever more desirable. Well, sensor technology will soon be able to watch over old folks, keeping them safe(r) in their independence &#8212; but only if it is highly detailed or intrusive.</p>
<p>Also:</p>
<ul>
<li><em>Slashdot/</em><em>New York Times</em> had an article on a <a href="http://news.slashdot.org/story/11/01/07/192201/California-County-Bans-SmartMeter-Installations">Marin County ordinance</a> forbidding smart electrical meters. Part of the reason is a San Francisco area backlash against the privacy implications of having your electricity usage recorded moment to moment.</li>
<li><em>Slashdot</em> also points us at an article about <a href="http://tech.slashdot.org/story/11/01/03/203227/Using-Technology-To-Enforce-Good-Behavior">technology you use for self-control</a> in various ways. In principle, however, the &#8220;self&#8221; part is an optional feature.</li>
<li><em>Computing Now</em> offered an overview of <a href="http://www.computer.org/cms/Computer.org/ComputingNow/homepage/2010/1110/W_CO_OnBodySensing.pdf">on-body sensing technology</a>. Mainly it talks about gaming and medical applications, but it also suggests that people are receptive to putting tracking technology on themselves in crowds in return for certain emergency-response benefits.</li>
</ul>
<p>Finally, privacy-threatening observations of our web use, postings, and other internet communications are much too numerous to list. But some high/lowlights include:</p>
<ul>
<li><a href="http://www.readwriteweb.com/archives/web_user_interest_data.php">AddThis now claims to track 1 billion unique individuals online</a>.</li>
<li>In its transparency reports, Google publishes a count <a href="http://www.google.com/transparencyreport/governmentrequests/">of how often various governments ask it for data</a>. For the first half of 2010, leaders included the US (4287), Brazil (2435), India (1420), the UK (1343), and France (1017). Argentina, Australia, Chile, Germany, Italy, Singapore, Spain, and Taiwan also made over 100 requests each.</li>
<li>Verizon had <a href="http://www.nytimes.com/2011/01/10/technology/10privacy.html?_r=1">over 90,000 such requests</a> in 2007.</li>
<li>An internet service provider who successfully fought off an FBI information request nonetheless was kept under a <a href="http://www.wired.com/threatlevel/2010/08/nsl-gag-order-lifted/">gag order</a> for six years about the case.</li>
<li>The Obama Administration wants ever more <a href="http://www.washingtonpost.com/wp-dyn/content/article/2010/07/28/AR2010072806141.html?wpisrc=nl_headline">new legal powers</a> to get electronic information without court order or notice to the investigation&#8217;s subjects.</li>
<li>The UK government, or parts thereof, wants to <a href="http://yro.slashdot.org/firehose.pl?op=view&amp;type=story&amp;sid=10/10/20/1958209">track everything you do online or telephonically</a>, except &#8212; supposedly &#8212; the actual contents of your calls and messages. Similarly <a href="http://yro.slashdot.org/story/10/10/29/026250/Australias-Privacy-Boss-Slams-Govt-Data-Retention-Scheme?from=headlines">Australia</a>. There&#8217;s an <a href="http://www.crn.com.au/News/215135,lessons-learned-from-europes-data-retention-laws.aspx">EU directive</a> along those lines as well, although the EU also has some strong <a href="http://yro.slashdot.org/story/10/11/05/0411231/EU-Commission-Says-People-Have-a-Right-To-Be-Forgotten-Online?from=headlines">data retention safeguards</a>.</li>
<li>Various governments &#8212; India, the <a href="http://www.theregister.co.uk/2010/07/27/blackberry_uae/">United Arab Emirates</a>, and many more &#8212; have recently insisted that communication offerings such as BlackBerry provide them with back doors to snoop, on pain of being banned countrywide.</li>
<li>Informal methods of data acquisition are also used. For example, police departments use fake identities (with photos of hot girls, of course) to <a href="http://lacrossetribune.com/news/local/article_0ff40f7a-d4d1-11de-afb3-001cc4c002e0.html">friend young folks on Facebook to look for photos of underage drinking</a>.</li>
<li>A CNN survey article showed <a href="http://www.cnn.com/2010/TECH/web/12/13/end.of.privacy.intro/index.html">how extreme public data revelation can get</a>, using the example of Louis Gray.</li>
<li>For its own marketing reasons, DuckDuckGo offers a simple overview of <a href="http://donttrack.us/">how your web searches could/can be used against you</a>.</li>
<li>And as an example of how <a href="http://yro.slashdot.org/story/10/08/17/1229217/HP-CEOs-Browsing-History-Used-Against-Him">even petty internet uses can cause problems</a>, one of many things that got former HP CEO Mark Hurd into trouble was looking at racy online pictures of business associate Jodie Fisher.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/01/10/privacy-dangers-an-overview/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>

