<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DBMS 2 : DataBase Management System Services &#187; Telecommunications</title>
	<atom:link href="http://www.dbms2.com/category/applications/telecommunications/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dbms2.com</link>
	<description>Choices in data management and analysis</description>
	<lastBuildDate>Thu, 09 Feb 2012 09:21:51 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.3</generator>
		<item>
		<title>Agile predictive analytics &#8211; the heart of the matter</title>
		<link>http://www.dbms2.com/2011/11/28/agile-predictive-analytics-the-heart-of-the-matter/</link>
		<comments>http://www.dbms2.com/2011/11/28/agile-predictive-analytics-the-heart-of-the-matter/#comments</comments>
		<pubDate>Mon, 28 Nov 2011 19:40:26 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[SAS Institute]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5746</guid>
		<description><![CDATA[I&#8217;ve already suggested that several apparent issues in predictive analytic agility can be dismissed by straightforwardly applying best-of-breed technology, for example in analytic data management. At first blush, the same could be said about the actual analysis, which comprises: Data preparation, which is tedious unless you do a good job of automating it. Running the [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve already suggested that several apparent issues in <a href="http://www.dbms2.com/2011/11/28/agile-predictive-analytics-the-easy-parts/">predictive analytic agility</a> can be dismissed by straightforwardly applying best-of-breed technology, for example in analytic data management. At first blush, the same could be said about the actual analysis, which comprises:</p>
<ul>
<li>Data preparation, which is tedious unless you do a good job of automating it.</li>
<li>Running the actual algorithms.</li>
</ul>
<p>Numerous statistical software vendors (or open source projects) help you with the second part; some make strong claims in the first area as well (e.g., my clients at KXEN). Even so, large enterprises typically have statistical silos, commonly featuring expensive annual SAS licenses and seemingly slow-moving SAS programmers.</p>
<p>As I see it, the predictive analytics workflow goes something like this<span id="more-5746"></span>:</p>
<ul>
<li>Business-knowledgeable people develop a theory as to what kinds of information and segmentation could be valuable in making better business micro-decisions.</li>
<li>Statistics-knowledgeable people determine a structure for modeling that reflects this theory.</li>
<li>Statistics-knowledgeable people tweak the model over time, within a fixed general structure, as new data comes in.</li>
<li>(Optional) Somebody sees to acquiring whatever data is needed that the organization doesn&#8217;t already have (and won&#8217;t get in the ordinary course of ongoing business).</li>
</ul>
<p>The optional last part can be a purchase of third-party information (relatively fast and easy) or the development of a business process (and if necessary associated software) to capture the information (not always so easy). But even if that&#8217;s taken care of, or not present, we have at least two hand-offs where agility can be lost:</p>
<ul>
<li>Businesspeople may throw a request &#8220;over the wall&#8221; to the statisticians, who then work on it as their schedule permits.</li>
<li>Once created, a model may be so set in stone that even small changes are as hard as building a new model from scratch.</li>
</ul>
<p>The second problem can be solved by the statisticians themselves, without outside involvement. Model research and model refinement should be separate processes. You can recheck your clustering on one schedule, but recalibrate your regressions against each cluster more frequently. If that all sounds forbiddingly difficult, perhaps your model recalibration process needs another level of automation.</p>
<p>So I&#8217;ve finally gotten to the point of saying what may have been obvious from the start: <strong>The only excusable impediment to predictive analytic agility is the hand-off from the people who know the business to the people who know the math.</strong> So let&#8217;s examine ways that difficulty can be resolved.</p>
<p>At big internet companies, the usual answer is something like</p>
<blockquote><p>Hey, it&#8217;s just data. From web logs. And network event logs. The data scientists know how to handle that.</p></blockquote>
<p>In financial trading firms, the answer is more</p>
<blockquote><p>The traders and analysts work closely together. Very closely. In fact, when the traders rip out their phones and throw them across the room, the analysts need to duck to avoid getting clobbered.</p></blockquote>
<p>In credit card or telecom marketing or insurance actuarial organizations, the answer may be</p>
<blockquote><p>Don&#8217;t worry; the stats geeks have been at this for a long time; they really do understand our business.</p></blockquote>
<p>All three approaches work.</p>
<p>But what about conventional enterprises, where line-of-business people may not be as math-savvy as internet developers or financial traders, and where the math experts may not have the business issues down cold? My flippant answer is that businesspeople should know some math too.* My more serious answer is that <strong>the &#8220;business analyst&#8221; role should be expanded </strong>beyond BI and planning<strong> to include lightweight predictive analytics as well.</strong></p>
<p><em>*I wasn&#8217;t being entirely flippant, of course. Statistics is even being taught in high school these days. And when I got a PhD in game theory, 2/3 of my thesis committee was at the Harvard Business School.</em></p>
<p>For example, at retailers:</p>
<ul>
<li>Market basket analysis is pretty simplistic (it only looks at small subsets of a basket at a time).</li>
<li>Seasonality is tricky. (Weather and so on can skew it.)</li>
<li>Each store or region can be its own universe.</li>
<li>Some of the results of analytics are rather coarse-grained &#8212; e.g., merchandise adjacencies &#8212; so precision in statistical analysis may not matter much anyway.</li>
</ul>
<p>And so truly rigorous statistical analysis may be both unfeasible and unnecessary; a lot of business-informed seat-of-the-pants reasoning needs to be mixed in. Consequently, there&#8217;s a lot to be said for pushing at least some retail predictive analytics pretty close to the merchandising department(s).</p>
<p>Similar stories could be told in many other industries and pursuits, including but emphatically not limited to:</p>
<ul>
<li>Event marketing.</li>
<li>College admissions.</li>
<li>Political campaigning.</li>
<li>Field maintenance at utility companies.</li>
<li>Price-setting (across many industries).</li>
</ul>
<p>In each case, it&#8217;s easy to see how statistical and predictive analytic techniques could add real value to the business. But it&#8217;s hard to imagine how the enterprise could support the kind of large, experienced, business-knowledgeable analytic operation one might find in hedge fund investing or telecom churn analysis. And absent that, it&#8217;s tough to see why the only people doing predictive analytics for the organization should sit in some silo of statistical expertise.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/11/28/agile-predictive-analytics-the-heart-of-the-matter/feed/</wfw:commentRss>
		<slash:comments>19</slash:comments>
		</item>
		<item>
		<title>MongoDB users and use cases</title>
		<link>http://www.dbms2.com/2011/07/27/mongodb-users-and-use-cases/</link>
		<comments>http://www.dbms2.com/2011/07/27/mongodb-users-and-use-cases/#comments</comments>
		<pubDate>Wed, 27 Jul 2011 18:14:36 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data models and architecture]]></category>
		<category><![CDATA[Games and virtual worlds]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MongoDB and 10gen]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Solid-state memory]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Splunk]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5031</guid>
		<description><![CDATA[I spoke with Eliot Horowitz and Max Schierson of 10gen last month about MongoDB users and use cases. The biggest clusters they came up with weren&#8217;t much over 100 nodes, but clusters an order of magnitude bigger were under development. The 100 node one we talked the most about had 33 replica sets, each with [...]]]></description>
			<content:encoded><![CDATA[<p>I spoke with Eliot Horowitz and Max Schierson of 10gen last month about MongoDB users and use cases. The biggest clusters they came up with weren&#8217;t much over 100 nodes, but clusters an order of magnitude bigger were under development. The 100 node one we talked the most about had 33 replica sets, each with about 100 gigabytes of data, so that&#8217;s in the 3-4 terabyte range total. In general, the largest MongoDB databases are 20-30 TB; I&#8217;d guess those really do use the bulk of available disk space.   <span id="more-5031"></span></p>
<p>10gen recommends solid-state storage in many cases. In some cases solid-state lets you get away with fewer total nodes. 10gen also likes Flashcache (Facebook-developed technology to put a flash cache in front of hard disks). But the 100-node example mentioned above uses spinning disk.</p>
<p>Use cases 10gen is proud of include:</p>
<ul>
<li>Lots of user profile maintenance, including at online ad companies. This includes full user ad impression data. (I&#8217;ve argued for a while that <a href="../../../../../2010/09/17/jp-morgan-chase-oracle-database-outage/">user profile information belongs in something like a NoSQL database</a>.)</li>
<li>A big-name web company that wants to inspect every packet that enters their network, and replaced Splunk with MongoDB for performance reasons.</li>
<li>A big-name photo/video site whose metadata is all in MongoDB. (That&#8217;s the kind of thing that often makes for good <a href="../../../../../2011/05/30/another-category-of-derived-data/">MarkLogic</a> use cases.)</li>
</ul>
<p>But actually, the reason we had the call was to review cases where MongoDB&#8217;s <strong>schemaless</strong> nature was significant. Examples of those included:</p>
<ul>
<li>A couple of top examples were of the kind &#8220;A bunch of apps, similar but not the same.&#8221; For MTV, it&#8217;s a single content management system for a bunch of websites. For Disney Playdom, it&#8217;s different schemas for every game.</li>
<li>For a wireless telco, the issue was a product catalog in which devices and service plans called for very different schemas, and which the telco felt had thus become unmanageable in Oracle.</li>
<li>For Craigslist, the issue wasn&#8217;t programming so much as performance &#8212; <a href="http://blog.zawodny.com/2010/04/27/i-want-a-new-data-store/">ALTER TABLE operations took months in MySQL</a>, and that&#8217;s not a typo, although I&#8217;ll confess to not understanding why this was the case.</li>
</ul>
<p>The 10gen guys went on to claim that schemalessness is helpful for incremental development in general, the point being that you don&#8217;t have a database-modification step. To some extent, changes can even be rolled back more easily than if you actually changed your schemas.</p>
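<p>To make the schemaless point concrete, here is a minimal Python sketch of a telco-style product catalog in which devices and service plans carry different fields. It uses plain dictionaries rather than MongoDB or any driver; the <code>find</code> helper, the field names, and the sample products are all made up for illustration.</p>

```python
# A toy "collection": a list of documents that need not share a schema.
# All names and values are hypothetical and exist only for illustration.
catalog = [
    {"_id": 1, "kind": "device", "name": "Example Phone",
     "screen_inches": 4.0, "os": "ExampleOS"},
    {"_id": 2, "kind": "plan", "name": "Example Talk 500",
     "monthly_minutes": 500, "contract_months": 24},
]


def find(collection, **criteria):
    """Return the documents whose fields equal every given criterion.

    A document that lacks a requested field simply does not match,
    so differently shaped documents can share one collection.
    """
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


# Incremental development: a new kind of plan brings new fields,
# and there is no separate schema-modification step before inserting it.
catalog.append({"_id": 3, "kind": "plan", "name": "Example Data 2GB",
                "monthly_gb": 2, "tethering": True})

plans = find(catalog, kind="plan")
tethering_plans = find(catalog, tethering=True)
```

<p>Older documents are untouched by the new fields, which is roughly why rolling such a change back can be easier than undoing a schema change.</p>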
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/27/mongodb-users-and-use-cases/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>McObject and eXtremeDB</title>
		<link>http://www.dbms2.com/2011/07/22/mcobject-extremedb/</link>
		<comments>http://www.dbms2.com/2011/07/22/mcobject-extremedb/#comments</comments>
		<pubDate>Fri, 22 Jul 2011 12:32:16 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[In-memory DBMS]]></category>
		<category><![CDATA[McObject]]></category>
		<category><![CDATA[Memory-centric data management]]></category>
		<category><![CDATA[Object]]></category>
		<category><![CDATA[Objectivity and Infinite Graph]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[solidDB]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=5004</guid>
		<description><![CDATA[I talked with McObject yesterday. McObject has two product lines, both of which are something like in-memory DBMS &#8212; eXtremeDB, which is the main one, and Perst. McObject has been around since at least 2003, probably has no venture capital, and probably has a very low double-digit number of employees.* *I could be wrong in [...]]]></description>
			<content:encoded><![CDATA[<p>I talked with McObject yesterday. McObject has two product lines, both of which are something like in-memory DBMS &#8212; eXtremeDB, which is the main one, and <a href="../../../../../2008/06/08/perst/">Perst</a>. McObject has been around since at least 2003, probably has no venture capital, and probably has a very low double-digit number of employees.*</p>
<p><em>*I could be wrong in those guesses; as small companies go, McObject is unusually prone to secrecy games.</em></p>
<p>As best I understand:</p>
<ul>
<li>eXtremeDB is something like an in-memory <a href="../../../../../2011/05/21/object-oriented-database-management-systems-oodbms/">object-oriented DBMS</a>, designed to be embeddable.</li>
<li>However, much as with Objectivity and other old-school OODBMS, eXtremeDB winds up being more of a toolkit with which to build DBMS than a full DBMS.</li>
<li>eXtremeDB has a few indexing schemes. The main one is good old B-trees. One customer wanted Patricia tries, so they&#8217;re in there. (Perhaps not coincidentally, solidDB relies on Patricia tries.) At least one wanted R-trees, so they&#8217;re in there too.</li>
<li>eXtremeDB has long had the option of persistent logs.</li>
<li>eXtremeDB newly has a hybrid memory-centric option, in which you can have more data in the database than fits into RAM.</li>
<li>eXtremeDB newly has multi-master two-phase-commit clustering.</li>
</ul>
<p>My guess three years ago that <a href="../../../../../2008/05/13/mcobject-extremedb-a-soliddb-alternative/">eXtremeDB might emerge as an alternative to solidDB</a> seems to have been borne out. McObject CEO Steve Graves says that the core of McObject&#8217;s business is OEMs, in sectors such as telecom equipment and defense/aerospace. That&#8217;s exactly solidDB&#8217;s traditional market, except that <a href="../../../../../2007/12/21/ibm-acquires-soliddb/">solidDB got acquired by IBM, which then deemphasized it</a>.</p>
<p>I&#8217;ve said before that if I were starting a SaaS effort &#8212; and it wasn&#8217;t just focused on analytics &#8212; <a href="../../../../../2011/05/21/object-oriented-database-management-systems-oodbms/">I&#8217;d look at using a memory-centric OODBMS</a>. Perhaps eXtremeDB is worth looking at in such scenarios.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/22/mcobject-extremedb/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Eight kinds of analytic database (Part 2)</title>
		<link>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/</link>
		<comments>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/#comments</comments>
		<pubDate>Tue, 05 Jul 2011 08:18:18 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Archiving and information preservation]]></category>
		<category><![CDATA[Business intelligence]]></category>
		<category><![CDATA[Buying processes]]></category>
		<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Complex event processing (CEP)]]></category>
		<category><![CDATA[Data mart outsourcing]]></category>
		<category><![CDATA[Data types]]></category>
		<category><![CDATA[Data warehouse appliances]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Database diversity]]></category>
		<category><![CDATA[EAI, EII, ETL, ELT, ETLT]]></category>
		<category><![CDATA[Greenplum]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[MOLAP]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Netezza]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Rainstor]]></category>
		<category><![CDATA[SAND Technology]]></category>
		<category><![CDATA[Scientific research]]></category>
		<category><![CDATA[SenSage]]></category>
		<category><![CDATA[Software as a Service (SaaS)]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4867</guid>
		<description><![CDATA[In Part 1 of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I&#8217;ll cover four more kinds of analytic database &#8212; even newer, for the most part, with a use case/product short list [...]]]></description>
			<content:encoded><![CDATA[<p>In <a href="http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-1/">Part 1</a> of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I&#8217;ll cover four more kinds of analytic database &#8212; even newer, for the most part, with a use case/product short list match that is even less clear.  <span id="more-4867"></span></p>
<p><strong><em>Bit bucket</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included: </em>Logs, other technical/external</li>
<li><em>Likely use styles:</em> Staging/ETL, investigative</li>
<li><em>Canonical example: </em>Log files in a Hadoop cluster</li>
<li><em>Stresses:</em> TCO, scale-out, transform/big-query performance, ETL functionality</li>
</ul>
<p>With the explosion of <a href="../../../../../2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a> has come the need for a place to put it all, sometimes called the <a href="../../../../../2011/06/04/dirty-data-stored-dirt-cheap/">big bit bucket</a>. This is like the investigative data mart for big databases, but more <a href="../../../../../2011/05/17/poly-structured-database/">poly-structured</a>. In some cases it is focused on data staging and transformation; but it can also be used for analysis in place.</p>
<p>The list of candidate technologies to run your bit bucket starts with Hadoop and Splunk.</p>
<p><strong><em>Archival data store</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included: </em>Operational, CDR (call detail record), security log</li>
<li><em>Likely use styles:</em> Archival, reporting (for compliance), possibly also investigative</li>
<li><em>Examples:</em> Any long-term detailed historical store</li>
<li><em>Stresses: </em>TCO, compression, scale-out, performance (if multi-use)</li>
</ul>
<p>Analytic DBMS vendors have been insulting each other with the claim &#8220;that&#8217;s just an archival data store,&#8221; dating back at least to the first time Greenplum was deployed on an underpowered Sun Thumper system. Perhaps only <a href="../../../../../2010/06/11/rainstor-update/">Rainstor</a> truly embraces the archival positioning, and I&#8217;ve become pretty dubious about their technical claims and their company alike.</p>
<p>Still, there&#8217;s a legitimate need for data stores, especially relational analytic DBMS, that:</p>
<ul>
<li>Store data cheaply, with high rates of compression.</li>
<li>Have decent performance if you do want to query the data.</li>
<li>May have archiving/compliance-specific features as well.</li>
</ul>
<p>Along with Rainstor, SAND and SenSage have at least partially targeted that use case. In addition, appliance vendors such as Teradata and Netezza try to have an archive-oriented product version in their lineups.</p>
<p><strong><em>Outsourced data mart</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> All</li>
<li><em>Likely use styles:</em> Traditional BI, investigative analytics, staging/ETL</li>
<li><em>Examples:</em> Advertising tracking, SaaS CRM</li>
<li><em>Stresses:</em> Performance, TCO, reliability, concurrency</li>
</ul>
<p>Much of what happens in analytic database management can also be outsourced. Some applications that run via SaaS (Software as a Service) are analytic. I&#8217;ve had three different clients whose main business is picking marketing targets in various vertical segments; others who wanted to add analytics to what were historically OLTP applications; and others yet who just offered online business intelligence. Also, if your fundamental business is gathering data and reselling it to a variety of user organizations, that&#8217;s an analytic data management challenge. The possibilities expand from there.</p>
<p>Data outsourcers are in the IT business, and so their IT development is &#8212; hopefully! &#8212; more serious and less politically encumbered than at many conventional enterprises. Thus, legacy systems and master data management issues are commonly less prevalent, or at least more aggressively disposed of. The same, up to a point, goes for vendor politics.* <a href="../../../../../2011/06/26/what-to-think-about-before-you-make-a-technology-decision/">Multitenancy</a> is commonly an issue, as is running in the cloud.</p>
<p><em>*Even so, there&#8217;s often That Guy who doesn&#8217;t want to migrate away from Oracle, no matter what.</em></p>
<p>Vertica gets the nod in a number of these cases; it&#8217;s cloud-friendly, and often the problem is naturally columnar. Other columnar products can be good choices too, with added brownie points for Infobright if the shop is MySQL-oriented anyway. Running Netezza or other appliances makes sense mainly if you&#8217;re pretty sure you want to keep operating your own data centers, but some data outsourcers are just fine with that assumption.</p>
<p><strong><em>Operational analytic(s) server</em></strong></p>
<ul>
<li><em>Kinds of data likely to be included:</em> Customer-centric, log, financial trade</li>
<li><em>Likely use styles:</em> Advanced operational analytics</li>
<li><em>Examples:</em>
<ul>
<li>Lower latency: Web or call-center personalization, anti-fraud</li>
<li>Higher latency: Customer profiling, Basel 3 risk analysis</li>
</ul>
</li>
<li><em>Stresses:</em> Performance, reliability, analytic functionality, perhaps concurrency</li>
</ul>
<p>Even with eight different choices, I need a &#8220;catch-all&#8221; category; this is it.</p>
<p>Suppose you want to do reasonably sophisticated analytics, then use the results in operations. This is the classical challenge in <a href="../../../../../2011/03/30/short-request-and-analytic-processing/">integrating short-request and analytic processing</a>. There are multiple ways to tackle it, embodying different trade-offs in cost, convenience, or analytic accuracy. If the platform on which you want to run your investigative analytics also has the reliability and concurrency appropriate for mission-critical operations, you&#8217;re set. Otherwise, you may want to pipe <a href="../../../../../2010/11/29/data-that-is-derived-augmented-enhanced-adjusted-or-cooked/">derived data</a> into a more &#8220;industrial-strength&#8221; DBMS, ideally the one that runs your operational apps anyway.</p>
<p>Another option is to integrate a limited amount of analytics immediately into your short-request processing system. For example, as bad as they are at the kinds of queries that require joins, NoSQL systems are often fast at simple aggregations. As MapReduce/NoSQL integrations mature, that option may not require pumping the data anywhere else for deeper analytics; even if it does, at least you&#8217;re starting out with the data in a convenient bit bucket.</p>
<p>Streaming/CEP-centric architectures could come into play as well. And it goes on from there. The possibilities in this last category are just too varied to generalize about.</p>
<p><em>So did I get them all? Or are there yet other analytic data management use cases that don&#8217;t fit into my eight categories?</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/07/05/eight-kinds-of-analytic-database-part-2/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Temporal data, time series, and imprecise predicates</title>
		<link>http://www.dbms2.com/2011/06/20/temporal-data-time-series-and-imprecise-predicates/</link>
		<comments>http://www.dbms2.com/2011/06/20/temporal-data-time-series-and-imprecise-predicates/#comments</comments>
		<pubDate>Mon, 20 Jun 2011 06:11:43 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Analytic technologies]]></category>
		<category><![CDATA[Data types]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Sybase]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Theory and architecture]]></category>
		<category><![CDATA[Vertica Systems]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4786</guid>
		<description><![CDATA[I&#8217;ve been confused about temporal data management for a while, because there are several different things going on. Date arithmetic. This of course has been around for a very long &#8212; er, for a very long time. Time-series-aware compression. This has been around for quite a while too. &#8220;Time travel&#8221;/snapshotting &#8212; preserving the state of [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been confused about temporal data management for a while, because there are several different things going on.</p>
<ul>
<li><strong>Date arithmetic.</strong> This of course has been around for a very long &#8212; er, for a very long time.</li>
<li><strong>Time-series-aware compression.</strong> This has been around for quite a while too.</li>
<li><strong>&#8220;Time travel&#8221;/snapshotting</strong> &#8212; preserving the state of the database at previous points in time. This is a matter of exposing (and not throwing away) the information you capture via MVCC (Multi-Version Concurrency Control) and/or append-only updates (as opposed to update-in-place). Those update strategies are increasingly popular for pretty much anything except update-intensive OLTP (OnLine Transaction Processing) DBMS, so time-travel/snapshotting is an achievable feature for most vendors.</li>
<li><strong>Bitemporal data access.</strong> This occurs when a fact has both a transaction timestamp and a separate validity duration. <a href="http://en.wikipedia.org/wiki/Temporal_database">A Wikipedia article</a> seems to cover the subject pretty well, and I touched on <a href="http://www.dbms2.com/2009/08/02/teradata-13-focuses-on-advanced-analytic-performance/">Teradata&#8217;s bitemporal plans</a> back in 2009.</li>
<li><strong>Time series SQL extensions.</strong> <a href="http://www.dbms2.com/2011/06/20/vertica-as-an-analytic-platform/">Vertica</a> explained its version of these to me a few days ago. I imagine Sybase IQ and other serious financial-trading market players have similar features.</li>
</ul>
<p>In essence, the point of time series/event series SQL functionality is to<strong> do SQL against incomplete, imprecise, or derived data.*</strong> <span id="more-4786"></span>For example, suppose in one time series events happen at times 3.00, 3.01, 3.03, and 3.05; in another time series events happen at times 3.00, 3.02, 3.03, 3.04, and 3.05; and you want to join the time series together. Then you can do an <strong>event series join</strong> &#8212; i.e., you can join on each of the times 3.00, 3.01, 3.02, 3.03, 3.04, and 3.05, using interpolated values to check WHERE conditions. Vertica says that the only interpolation methods anybody ever wants are &#8220;first value in the interval,&#8221; &#8220;last value in the interval,&#8221; and &#8220;linear average of the endpoint values&#8221; (I forget whether that&#8217;s weighted by time-distance from the endpoints, or is a simple arithmetic mean).</p>
<p><em>*This is a </em>limited <em>counterexample to my dictum that <a href="../../../../../2011/06/19/investigative-analytics-derived-data/">you should explicitly store derived data because it&#8217;s too much trouble to keep re-deriving it on the fly</a>.</em></p>
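<p>For concreteness, here is a minimal Python sketch of the event series join described above. It is not Vertica&#8217;s implementation or syntax. The function names <code>interpolate</code> and <code>event_series_join</code> and the sample values are made up for illustration, and times are stored as integer hundredths (300 means 3.00) to keep the arithmetic exact. Only two interpolation rules are shown: carrying the last observed value forward, and a linear rule that I am assuming is weighted by time-distance from the endpoints.</p>

```python
from bisect import bisect_right

# Two hypothetical event series as sorted (time, value) pairs.
# Times are integer hundredths, so 300 stands for 3.00.
A = [(300, 10.0), (301, 11.0), (303, 13.0), (305, 15.0)]
B = [(300, 100.0), (302, 102.0), (303, 103.0), (304, 104.0), (305, 105.0)]


def interpolate(series, t, method="last"):
    """Estimate the value of a sorted series at time t.

    method="last" carries the most recent observation forward;
    method="linear" weights the two neighbouring observations by
    their time-distance from t (an assumption, see the text above).
    """
    times = [ts for ts, _ in series]
    i = bisect_right(times, t)
    if i == 0:
        return None  # no observation at or before t
    t0, v0 = series[i - 1]
    if method == "last" or t0 == t or i == len(series):
        return v0
    t1, v1 = series[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def event_series_join(a, b, method="last"):
    """Join two series on the union of their timestamps,
    filling the gaps in each side with interpolated values."""
    all_times = sorted({t for t, _ in a} | {t for t, _ in b})
    return [
        (t, interpolate(a, t, method), interpolate(b, t, method))
        for t in all_times
    ]


joined_last = event_series_join(A, B, "last")
joined_linear = event_series_join(A, B, "linear")
```

<p>With this sample data, time 302 appears only in the second series, so the first series contributes 11.0 under the last-value rule and 12.0 under the linear rule. A WHERE condition would then be checked against those filled-in values.</p>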
<p>Also cool is the &#8220;CONDITIONAL_TRUE_EVENT&#8221; syntax Vertica has had since Release 4.0, which generalized SQL 99 windowing; you now can look at all the rows that meet a specific criterion &#8212; via an arbitrary expression &#8212; rather than just being restricted to a row count or strict time duration. Vertica says it&#8217;s gone further in the direction of event series pattern matching in Vertica 5.0; I didn&#8217;t grasp the details, but it sounded philosophically akin to <a href="../../../../../2009/02/10/aster-data-npath/">Aster Data&#8217;s nPath</a>, albeit without the arbitrary-language procedural extensibility.</p>
<p>Finally, Vertica also gave me an imprecise-SQL example that has little to do with time series or other event series. Vertica has a concept of &#8220;range join,&#8221; implemented so that telecom firms can save space by storing partial IP addresses. I&#8217;ve noted before that while we should retain all human-generated data, <a href="../../../../../2010/12/30/examples-and-definition-of-machine-generated-data/">it will never be practical to retain all machine-generated data</a> (because its volume will keep going up based on the same technological factors that keep storage cost per unit volume going down). This sounds like one interesting (if specialized) approach to storing machine-generated data in summarized form.</p>
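<p>I don&#8217;t have the details of how Vertica does this, so the following Python sketch is only one way to picture a range join. It assumes the partial IP addresses are stored as non-overlapping CIDR blocks, and that each event row carries a full address to be matched against them. The names <code>build_index</code> and <code>range_join</code>, the blocks, and the labels are hypothetical.</p>

```python
import ipaddress
from bisect import bisect_right

# Hypothetical table of partial addresses, stored as CIDR blocks.
# The sketch assumes the blocks do not overlap.
BLOCKS = [
    ("10.1.0.0/16", "subscriber-block-a"),
    ("10.2.0.0/16", "subscriber-block-b"),
    ("192.168.5.0/24", "lab"),
]


def build_index(blocks):
    """Turn each block into a (low, high, label) row, sorted by low address."""
    rows = []
    for cidr, label in blocks:
        net = ipaddress.ip_network(cidr)
        rows.append((int(net.network_address), int(net.broadcast_address), label))
    rows.sort()
    return rows


def range_join(events, index):
    """Inner-join (ip, payload) events to the block whose range contains ip."""
    lows = [low for low, _, _ in index]
    matched = []
    for ip, payload in events:
        addr = int(ipaddress.ip_address(ip))
        i = bisect_right(lows, addr) - 1  # last block starting at or before addr
        if i >= 0 and index[i][1] >= addr:
            matched.append((ip, payload, index[i][2]))
    return matched


EVENTS = [("10.1.2.3", "call-1"), ("10.3.0.1", "call-2"), ("192.168.5.77", "call-3")]
result = range_join(EVENTS, build_index(BLOCKS))
```

<p>Here the second event falls outside every stored block and drops out of the result, as it would in an inner join. The space saving comes from keeping one row per block rather than one row per full address.</p>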
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/06/20/temporal-data-time-series-and-imprecise-predicates/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Columnar DBMS vendor customer metrics</title>
		<link>http://www.dbms2.com/2011/06/20/columnar-dbms-vendor-customer-metrics/</link>
		<comments>http://www.dbms2.com/2011/06/20/columnar-dbms-vendor-customer-metrics/#comments</comments>
		<pubDate>Mon, 20 Jun 2011 05:41:54 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Columnar database management]]></category>
		<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Games and virtual worlds]]></category>
		<category><![CDATA[Infobright]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[ParAccel]]></category>
		<category><![CDATA[Petabyte-scale data management]]></category>
		<category><![CDATA[SAND Technology]]></category>
		<category><![CDATA[Sybase]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Vertica Systems]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4742</guid>
		<description><![CDATA[Last April, I asked some columnar DBMS vendors to share customer metrics. They answered, but it took until now to iron out a couple of details. Overall, the answers are pretty impressive.  Sybase said that Sybase IQ had &#62; 2000 direct customers and &#62;500 indirect customers (i.e., end customers of OEMs). That&#8217;s counting by customers; [...]]]></description>
			<content:encoded><![CDATA[<p>Last April, I asked some columnar DBMS vendors to share customer metrics. They answered, but it took until now to iron out a couple of details. Overall, the answers are pretty impressive.  <span id="more-4742"></span></p>
<p>Sybase said that <strong>Sybase IQ</strong> had <strong>&gt;2000 direct customers</strong> and <strong>&gt;500 indirect customers</strong> (i.e., end customers of OEMs). That&#8217;s counting by customers; I know from prior discussions that Sybase IQ is running at close to two installations per customer. I also believe that Sybase counts different divisions of the same large enterprise as separate customers.</p>
<p><strong>Vertica</strong> cited a figure of <strong>500 customers</strong> as of April (end Q1?), which is close to <strong>600</strong> now, about <strong>40% or a little more direct.</strong> The difference between this and a <a href="http://www.dbms2.com/2011/02/14/now-we-know-why-vertica-has-been-so-weirdly-evasive/">2010 year-end figure of 328</a> is not only new sales, but also slow reporting by OEMs.  One cool figure &#8212; a single OEM reported 82 end sales in a single (quarterly?) report. And a number of those direct customers are substantial; Vertica&#8217;s <a href="http://www.vertica.com/customers/">customer logo</a> page features lots of telcos, lots of internet companies, and the national operation of Blue Cross/Blue Shield.</p>
<p><em>Pay no attention to small inconsistencies in the number of Vertica direct customers (250 at year-end, no more than that now); Colin Mahony just estimates these numbers for me from memory, and minor inaccuracies are quite excusable.</em></p>
<p>Even cooler &#8212; <strong>Vertica </strong>reports <strong>7 customers with a petabyte or more of user data each.</strong> About 5 of the 7 are obvious-suspect big-name firms; but unsurprisingly, those big names are NDA. I did secure permission to say that there are 2 telecom companies, a mobile gaming vendor, another internet company, and 3 financial services outfits of various kinds.</p>
<p><strong>SAND Technology </strong>reported <strong>&gt;600 total customers,</strong> including<strong> &gt;100 direct. </strong>Since SAND has been around since the 1990s, those aren&#8217;t great average annual figures, but they&#8217;re probably more than many people (including me) thought.</p>
<p><strong>Infobright</strong> reported around <strong>200 total paying customers, 130 direct.</strong> There are surely a lot more users of open source Infobright, but precise numbers are of course hard to come by.</p>
<p>If I asked <strong>ParAccel</strong> in the April go-round, I&#8217;ve misplaced their answer, but back in October the figure was &gt;30 customers, 2 of them over 100 terabytes. I&#8217;ve seen published figures of 40+ for ParAccel since.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/06/20/columnar-dbms-vendor-customer-metrics/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Infobright 4.0</title>
		<link>http://www.dbms2.com/2011/06/14/infobright-4-0/</link>
		<comments>http://www.dbms2.com/2011/06/14/infobright-4-0/#comments</comments>
		<pubDate>Tue, 14 Jun 2011 08:46:24 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Data warehousing]]></category>
		<category><![CDATA[Database compression]]></category>
		<category><![CDATA[Infobright]]></category>
		<category><![CDATA[Investment research and trading]]></category>
		<category><![CDATA[Log analysis]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4685</guid>
		<description><![CDATA[Infobright is announcing its 4.0 release, with imminent availability. In marketing and product alike, Infobright is betting the farm on machine-generated data. This hasn&#8217;t been Infobright&#8217;s strategy from the getgo, but it is these days, with pretty good focus and commitment. While some fraction of Infobright&#8217;s customer base is in the Sybase-IQ-like data mart market [...]]]></description>
			<content:encoded><![CDATA[<p>Infobright is announcing its 4.0 release, with imminent availability. In marketing and product alike, Infobright is betting the farm on <a href="../../../../../2010/12/30/examples-and-definition-of-machine-generated-data/">machine-generated data</a>. This hasn&#8217;t been Infobright&#8217;s strategy from the get-go, but it is these days, with pretty good <a href="http://www.strategicmessaging.com/extending-the-layered-messaging-model/2011/06/13/">focus and commitment</a>. While some fraction of Infobright&#8217;s customer base is in the Sybase-IQ-like data mart market &#8212; and indeed Infobright put out <a href="http://www.prnewswire.com/news-releases/bell-helicopter-selects-zend-and-infobright-to-improve-enterprise-reporting-application-for-better-business-intelligence-123458269.html">a customer-win press release</a> in that market a few days ago &#8212; Infobright&#8217;s current customer targets seem to be mainly:</p>
<ul>
<li>Web companies, many of which are already MySQL users.</li>
<li>Telecommunication and similar log data, especially in OEM relationships.</li>
<li>Trading/financial services, especially at mid-tier companies.</li>
</ul>
<p>Key aspects of Infobright 4.0 include:  <span id="more-4685"></span></p>
<ul>
<li>&#8220;Rough Query,&#8221; which lets you get approximate query results &gt;10X faster than you could get precise ones, which is a good thing for iterative <a href="../../../../../2011/03/03/investigative-analytics/">investigative analytics</a>.</li>
<li>The start of a plan &#8212; &#8220;DomainExpert&#8221; &#8212; to compress and otherwise optimize data in specific, commonly machine-generated patterns, such as URLs or CDRs (call detail records).</li>
<li>&#8220;Distributed Load Manager&#8221; &#8212; i.e., load nodes that are separate from (and more parallelized than) query nodes.</li>
<li>A Hadoop connector.</li>
<li>Lots of cleanup and <a href="../../../../../2009/08/21/bottleneck-whack-a-mole/">Bottleneck Whack-A-Mole</a>, although I haven&#8217;t paid close attention as to which parts of that are truly new, and which were already handled in recent <a href="../../../../../2010/06/27/infobright-release-3-4/">Infobright point releases</a>.</li>
</ul>
<p>Items on that list focused on the machine-generated data market include:</p>
<ul>
<li>DomainExpert &#8212; obviously.</li>
<li>The Hadoop connector &#8212; also obviously.</li>
<li>The Distributed Load Manager &#8212; why would you need such load speeds unless the data is flowing in from machines?</li>
</ul>
<p>To understand Infobright Rough Query, recall the essence of <a href="../../../../../2007/10/22/infobright-brighthouse-mysql/">Infobright&#8217;s architecture</a>:</p>
<blockquote><p>Infobright’s core technical idea is to chop columns of data into 64K chunks, called <em>data packs,</em> and then store concise information about what’s in the packs. The more basic information is stored in <em>data pack nodes,*</em> one per data pack. If you’re familiar with Netezza <a href="../../../../../2006/09/20/netezza-vs-conventional-data-warehousing-rdbms/">zone maps</a>, data pack nodes sound like zone maps on steroids. They store maximum values, minimum values, and (where meaningful) aggregates, and also encode information as to which intervals between the min and max values do or don’t contain actual data values.</p></blockquote>
<p>I.e., a concise, imprecise representation of the database is always kept in RAM, in something Infobright calls the &#8220;Knowledge Grid.&#8221; Rough Query estimates query results based solely on the information in the Knowledge Grid &#8212; i.e., <strong>Rough Query always executes against information that&#8217;s already in RAM.</strong></p>
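<p>Here is a minimal Python sketch of how metadata-only estimation could work, using just a min, a max, and a row count per pack. It illustrates the principle only. The <code>PackNode</code> structure, the tiny pack size, and the choice to return lower and upper bounds are my assumptions, not a description of what Infobright actually returns.</p>

```python
from collections import namedtuple

# Hypothetical stand-in for a data pack node: min, max, and row count of one pack.
PackNode = namedtuple("PackNode", "lo hi count")

def build_nodes(values, pack_size):
    """Summarise a column as one (min, max, row count) node per pack."""
    nodes = []
    for i in range(0, len(values), pack_size):
        pack = values[i:i + pack_size]
        nodes.append(PackNode(min(pack), max(pack), len(pack)))
    return nodes

def rough_count(nodes, lo, hi):
    """Bound COUNT(*) WHERE lo <= value <= hi using pack metadata only."""
    lower = upper = 0
    for n in nodes:
        if n.hi < lo or n.lo > hi:        # pack cannot contain any match
            continue
        upper += n.count                  # pack might contain matches
        if lo <= n.lo and n.hi <= hi:     # every row in the pack matches
            lower += n.count
    return lower, upper

column = list(range(100))                 # toy column, 10 rows per pack
nodes = build_nodes(column, pack_size=10)
bounds = rough_count(nodes, 15, 42)       # exact answer would be 28
```

<p>No pack is ever decompressed; only the "suspect" packs at the edges of the range contribute uncertainty, which is why the answer can come back so quickly.</p>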
<p>To me, Rough Query is the most impressive part of the Infobright 4.0 announcement. DomainExpert sounds like it will be somewhat better than straightforward prefix/suffix compression, but Infobright hasn&#8217;t yet convinced me that the difference is substantial. Distributed Load Manager is indeed important, but only because Infobright doesn&#8217;t have a shared-nothing MPP (Massively Parallel Processing) option at this time. And the rest is mainly catch-up toward Infobright&#8217;s larger and more expensive peers.</p>
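<p>For reference, the &#8220;straightforward prefix compression&#8221; baseline can be sketched as front coding over sorted strings. This is a generic textbook technique shown under my own assumptions (sample URLs, function names); it is not DomainExpert, and it is not Infobright&#8217;s existing compression code.</p>

```python
import os

def front_encode(sorted_strings):
    """Store each string as (length of prefix shared with the previous string, remaining suffix)."""
    out, prev = [], ""
    for s in sorted_strings:
        shared = len(os.path.commonprefix([prev, s]))
        out.append((shared, s[shared:]))
        prev = s
    return out

def front_decode(encoded):
    """Rebuild the original strings from (shared_length, suffix) pairs."""
    out, prev = [], ""
    for shared, suffix in encoded:
        prev = prev[:shared] + suffix
        out.append(prev)
    return out

# Hypothetical, already-sorted URLs.
urls = [
    "http://example.com/a/1",
    "http://example.com/a/2",
    "http://example.com/b/1",
]
encoded = front_encode(urls)
stored_chars = sum(len(suffix) for _, suffix in encoded)
```

<p>The open question is how much a pattern-aware scheme gains over a baseline like this on real URL or CDR data.</p>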
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/06/14/infobright-4-0/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Application areas for SAS HPA</title>
		<link>http://www.dbms2.com/2011/04/21/application-areas-for-sas-hpa/</link>
		<comments>http://www.dbms2.com/2011/04/21/application-areas-for-sas-hpa/#comments</comments>
		<pubDate>Thu, 21 Apr 2011 08:24:17 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Application areas]]></category>
		<category><![CDATA[Liberty and privacy]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[SAS Institute]]></category>
		<category><![CDATA[Telecommunications]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=4342</guid>
		<description><![CDATA[When I talked with SAS about its forthcoming in-memory parallel SAS HPA offering, we talked briefly about application areas. The three SAS cited were: Consumer financial services. The idea here is to combine information about customers&#8217; use of all kinds of services &#8212; banking, credit cards, loans, etc. SAS believes this is both for marketing [...]]]></description>
			<content:encoded><![CDATA[<p>When I talked with SAS about <a href="http://www.dbms2.com/2011/04/21/sas-hpa-does-make-sense-after-all/">its forthcoming in-memory parallel SAS HPA offering</a>, we talked briefly about application areas. The three SAS cited were:</p>
<ul>
<li><strong>Consumer financial services.</strong> The idea here is to combine information about customers&#8217; use of all kinds of services &#8212; banking, credit cards, loans, etc. SAS believes this is both for marketing and risk analysis purposes.</li>
<li><strong>Insurance.</strong> We didn&#8217;t go into detail.</li>
<li><strong>Mobile communications.</strong> SAS&#8217; customers aren&#8217;t giving it details, but they&#8217;re excited about geocoding/geospatial data.</li>
</ul>
<p>Meanwhile, in another interview I heard about, SAS emphasized <strong>retailers.</strong> Indeed, that&#8217;s what spawned my recent post about <a href="http://www.dbms2.com/2011/04/06/so-can-logistic-regression-be-parallelized-or-not/">logistic regression</a>.</p>
<p>The mobile communications one is a bit scary. Your cell phone &#8212; and hence your cellular company &#8212; <a href="http://petewarden.github.com/iPhoneTracker/">know where you are</a>, pretty much from moment to moment. Even without advanced analytic technology applied to it, that&#8217;s a pretty direct privacy threat. Throw in some analytics, and your cell company might know, for example, who you hang out with (in person), where you shop, and how those things predict your future behavior. And so the government &#8212; or just your employer &#8212; might know those things too.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/04/21/application-areas-for-sas-hpa/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Cassandra company DataStax (formerly Riptano) is on track</title>
		<link>http://www.dbms2.com/2011/02/01/datastax-opscenter-cassandra/</link>
		<comments>http://www.dbms2.com/2011/02/01/datastax-opscenter-cassandra/#comments</comments>
		<pubDate>Tue, 01 Feb 2011 09:26:36 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[DataStax]]></category>
		<category><![CDATA[Market share and customer counts]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Specific users]]></category>
		<category><![CDATA[Telecommunications]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=3706</guid>
		<description><![CDATA[Riptano, the Cassandra company, has changed its name to DataStax. DataStax has opened headquarters in Burlingame and hired some database-experienced folks – notably Ben Werther from Greenplum and Michael Weir from ParAccel, with Zenobia Godschalk (who worked with Aster Data) somewhere in the outside PR mix. Other than that, what&#8217;s new at DataStax is pretty [...]]]></description>
			<content:encoded><![CDATA[<p>Riptano, the Cassandra company, has changed its name to DataStax. DataStax has opened headquarters in Burlingame and hired some database-experienced folks – notably Ben Werther from Greenplum and Michael Weir from ParAccel, with Zenobia Godschalk (who worked with Aster Data) somewhere in the outside PR mix. Other than that, what&#8217;s new at DataStax is pretty much what could have been expected based on <a href="../../../../../2010/07/06/riptano-and-cassandra-adoption/">what DataStax folks said last spring</a>.</p>
<p>Most notably, DataStax is introducing a software offering, whose full name is DataStax OpsCenter for Apache Cassandra. DataStax OpsCenter for Apache Cassandra seems to be, in essence, a monitoring tool for Cassandra clusters, with a bit of capacity planning bundled in. (If there are any outright operations parts to DataStax OpsCenter, they got overlooked in our conversation.)* <span id="more-3706"></span>OpsCenter has been in beta at a few places, with another beta version rolled out recently.</p>
<p><em>*Yeah, DataStax OpsCenter Release 1 sounds pretty boring. But it&#8217;s apt to be useful even so. And cooler stuff should come down the pike later on.</em></p>
<p>There will of course be a free-download version of DataStax OpsCenter, entirely uncrippled; you&#8217;re just not allowed  to use free-download DataStax OpsCenter with production applications. Production users of DataStax OpsCenter will need subscriptions. Much like Cloudera, DataStax is bundling product and support subscriptions, so that you can&#8217;t buy one without the other. The current Gold/Silver/Bronze trichotomy will be slimmed down to Mission-Critical/Premier, and you&#8217;ll be allowed to have different levels for different application clusters.</p>
<p>Finally, a few customer notes:</p>
<ul>
<li>DataStax has &gt;50 subscription support customers.</li>
<li>One DataStax customer has 400 Cassandra nodes.</li>
<li>DataStax&#8217;s major industry segments are web (of course), government/intelligence, and telecom.</li>
<li>Separately – I&#8217;m not sure why this is separate – DataStax thinks the next market it will penetrate is real-time analytics. That means online counts or other aggregations, although presumably not at a Skytide level of sophistication.</li>
<li><a href="http://techblog.netflix.com/2011/01/nosql-at-netflix.html">Netflix has nice things to say about HBase and Cassandra alike</a>.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/02/01/datastax-opscenter-cassandra/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The technology of privacy threats</title>
		<link>http://www.dbms2.com/2011/01/11/the-technology-of-privacy-threats/</link>
		<comments>http://www.dbms2.com/2011/01/11/the-technology-of-privacy-threats/#comments</comments>
		<pubDate>Tue, 11 Jan 2011 15:15:21 +0000</pubDate>
		<dc:creator>Curt Monash</dc:creator>
				<category><![CDATA[Facebook]]></category>
		<category><![CDATA[Liberty and privacy]]></category>
		<category><![CDATA[Predictive modeling and advanced analytics]]></category>
		<category><![CDATA[Telecommunications]]></category>
		<category><![CDATA[Web analytics]]></category>

		<guid isPermaLink="false">http://www.dbms2.com/?p=3542</guid>
		<description><![CDATA[This post is the second of a series. The first one was an overview of privacy dangers, replete with specific examples of kinds of data that are stored for good reasons, but can also be repurposed for more questionable uses. More on this subject may be found in my August, 2010 post Big Data is [...]]]></description>
			<content:encoded><![CDATA[<p><em>This post is the second of a series. The first one was <a href="http://www.dbms2.com/2011/01/10/privacy-dangers-an-overview/">an overview of privacy dangers</a>, replete with specific examples of kinds of data that are stored for good reasons, but can also be repurposed for more questionable uses. More on this subject may be found in my August, 2010 post <a href="http://www.dbms2.com/2010/08/11/big-data-is-watching-you/">Big Data is Watching You!</a><br />
</em></p>
<p>There are two technology trends driving electronic privacy threats. Taken together, these trends raise scenarios such as the following:</p>
<ul>
<li>Your web surfing behavior indicates you&#8217;re a sports car buff, and you further like to look at pictures of scantily-clad young women. A number of your Facebook friends are single women. As a result, you&#8217;re deemed a risk to have a mid-life crisis and divorce your wife, thus increasing the interest rate you have to pay when refinancing your house.</li>
<li>Your cell phone GPS indicates that you drive everywhere, instead of walking. There is no evidence of you pursuing fitness activities, but forum posting activity suggests you&#8217;re highly interested in several TV series. Your credit card bills show that your taste in restaurant food tends to the fatty. Your online photos make you look fairly obese, and a couple have ashtrays in them. As a result, you&#8217;re judged a high risk of heart attack, and your medical insurance rates are jacked up accordingly.</li>
<li>You did actually have that mid-life crisis and get divorced. At the child-custody hearing, your ex-spouse&#8217;s lawyer quotes a study showing that football-loving upper income Republicans are 27% more likely to beat their children than yoga-class-attending moderate Democrats, and the probability goes up another 8% if they ever bought a jersey featuring a defensive lineman. What&#8217;s more, several of the more influential people in your network of friends also fit angry-male patterns, taking the probability of abuse up another 13%. Because of the sound statistics behind such analyses, the judge listens.</li>
</ul>
<p>Not all these stories are quite possible today, but they aren&#8217;t far off either.</p>
<p><span id="more-3542"></span>One of the supporting trends, pretty obvious, is that there <strong>is a lot more electronic information than there used to be.</strong> Indeed:</p>
<ul>
<li>Sufficient information exists to provide a <strong>very detailed picture of our activities.</strong></li>
<li>Much of it is recorded for <strong>very good and beneficial reasons.</strong> We wouldn&#8217;t want that part to stop.</li>
<li>This information is <strong>inevitably available to government.</strong></li>
</ul>
<p>Here&#8217;s what I mean by the inevitability claim. Whether or not you think anti-terrorism concerns are overblown, as a practical matter your fellow voters* will allow a broad range of governmental information access. Besides, just the widely-available credit card and similar commercial data is enough to provide a fairly detailed picture of what you&#8217;re up to. In most countries, anti-pornography, anti-file-sharing, and/or general civilian law enforcement efforts serve to strengthen the point further.</p>
<p><em>*If you live in a country too unfree for voters to much matter, then it is surely also the case that governmental information has few practical limits.</em></p>
<p>Examples of information being tracked (more particulars were covered in <a href="http://www.dbms2.com/2011/01/10/privacy-dangers-an-overview/">the first post of this series</a>):</p>
<ul>
<li>Almost everything we buy is recorded, via credit card transactions, point-of-sale data, and/or website transaction records. This data is summarized in files covering 100s of millions of individuals, with 1000s of fields per person. Those files can be used for a broad variety of business or law enforcement purposes.</li>
<li>That data gives a great picture of what we eat, where we commute or travel, what we pay attention to, and so on.</li>
<li>All our other financial information also passes through computer systems, such as at banks.</li>
<li>Increasingly, our physical movements are tracked more directly, via cell phones (our own), police cameras, and the like.</li>
<li>Other than face-to-face conversations, almost all our communications are electronic. Even social media non-adopters rely heavily on telephones, email, and the like.</li>
<li>Increasingly, our reading and viewing entertainment choices are electronically recorded as well.</li>
</ul>
<p><strong>Most of that data is available to law enforcement departments. </strong>Much of it is available to<strong> commercial companies </strong>as well.</p>
<p>And these vast amounts of data will hardly go to waste. The second major technological trend in play is that <strong>the data can be much more effectively analyzed </strong>than before. New kinds of, and greater effectiveness in, <strong>analytic profiling create whole new levels of exposure</strong> (using the word &#8220;exposure&#8221; in its most literal sense), in at least three ways:</p>
<ul>
<li><strong>Relationship profiling.*</strong> <a href="../../../../../2009/08/21/social-network-analysis-aka-relationship-analytics/">Relationship analytics</a> technology has been around for a while. When it&#8217;s used to find bad guys (terrorists, fraudsters, etc.), that&#8217;s one thing. But some of the marketing uses are spookier. Marketing-like uses applied back to governmental surveillance could be spookier yet.</li>
<li><strong>Propensity profiling.* </strong>A huge fraction of what happens in big data analytics is figuring out what you&#8217;re likely to buy, vote for, look at, click on, react to, or think. Marketers getting that right can be a bit creepy. So can marketers getting it wrong. Governments doing the same thing could be much creepier yet.</li>
<li><strong>De-anonymization</strong>.* You may think you can be anonymous online, but you really can&#8217;t. Also, it&#8217;s getting ever harder to keep your roles or activities online separate from each other.</li>
</ul>
<p><em>* I just coined the terms &#8220;relationship profiling&#8221; and &#8220;propensity profiling.&#8221; &#8220;De-anonymization,&#8221; however, has been in use for a while.</em></p>
<p>Classical <strong>relationship profiling</strong> questions include assessing who has a close relationship with whom, who influences whom, who influences lots of people, etc. The most obvious data to infer this from is communication &#8212; who called whom, how long they talked, who they called next, what time of day this all happened, and so on. Anti-terrorist uses are obvious. A major marketing use is telcos &#8212; who of course have this data &#8212; deciding who to offer their best deals to, by trying to identify who influences the most other customers. These calculations of course involve comparing lots of data, mainly about people who are NOT targets of terrorist investigation or preferential telephone service pricing.</p>
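<p>To show how little machinery the basic version of this needs, here is a minimal Python sketch over a handful of made-up call records. The names, record layout, and both measures (distinct contacts as a crude proxy for influence, total talk time as a crude proxy for closeness) are illustrative assumptions; real relationship analytics is far more elaborate.</p>

```python
from collections import defaultdict

# Hypothetical call detail records: (caller, callee, seconds).
calls = [
    ("ann", "bob", 300), ("ann", "cat", 120), ("bob", "ann", 60),
    ("dan", "ann", 45), ("cat", "dan", 600), ("ann", "dan", 30),
]

def contact_profile(records):
    """Return (distinct contacts per person, total talk time per unordered pair)."""
    contacts = defaultdict(set)
    talk_time = defaultdict(int)
    for caller, callee, seconds in records:
        contacts[caller].add(callee)
        contacts[callee].add(caller)
        pair = tuple(sorted((caller, callee)))
        talk_time[pair] += seconds
    return contacts, talk_time

contacts, talk_time = contact_profile(calls)

# Crude "who influences lots of people" proxy: most distinct contacts.
most_connected = max(contacts, key=lambda p: len(contacts[p]))
# Crude "close relationship" proxy: the pair with the most total talk time.
closest_pair = max(talk_time, key=talk_time.get)
```

<p>Note that every record contributes to the result, including those of people nobody is actually interested in.</p>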
<p>Much of Facebook&#8217;s $50 billion valuation hinges on the assumption it can do similar things based on the &#8220;social graph&#8221; it infers from informal communication among friends. To date that assumption has been <a href="http://www.dbms2.com/2010/06/08/profile-of-revealed-preferences/">questionable</a>, but we&#8217;re still in the very early days. Meanwhile, cruder methods of analyzing social influence are used. But the trend is clear &#8212; <strong>marketers want to use technology to identify social leaders, influence them however they can, and hope that the rest of us follow along baaing.</strong> Up to a point, that&#8217;s actually OK &#8212; learning things from our friends and acquaintances is an important and pleasant part of living in a society. And political campaigners have been doing it for generations, in the most low-tech of fashions. Still, it&#8217;s one thing for such targeting of leaders to be transparent; if done surreptitiously, it suddenly starts to feel a lot more sinister.</p>
<p>For years, <strong>propensity profiling</strong> has been an area of huge investment and technological progress. It&#8217;s the central application of <a href="http://www.dbms2.com/category/analytics-technologies/data-warehouse/">big data analytics</a>, and the heart of the business for many companies I write about, or that are my clients. Credit files, web logs, other marketing responses, census information, and other data are combined to infer:</p>
<ul>
<li>Your income, household composition, age, race, education, and other basic demographics.</li>
<li>Your buying, voting, reading, viewing, and other consuming interests in minute psychographic detail.</li>
<li>Your feelings about particular companies and brands, your propensity to become or stop being their customer, and what kinds of advertisements or offers it would take to influence you.</li>
<li>Your status as a credit risk.</li>
<li>The chance you are committing or will commit fraud.</li>
</ul>
<p>This has been going on since at least the 1990s, especially in service industries with &#8220;loyalty card&#8221; kinds of programs, such as retail or travel/leisure. In the credit case it&#8217;s been going on longer than that. But new data sources, processed by new analytic technologies, have brought the practice to a vastly greater height.</p>
<p>Finally &#8212; in case you care about being anonymous online, you&#8217;re running out of luck. <strong>De-anonymization </strong>analytics are getting too good. The <a href="https://www.eff.org/deeplinks/2009/09/what-information-personally-identifiable">Electronic Frontier Foundation&#8217;s de-anonymization overview</a> in 2009 was one of many articles pointing out that it often was possible to attach a specific name to online activities that in theory don&#8217;t track personally identifiable information. Meanwhile, at a talk I attended in May, 2010, <a href="../../../../../2010/04/18/washington-dc-may-2010-big-data-summi/">comScore</a> spoke of its successful efforts to tie various anonymous online activities, such as visits to different websites, to each other. And after I entered &#8220;usinger.com&#8221; into my browser address bar, I started seeing ads for Usinger sausages at a variety of prominent websites.</p>
<p>I&#8217;m not sure how much of a privacy threat de-anonymization technology is in and of itself, but it certainly provides support to both relationship and propensity profiling.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dbms2.com/2011/01/11/the-technology-of-privacy-threats/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>

