Discussion of how data warehousing and analytic technologies are applied to clickstream analysis and other web analytics challenges. Related subjects include:
Edit: Please see the comment thread below for updates. Please also see a follow-on post about how the surveillance data is actually used.
US government surveillance has exploded into public consciousness since last Thursday. With one major exception, the news has just confirmed what was already thought or known. So where do we stand?
My views about domestic data collection start:
- I’ve long believed that the Feds — specifically the NSA (National Security Agency) — are storing metadata/traffic data on every telephone call and email in the US. The recent news, for example Senator Feinstein’s responses to the Verizon disclosure, just confirms it. That the Feds sometimes claim this has to be “foreign” data or they won’t look at it hardly undermines my opinion.
- Even private enterprises can more or less straightforwardly buy information about every credit card purchase we make. So of course the Feds can get that as well, as the Wall Street Journal seems to have noticed. More generally, I’d assume the Feds have all the financial data they want, via the IRS if nothing else.
- Similarly, many kinds of social media postings are aggregated for anybody to purchase, or can be scraped by anybody who invests in the equipment and bandwidth. Attensity’s service is just one example.
- I’m guessing that web use data (http requests, search terms, etc.) is not yet routinely harvested by the US government.* Ditto deanonymization of same. I guess that way basically because I’ve heard few rumblings to the contrary. Further, the consumer psychographic profiles that are so valuable to online retailers might be of little help to national security analysts anyway.
- Video surveillance seems likely to grow, from fixed cameras perhaps to drones; note for example the various officials who called for more public cameras after that Boston Marathon bombing. But for the present discussion, that’s of lesser concern to me, simply because it’s done less secretively than other kinds of surveillance. If there’s a camera that can see us, often we can see it too.
*Recall that these comments are US-specific. Data retention legislation has been proposed or passed in multiple countries to require recording of, among other things, all URL requests, with the stated goal of fighting either digital piracy or child pornography.
As for foreign data: Read more
|Categories: Hadoop, HP and Neoview, Petabyte-scale data management, Pricing, Surveillance and privacy, Telecommunications, Text, Vertica Systems, Web analytics||10 Comments|
I talk with a lot of companies, and repeatedly hear some of the same application themes. This post is my attempt to collect some of those ideas in one place.
1. So far, the buzzword of the year is “real-time analytics”, generally with “operational” or “big data” included as well. I hear variants of that positioning from NewSQL vendors (e.g. MemSQL), NoSQL vendors (e.g. AeroSpike), BI stack vendors (e.g. Platfora), application-stack vendors (e.g. WibiData), log analysis vendors (led by Splunk), data management vendors (e.g. Cloudera), and of course the CEP industry.
Yeah, yeah, I know — not all the named companies are in exactly the right market category. But that’s hard to avoid.
Why this gold rush? On the demand side, there’s a real or imagined need for speed. On the supply side, I’d say:
- There are vast numbers of companies offering data-management-related technology. They need ways to differentiate.
- Doing analytics at short-request speeds is an obvious data-management-related challenge, and not yet comprehensively addressed.
2. More generally, most of the applications I hear about are analytic, or have a strong analytic aspect. The three biggest areas — and these overlap — are:
- Customer interaction
- Network and sensor monitoring
- Game and mobile application back-ends
Also arising fairly frequently are:
- Algorithmic trading
- Risk measurement
- Law enforcement/national security
- Stakeholder-facing analytics
I’m hearing less about quality, defect tracking, and equipment maintenance than I used to, but those application areas have anyway been ebbing and flowing for decades.
In typical debates, the extremists on both sides are wrong. “SQL vs. NoSQL” is an example of that rule. For many traditional categories of database or application, it is reasonable to say:
- Relational databases are usually still a good default assumption …
- … but increasingly often, the default should be overridden with a more useful alternative.
Reasons to abandon SQL in any given area usually start:
- Creating a traditional relational schema is possible …
- … but it’s tedious or difficult …
- … especially since schema design is supposed to be done before you start coding.
Some would further say that NoSQL is cheaper, scales better, is cooler or whatever, but given the range of NewSQL alternatives, those claims are often overstated.
Sectors where these reasons kick in include but are not limited to: Read more
|Categories: Health care, Investment research and trading, Log analysis, NewSQL, NoSQL, Web analytics||8 Comments|
I recently opined that, especially for cutting-edge internet businesses, analytic applications were not a realistic option; rather, analytic application subsystems are the most you can currently expect. Erin Griffith further observed that the problem isn’t just confined to analytics:
“We didn’t need 90 percent of the stuff they were offering, and when we told them what we did need — integration with social, curation tools, individual boutiques and analytics — they had nothing”
… a suitable solution to merge his editorial staff’s output with his separate site for selling tickets to events and goods … was not available, so had to build his own hybrid publishing and commerce platform. Likewise, Birchbox had to build a custom backend so that it could include videos and editorial content alongside its e-commerce site.
… it’s DIY or die.
With that as background, let’s consider why building business-to-consumer internet software is so complicated.
I’d suggest that a consumer website starts with four major conceptual parts: Read more
With Strata/Hadoop World being next week, there is much Hadoop discussion. One theme of the season is BI over Hadoop. I have at least 5 clients claiming they’re uniquely positioned to support that (most of whom partner with a 6th client, Tableau); the first 2 whose offerings I’ve actually written about are Teradata Aster and Hadapt. More generally, I’m hearing “Using Hadoop is hard; we’re here to make it easier for you.”
If enterprises aren’t yet happily running business intelligence against Hadoop, what are they doing with it instead? I took the opportunity to ask Cloudera, whose answers didn’t contradict anything I’m hearing elsewhere. As Cloudera tells it (approximately — this part of the conversation* was rushed): Read more
|Categories: Business intelligence, Cloudera, EAI, EII, ETL, ELT, ETLT, Hadoop, HBase, Health care, Investment research and trading, MapR, Market share and customer counts, Telecommunications, Web analytics||4 Comments|
In a short October, 2011 post about Datameer, I wrote:
Datameer is designed to let you do simple stuff on large amounts of data, where “large amounts of data” typically means data in Hadoop, and “simple stuff” includes basic versions of a spreadsheet, of BI, and of EtL (Extract/Transform/Load, without much in the way of T).
That’s all still mainly true, although with the recent Datameer 2.0:
- You can run Datameer and the underlying Hadoop on a desktop or workgroup group.
- There are some infographics pretty-picture-drawing capabilities, which will surely delight those who like vector-based HTML 5 pictures of coffee cups, saucers and macaroons.
- No doubt Datameer has been generally enhanced on multiple fronts.
In essence, Datameer has two positionings.
- One is “OK, you’ve got Hadoop — now wouldn’t you like to do something useful with it?” That can include both business intelligence and ETL.
- Beyond that, Datameer founder/CEO Stefan Groschupf’s core argument is that schema-on-read is really, really useful, even at the cost of absorbing a potentially large performance hit. In other words, he’s making a case for a form of non-relational BI.
|Categories: Business intelligence, Data models and architecture, Datameer, EAI, EII, ETL, ELT, ETLT, Hadoop, Log analysis, Market share and customer counts, Web analytics||8 Comments|
I talked with MemSQL shortly before today’s launch. MemSQL technology basics are:
- In-memory relational DBMS.
- Being released single-box only. Transparent sharding is under development for release in the fall. Basic replication is under development too.
- Subset of SQL-92.
- MySQL wire-compatible (SQL coverage issues excepted).
MemSQL’s performance claims include:
- Read performance 10% or so worse than memcached.
- Write performance 20% or so better than memcached.
- 1.2 million inserts/second on a 64-core, 1/2 TB of RAM machine.
- Similarly, 1/2 billion records loaded in under 20 minutes.
MemSQL company basics include: Read more
|Categories: Database compression, In-memory DBMS, Investment research and trading, Market share and customer counts, memcached, MemSQL, OLTP, Pricing, Web analytics||3 Comments|
I previously dropped a few hints about my clients at Metamarkets, mentioning that they:
- Have built vertical-market analytic platform technology.
- Use a lot of Hadoop.
- Throw good parties. (That’s where the background photo on my Twitter page comes from.)
But while they’re a joy to talk with, writing about Metamarkets has been frustrating, with many hours and pages of wasted of effort. Even so, I’m trying again, in a three-post series:
Much like Workday, Inc., Metamarkets is a SaaS (Software as a Service) company, with numerous tiers of servers and an affinity for doing things in RAM. That’s where most of the similarities end, however, as Metamarkets is a much smaller company than Workday, doing very different things.
Metamarkets’ business is SaaS (Software as a Service) business intelligence, on large data sets, with low latency in both senses (fresh data can be queried on, and the queries happen at RAM speed). As you might imagine, Metamarkets is used by digital marketers and other kinds of internet companies, whose data typically wants to be in the cloud anyway. Approximate metrics for Metamarkets (and it may well have exceeded these by now) include 10 customers, 100,000 queries/day, 80 billion 100-byte events/month (before summarization), 20 employees, 1 popular CEO, and a metric ton of venture capital.
To understand how Metamarkets’ technology works, it probably helps to start by realizing: Read more
Last November, I wrote two posts on agile predictive analytics. It’s time to return to the subject. I’m used to KXEN talking about the ability to do predictive modeling, very quickly, perhaps without professional statisticians; that the core of what KXEN does. But I was surprised when Revolution Analytics told me a similar story, based on a different approach, because ordinarily that’s not how R is used at all.
Ultimately, there seem to be three reasons why you’d want quick turnaround on your predictive modeling: Read more
|Categories: Business intelligence, Investment research and trading, KXEN, Predictive modeling and advanced analytics, Revolution Analytics, Telecommunications, Web analytics||10 Comments|
There are several reasons it’s hard to confirm great analytic user stories. First, there aren’t as many jaw-dropping use cases as one might think. For as I wrote about performance, new technology tends to make things better, but not radically so. After all, if its applications are …
… all that bloody important, then probably people have already been making do to get it done as best they can, even in an inferior way.
Further, some of the best stories are hard to confirm; even the famed beer/diapers story isn’t really true. Many application areas are hard to nail down due to confidentiality, especially but not only in such “adversarial” domains as anti-terrorism, anti-spam, or anti-fraud.
Even so, I have two questions in my inbox that boil down to “What are the coolest or most significant analytics stories out there?” So let’s round up some of what I know. Read more
|Categories: Analytic technologies, Google, Health care, Investment research and trading, Predictive modeling and advanced analytics, Scientific research, Telecommunications, Web analytics||6 Comments|