Discussion of how data warehousing and analytic technologies are applied to clickstream analysis and other web analytics challenges.
In a short October, 2011 post about Datameer, I wrote:
Datameer is designed to let you do simple stuff on large amounts of data, where “large amounts of data” typically means data in Hadoop, and “simple stuff” includes basic versions of a spreadsheet, of BI, and of EtL (Extract/Transform/Load, without much in the way of T).
That’s all still mainly true, although with the recent Datameer 2.0:
- You can run Datameer and the underlying Hadoop on a desktop or for a workgroup.
- There are some infographics pretty-picture-drawing capabilities, which will surely delight those who like vector-based HTML 5 pictures of coffee cups, saucers and macaroons.
- No doubt Datameer has been generally enhanced on multiple fronts.
In essence, Datameer has two positionings.
- One is “OK, you’ve got Hadoop — now wouldn’t you like to do something useful with it?” That can include both business intelligence and ETL.
- Beyond that, Datameer founder/CEO Stefan Groschupf’s core argument is that schema-on-read is really, really useful, even at the cost of absorbing a potentially large performance hit. In other words, he’s making a case for a form of non-relational BI.
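To make the schema-on-read point concrete, here is a minimal sketch in plain Python. It is not Datameer's implementation or API; the sample records, the `SCHEMA` mapping, and the function names are all hypothetical. The idea it illustrates is only that raw data is stored as it arrived, and a schema is imposed each time the data is read.

```python
import json

# Raw log lines are stored exactly as they arrived; nothing is enforced on write.
# (Hypothetical sample data, for illustration only.)
RAW_EVENTS = [
    '{"user": "a", "page": "/home", "ms": "120"}',
    '{"user": "b", "page": "/cart", "ms": "340", "referrer": "ad"}',
    '{"user": "a", "page": "/cart"}',
]

# Hypothetical schema chosen at query time: field name -> (converter, default).
SCHEMA = {
    "user": (str, None),
    "page": (str, None),
    "ms": (int, 0),
}


def read_with_schema(lines, schema):
    """Parse each raw line and coerce it to the schema only when it is read."""
    for line in lines:
        record = json.loads(line)
        yield {
            name: convert(record[name]) if name in record else default
            for name, (convert, default) in schema.items()
        }


def total_ms_by_page(lines, schema):
    """A spreadsheet-style aggregate computed over the schema-on-read view."""
    totals = {}
    for row in read_with_schema(lines, schema):
        totals[row["page"]] = totals.get(row["page"], 0) + row["ms"]
    return totals


if __name__ == "__main__":
    print(total_ms_by_page(RAW_EVENTS, SCHEMA))
```

Every query re-parses and re-coerces the raw lines, which is where the performance hit comes from. In exchange, fields the schema does not mention (such as `referrer`) and records with missing fields load without any up-front modeling.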
|Categories: Business intelligence, Data models and architecture, Datameer, EAI, EII, ETL, ELT, ETLT, Hadoop, Log analysis, Market share and customer counts, Web analytics||8 Comments|
I talked with MemSQL shortly before today’s launch. MemSQL technology basics are:
- In-memory relational DBMS.
- Being released single-box only. Transparent sharding is under development for release in the fall. Basic replication is under development too.
- Subset of SQL-92.
- MySQL wire-compatible (SQL coverage issues excepted).
MemSQL’s performance claims include:
- Read performance 10% or so worse than memcached.
- Write performance 20% or so better than memcached.
- 1.2 million inserts/second on a 64-core, 1/2 TB of RAM machine.
- Similarly, 1/2 billion records loaded in under 20 minutes.
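As a rough sanity check on how the last two claims relate, the arithmetic can be sketched as below. The constants are the figures quoted above; the function names are hypothetical, and the calculation assumes the load ran on comparable hardware, which the claims do not actually state.

```python
# Figures quoted in the performance claims.
INSERTS_PER_SECOND = 1_200_000       # claimed insert rate
RECORDS_LOADED = 500_000_000         # "1/2 billion records"
LOAD_SECONDS_UPPER_BOUND = 20 * 60   # "under 20 minutes"


def implied_rate(records, seconds):
    """Average records/second needed to load `records` in `seconds`."""
    return records / seconds


def seconds_at_rate(records, rate):
    """Time in seconds to load `records` at a sustained `rate` per second."""
    return records / rate


# Lower bound on the sustained rate implied by "under 20 minutes".
min_rate = implied_rate(RECORDS_LOADED, LOAD_SECONDS_UPPER_BOUND)

# Load time if the claimed insert rate were sustained throughout.
best_case_seconds = seconds_at_rate(RECORDS_LOADED, INSERTS_PER_SECOND)

if __name__ == "__main__":
    print(f"implied minimum rate: {min_rate:,.0f} records/second")
    print(f"time at claimed rate: {best_case_seconds / 60:.1f} minutes")
```

So "under 20 minutes" only requires a sustained rate of roughly 417,000 records/second, about a third of the headline insert figure; at the full 1.2 million/second the same load would take about 7 minutes.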
MemSQL company basics include: Read more
|Categories: Database compression, In-memory DBMS, Investment research and trading, Market share and customer counts, memcached, MemSQL, OLTP, Pricing, Web analytics||3 Comments|
I previously dropped a few hints about my clients at Metamarkets, mentioning that they:
- Have built vertical-market analytic platform technology.
- Use a lot of Hadoop.
- Throw good parties. (That’s where the background photo on my Twitter page comes from.)
But while they’re a joy to talk with, writing about Metamarkets has been frustrating, with many hours and pages of wasted effort. Even so, I’m trying again, in a three-post series:
Much like Workday, Inc., Metamarkets is a SaaS (Software as a Service) company, with numerous tiers of servers and an affinity for doing things in RAM. That’s where most of the similarities end, however, as Metamarkets is a much smaller company than Workday, doing very different things.
Metamarkets’ business is SaaS business intelligence, on large data sets, with low latency in both senses (fresh data can be queried, and the queries run at RAM speed). As you might imagine, Metamarkets is used by digital marketers and other kinds of internet companies, whose data typically wants to be in the cloud anyway. Approximate metrics for Metamarkets (and it may well have exceeded these by now) include 10 customers, 100,000 queries/day, 80 billion 100-byte events/month (before summarization), 20 employees, 1 popular CEO, and a metric ton of venture capital.
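For a sense of scale, those approximate metrics can be turned into rates with a little arithmetic. This is a back-of-the-envelope sketch: the 30-day month and the even spread of load across time are my assumptions, not anything Metamarkets reported.

```python
# Approximate metrics quoted above.
EVENTS_PER_MONTH = 80_000_000_000
BYTES_PER_EVENT = 100
QUERIES_PER_DAY = 100_000
CUSTOMERS = 10

# Assumptions for the estimate (not from Metamarkets): a 30-day month,
# with events and queries spread evenly over time.
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30

raw_bytes_per_month = EVENTS_PER_MONTH * BYTES_PER_EVENT
events_per_second = EVENTS_PER_MONTH / (DAYS_PER_MONTH * SECONDS_PER_DAY)
queries_per_second = QUERIES_PER_DAY / SECONDS_PER_DAY
events_per_customer = EVENTS_PER_MONTH // CUSTOMERS

if __name__ == "__main__":
    print(f"raw data: {raw_bytes_per_month / 1e12:.0f} TB/month before summarization")
    print(f"ingest:   {events_per_second:,.0f} events/second on average")
    print(f"queries:  {queries_per_second:.2f} queries/second on average")
    print(f"average:  {events_per_customer:,} events/customer/month")
```

That works out to about 8 TB of raw events per month, arriving at roughly 31,000 events/second, against an average of a little over one query per second. Peaks are presumably well above those averages.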
To understand how Metamarkets’ technology works, it probably helps to start by realizing: Read more
Last November, I wrote two posts on agile predictive analytics. It’s time to return to the subject. I’m used to KXEN talking about the ability to do predictive modeling, very quickly, perhaps without professional statisticians; that’s the core of what KXEN does. But I was surprised when Revolution Analytics told me a similar story, based on a different approach, because ordinarily that’s not how R is used at all.
Ultimately, there seem to be three reasons why you’d want quick turnaround on your predictive modeling: Read more
|Categories: Business intelligence, Investment research and trading, KXEN, Predictive modeling and advanced analytics, Revolution Analytics, Telecommunications, Web analytics||10 Comments|
There are several reasons it’s hard to confirm great analytic user stories. First, there aren’t as many jaw-dropping use cases as one might think. For as I wrote about performance, new technology tends to make things better, but not radically so. After all, if its applications are …
… all that bloody important, then probably people have already been making do to get it done as best they can, even in an inferior way.
Further, some of the best stories are hard to confirm; even the famed beer/diapers story isn’t really true. Many application areas are hard to nail down due to confidentiality, especially but not only in such “adversarial” domains as anti-terrorism, anti-spam, or anti-fraud.
Even so, I have two questions in my inbox that boil down to “What are the coolest or most significant analytics stories out there?” So let’s round up some of what I know. Read more
|Categories: Analytic technologies, Google, Health care, Investment research and trading, Predictive modeling and advanced analytics, Scientific research, Telecommunications, Web analytics||6 Comments|
It is a reasonable (over)simplification to say that my business boils down to:
- Advising vendors what/how to sell.
- Advising users what/how to buy.
One complication that commonly creeps in is that different groups of users have different buying practices and technology needs. Usually, I nod to that point in passing, perhaps by listing different application areas for a company or product. But now let’s address it head on. Whether or not you care about the particulars, I hope the sheer length of this post reminds you that there are many different market segments out there.
Last June I wrote:
In almost any IT decision, there are a number of environmental constraints that need to be acknowledged. Organizations may have standard vendors, favored vendors, or simply vendors who give them particularly deep discounts. Legacy systems are in place, application and system alike, and may or may not be open to replacement. Enterprises may have on-premise or off-premise preferences; SaaS (Software as a Service) vendors probably have multitenancy concerns. Your organization can determine which aspects of your system you’d ideally like to see be tightly integrated with each other, and which you’d prefer to keep only loosely coupled. You may have biases for or against open-source software. You may be pro- or anti-appliance. Some applications have a substantial need for elastic scaling. And some kinds of issues cut across multiple areas, such as budget, timeframe, security, or trained personnel.
I’d further say that it matters whether the buyer:
- Is a large central IT organization.
- Is the well-staffed IT organization of a particular business department.
- Is a small, frazzled IT organization.
- Has strong engineering or technical skills, but less in the way of IT specialists.
- Is trying to skate by without much technical knowledge of any kind.
Now let’s map those considerations (and others) to some specific market segments. Read more
|Categories: Data mart outsourcing, Games and virtual worlds, IBM and DB2, Investment research and trading, Microsoft and SQL*Server, Open source, Software as a Service (SaaS), Telecommunications, Web analytics||9 Comments|
Edit: Multiple errors in the post below have been corrected in a follow-on post about DataStax Enterprise and Cassandra.
My client DataStax is announcing DataStax Enterprise 2.0. The big point of the release is that there’s a bunch of stuff integrated together, including at least:
- Cassandra — the NoSQL DBMS, which DataStax sometimes calls “DataStax Server”. Edit: That’s not really a fair criticism of DataStax’s messaging.
- Hadoop MapReduce, which DataStax sometimes calls “Hadoop”. Edit: That is indeed fair.
- Sqoop — the general way to connect relational DBMS to Hadoop, which DataStax sometimes calls “RDBMS integration”.
- Solr — the search-centric Apache project, or big parts of it, which DataStax generally calls either “Solr” or “Solr compatibility”.
- log4j – an Apache project that has something or other to do with logging, or parts of it, which DataStax sometimes calls “log file integration”.
- DataStax OpsCenter — some management tools and so on around Cassandra and the rest of the product line.
DataStax stresses that all this runs on the same cluster, with the same administrative tools and so on. For example, on a single cluster:
- You can manage the interactive data for a web site.
- You can store the logs for that website.
- You can analyze all of the above in Hadoop.
|Categories: Cassandra, Clustering, DataStax, EAI, EII, ETL, ELT, ETLT, Games and virtual worlds, Hadoop, Log analysis, Market share and customer counts, NoSQL, Parallelization, Text, Web analytics||5 Comments|
There’s a growing consensus that consumers require limits on the predictive modeling that is done about them. That’s a theme of the Obama Administration’s recent work on consumer data privacy; it’s central to other countries’ data retention regulations; and it’s specifically borne out by the recent Target-pursues-pregnant-women example. Whatever happens legally, I believe this also calls for a technical response, namely:
Consumers should be shown key factual and psychographic aspects of how they are modeled, and be given the chance to insist that marketers disregard any or all of those aspects.
I further believe that the resulting technology should be extended so that
information holders can collaborate by exchanging estimates for such key factors, rather than exchanging the underlying data itself.
To some extent this happens today, for example with attribution/de-anonymization or with credit scores; but I think it should be taken to another level of granularity.
My name for all this is translucent modeling, rather than “transparent”, the idea being that key points must be visible, but the finer details can be safely obscured.
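One way to picture the mechanics is the toy sketch below. It is purely illustrative: the class, the factor names, and the linear scoring are all hypothetical, not a description of any existing system. It shows the three pieces of the proposal: the consumer sees the key factors, can insist that some be disregarded, and other information holders receive only estimates, never the underlying data.

```python
from dataclasses import dataclass, field


@dataclass
class ConsumerProfile:
    """Hypothetical translucent profile: key factors visible, details hidden."""

    # Key modeled factors: name -> estimated value in [0, 1].
    factors: dict
    # Factors the consumer has insisted that marketers disregard.
    suppressed: set = field(default_factory=set)

    def visible_factors(self):
        """What the consumer is shown: each key factor and whether it is in use."""
        return {
            name: {"estimate": value, "in_use": name not in self.suppressed}
            for name, value in self.factors.items()
        }

    def suppress(self, name):
        """Record the consumer's instruction to disregard one factor."""
        if name not in self.factors:
            raise KeyError(name)
        self.suppressed.add(name)

    def shareable_estimates(self):
        """Estimates another information holder may receive.

        Only non-suppressed factor estimates are returned; the raw data
        behind them is never part of the profile.
        """
        return {n: v for n, v in self.factors.items() if n not in self.suppressed}


def score(profile, weights):
    """Toy linear score that ignores every suppressed factor."""
    usable = profile.shareable_estimates()
    return sum(weights.get(name, 0.0) * value for name, value in usable.items())


if __name__ == "__main__":
    profile = ConsumerProfile(
        {"expecting_child": 0.9, "price_sensitive": 0.4, "sports_fan": 0.7}
    )
    weights = {"expecting_child": 2.0, "price_sensitive": 1.0, "sports_fan": 0.5}
    print(score(profile, weights))
    profile.suppress("expecting_child")
    print(profile.visible_factors())
    print(score(profile, weights))
```

The finer details (how each estimate was derived, and from what data) stay obscured; only the short list of key factors is exposed and controllable, which is the sense in which the modeling is translucent rather than transparent.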
Examples of dialog I think marketers should have with consumers include: Read more
|Categories: Predictive modeling and advanced analytics, Surveillance and privacy, Web analytics||Leave a Comment|
The most straightforward approach to the applications business is:
- Take general-purpose technology and think through how to apply it to a specific application domain.
- Produce packaged application software accordingly.
However, this strategy is not as successful in analytics as in the transactional world, for two main reasons:
- Analytic applications of that kind are rarely complete.
- Incomplete applications rarely sell well.
I first realized all this about a decade ago, after Henry Morris coined the term analytic applications and business intelligence companies thought it was their future. In particular, when Dave Kellogg ran marketing for Business Objects, he rattled off an argument to the effect that Business Objects had generated more analytic app revenue over the lifetime of the company than Cognos had. I retorted, with only mild hyperbole, that the lifetime numbers he was citing amounted to “a bad week for SAP”. Somewhat hoist by his own petard, Dave quickly conceded that he agreed with my skepticism, and we changed the subject accordingly.
Reasons that analytic applications are commonly less complete than the transactional kind include: Read more
|Categories: Business intelligence, Business Objects, Data mart outsourcing, Investment research and trading, Log analysis, Metamarkets and Druid, Oracle, SAP AG, SAS Institute, Web analytics, WibiData||16 Comments|
I checked in with James Phillips for a Couchbase update, and I understand better what’s going on. In particular:
- Give or take minor tweaks, what I wrote in my August, 2010 Couchbase updates still applies.
- Couchbase now and for the foreseeable future has one product line, called Couchbase.
- Couchbase 2.0, the first version of Couchbase (the product) to use CouchDB for persistence, has slipped …
- … because more parts of CouchDB had to be rewritten for performance than Couchbase (the company) had hoped.
- Think mid-year or so for the release of Couchbase 2.0, hopefully sooner.
- In connection with the need to rewrite parts of CouchDB, Couchbase has:
  - Gotten out of the single-server CouchDB business.
  - Donated its proprietary single-server CouchDB intellectual property to the Apache Foundation.
- The 150ish new customers in 2011 Couchbase brags about are real, subscription customers.
- Couchbase has 60ish people, headed to >100 over the next few months.
|Categories: Basho and Riak, Cassandra, Couchbase, CouchDB, DataStax, Market share and customer counts, MongoDB and 10gen, NoSQL, Open source, Parallelization, Web analytics, Zynga||6 Comments|