Discussion of how data warehousing and analytic technologies are applied to clickstream analysis and other web analytics challenges.
Databricks CEO Ali Ghodsi checked in because he disagreed with part of my recent post about Databricks. Ali’s take on Databricks’ position in the Spark world includes:
- What I called Databricks’ “secondary business” of “licensing stuff to Spark distributors” was really about second/third-tier support. Fair enough. But distributors of stacks that include Spark, whether on-premises, in the cloud, or both, may in many cases be viewed as competitors to Databricks’ cloud-only service. So why should Databricks help them?
- Databricks’ investment in Spark Summit and similar evangelism is larger than I realized.
- Ali suggests that the fraction of Databricks’ engineering devoted to open source Spark is greater than I understood during my recent visit.
Ali also walked me through customer use cases and adoption in wonderful detail. In general:
- A large majority of Databricks customers have machine learning use cases.
- Predicting and preventing user/customer churn is a huge issue across multiple market sectors.
The story on those sectors, per Ali, is: Read more
Five years ago, in a taxonomy of analytic business benefits, I wrote:
A large fraction of all analytic efforts ultimately serve one or more of three purposes:
- Problem and anomaly detection and diagnosis
- Planning and optimization
That continues to be true today. Now let’s add a bit of spin.
1. A large fraction of analytics is adversarial. In particular: Read more
Numerous tussles fit the template:
- A government wants access to data contained in one or more devices (mobile/personal or server as the case may be).
- The computer’s manufacturer or operator doesn’t want to provide it, for reasons including:
- That’s what customers prefer.
- That’s what other governments require.
- Being pro-liberty is the right and moral choice. (Yes, right and wrong do sometimes actually come into play.)
As a general rule, what’s best for any kind of company is — pricing and so on aside — whatever is best or most pleasing for their customers or users. This would suggest that it is in tech companies’ best interest to favor privacy, but there are two important quasi-exceptions: Read more
I last wrote about Couchbase in November, 2012, around the time of Couchbase 2.0. One of the many new features I mentioned then was secondary indexing. Ravi Mayuram just checked in to tell me about Couchbase 4.0. One of the important new features he mentioned was what I think he said was Couchbase’s “first version” of secondary indexing. Obviously, I’m confused.
Now that you’re duly warned, let me remind you of aspects of the Couchbase timeline.
- Two corporate name changes ago, Couchbase was organized to commercialize memcached. memcached, of course, was internet companies’ default way to scale out short-request processing before the rise of NoSQL, typically backed by manually sharded MySQL. (A sketch of that caching pattern appears right after this list.)
- Couchbase’s original value proposition, under the name Membase, was to provide persistence and, of course, support for memcached. This later grew into a caching-oriented pitch even to customers who weren’t already memcached users.
- A merger with the makers of CouchDB ensued, with the intention of replacing Membase’s SQLite back end with CouchDB at the same time as JSON support was introduced. This went badly.
- By now, however, Couchbase sells for more than distributed cache use cases. Ravi rattled off a variety of big-name customer examples for system-of-record kinds of use cases, especially in session logging (duh) and also in travel reservations.
- Couchbase 4.0 has been in beta for a few months.
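For readers who don’t remember that era, here is a minimal sketch of the cache-aside pattern memcached served, with a plain dict standing in for the memcached tier and a list of dicts standing in for manually sharded MySQL. All names and the sharding rule are invented purely for illustration.

```python
cache = {}  # stands in for a memcached cluster

def get_user(user_id, shards):
    """Cache-aside read: check the cache first; on a miss, read the
    appropriate database shard, then populate the cache for next time."""
    key = "user:%d" % user_id
    if key in cache:
        return cache[key]                  # cache hit
    shard = shards[user_id % len(shards)]  # toy manual-sharding rule
    row = shard.get(user_id)
    if row is not None:
        cache[key] = row                   # cache fill
    return row

# Two toy "shards", each a dict keyed by user id:
shards = [{0: {"name": "alice"}}, {1: {"name": "bob"}}]
print(get_user(1, shards))  # miss: reads shard 1, then caches
print(get_user(1, shards))  # hit: served from the cache
```

Membase’s persistence pitch, as I understood it, amounted to making that cache tier itself durable.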
Technical notes on Couchbase 4.0 — and related riffs — start: Read more
- My client Rocana is the renamed ScalingData, where Rocana is meant to signify ROot Cause ANAlysis.
- Rocana was founded by Omer Trajman, who I’ve referenced numerous times in the past, and who I gather is a former boss of …
- … cofounder Eric Sammer.
- Rocana recently told me it had 35 people.
- Rocana has a very small number of quite large customers.
Rocana portrays itself as offering next-generation IT operations monitoring software. As you might expect, this has two main use cases:
- Actual operations — figuring out exactly what isn’t working, ASAP.
Rocana’s differentiation claims boil down to fast and accurate anomaly detection on large amounts of log data, including but not limited to:
- The sort of network data you’d generally think of — “everything” except packet-inspection stuff.
- Firewall output.
- Database server logs.
- Point-of-sale data (at a retailer).
- “Application data”, whatever that means. (Edit: See Tom Yates’ clarifying comment below.)
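Purely by way of illustration (a generic toy, emphatically not Rocana’s method), the simplest form of anomaly detection on log volumes flags intervals whose event counts deviate sharply from recent history:

```python
import statistics

def flag_anomalies(counts, window=24, threshold=3.0):
    """Flag indexes whose count sits more than `threshold` standard
    deviations from the mean of the trailing `window` intervals."""
    anomalies = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mean = statistics.mean(history)
        stdev = statistics.pstdev(history) or 1.0  # avoid divide-by-zero
        if abs(counts[i] - mean) / stdev > threshold:
            anomalies.append(i)
    return anomalies

# Hourly error counts, with a sudden spike in the final hour:
counts = [10, 12, 9, 11, 10, 13, 8, 11, 10, 12] * 3
counts.append(95)
print(flag_anomalies(counts))  # [30]
```

Real products obviously have to do much better than this, e.g. handling seasonality and operating fast at very large data volumes, which is where the differentiation claims come in.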
A lot of what I hear and talk about boils down to “data is a mess”. Below is a very partial list of examples.
To a first approximation, one would expect operational data to be rather clean. After all, it drives and/or records business transactions. So if something goes awry, the result can be lost money, disappointed customers, or worse, and those are outcomes to be strenuously avoided. Up to a point, that’s indeed true, at least at businesses large enough to be properly automated. (Unlike, for example, mine.)
Even so, operational data has some canonical problems. First, it could be inaccurate; somebody can just misspell or otherwise botch an entry. Further, there are multiple ways data can be unreachable, typically because it’s:
- Inconsistent, in which case humans might not know how to look it up and database JOINs might fail. (A tiny example appears below.)
- Unintegrated, in which case one application might not be able to use data that another happily maintains. (This is the classic data silo problem.)
Inconsistency can take multiple forms, including: Read more
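Returning to the JOIN point: here is a tiny self-contained example, with invented table and customer names, of inconsistent entries silently defeating a join:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (name TEXT, region TEXT);
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO customers VALUES ('Acme Corp.', 'East');
    -- The same customer, entered inconsistently by another application:
    INSERT INTO orders VALUES ('ACME CORP', 150.0);
""")
rows = con.execute("""
    SELECT c.name, o.amount
    FROM customers c JOIN orders o ON o.customer = c.name
""").fetchall()
print(rows)  # [] -- the order is silently dropped from the join
```

No error is raised; the data is simply unreachable through that join path, which is exactly the problem.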
1. There are multiple ways in which analytics is inherently modular. For example:
- Business intelligence tools can reasonably be viewed as application development tools. But the “applications” may be developed one report at a time.
- The point of a predictive modeling exercise may be to develop a single scoring function that is then integrated into a pre-existing operational application.
- Conversely, a recommendation-driven website may be developed a few pages — and hence also a few recommendations — at a time.
Also, analytics is inherently iterative.
- Everything I just called “modular” can reasonably be called “iterative” as well.
- So can any work process of the nature “OK, we got an insight. Let’s pursue it and get more accuracy.”
If I’m right that analytics is or at least should be modular and iterative, it’s easy to see why people hate multi-year data warehouse creation projects. Perhaps it’s also easy to see why I like the idea of schema-on-need.
2. In 2011, I wrote, in the context of agile predictive analytics, that
… the “business analyst” role should be expanded beyond BI and planning to include lightweight predictive analytics as well.
I gather that a similar point is at the heart of Gartner’s new term citizen data scientist. I am told that the term resonates with at least some enterprises. Read more
Most IT innovation these days is focused on machine-generated data (sometimes just called “machine data”), rather than human-generated. So as I find myself in the mood for another survey post, I can’t think of any better idea for a unifying theme.
1. There are many kinds of machine-generated data. Important categories include:
- Web, network and other IT logs.
- Game and mobile app event data.
- CDRs (telecom Call Detail Records).
- “Phone-home” data from large numbers of identical electronic products (for example set-top boxes).
- Sensor network output (for example from a pipeline or other utility network).
- Vehicle telemetry.
- Health care data, in hospitals.
- Digital health data from consumer devices.
- Images from public-safety camera networks.
- Stock tickers (if you regard them as being machine-generated, which I do).
That’s far from a complete list, but if you think about those categories you’ll probably capture most of the issues surrounding other kinds of machine-generated data as well.
2. Technology for better information and analysis is also technology for privacy intrusion. Public awareness of privacy issues is focused in a few areas, mainly: Read more
Datameer checked in, having recently announced general availability of Datameer 5.0. So far as I understood, Datameer is still clearly in the investigative analytics business, in that:
- Datameer does business intelligence, but not at human real-time speeds. Datameer query durations are sometimes sub-minute, but surely not sub-second.
- Datameer also does lightweight predictive analytics/machine learning — k-means clustering, decision trees, and so on.
Key aspects include:
- Datameer runs straight against Hadoop.
- Like many other analytic offerings, Datameer is meant to be “self-service”, for line-of-business business analysts, and includes some “data preparation”. Datameer also has had some data profiling since Datameer 4.0.
- The main way of interacting with Datameer seems to be visual analytic programming. However, I gather Datameer has evolved somewhat away from its original spreadsheet metaphor.
- Datameer’s primitives resemble those you’d find in SQL (e.g. JOINs, GROUPBYs). More precisely, that would be SQL with a sessionization extension; e.g., there’s a function called GROUPBYGAP. (A toy sketch of gap-based sessionization appears after this list.)
- Datameer lets you write derived data back into Hadoop.
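I don’t know GROUPBYGAP’s exact semantics, but gap-based sessionization in general is easy to sketch: group a user’s events into sessions, starting a new session whenever the gap since that user’s previous event exceeds a threshold. A minimal version, in which the function name and the 30-minute default are my assumptions rather than anything Datameer-specific:

```python
from datetime import datetime, timedelta

def sessionize(events, gap=timedelta(minutes=30)):
    """Group (user, timestamp) events into per-user sessions, starting
    a new session whenever the gap since the user's previous event
    exceeds `gap`."""
    sessions = {}  # user -> list of sessions, each a list of timestamps
    for user, ts in sorted(events):
        user_sessions = sessions.setdefault(user, [])
        if user_sessions and ts - user_sessions[-1][-1] <= gap:
            user_sessions[-1].append(ts)  # continue the current session
        else:
            user_sessions.append([ts])    # start a new session
    return sessions

events = [
    ("alice", datetime(2015, 1, 1, 9, 0)),
    ("alice", datetime(2015, 1, 1, 9, 10)),
    ("alice", datetime(2015, 1, 1, 11, 0)),  # >30 minutes: new session
    ("bob",   datetime(2015, 1, 1, 9, 5)),
]
for user, user_sessions in sessionize(events).items():
    print(user, [len(s) for s in user_sessions])  # alice [2, 1] / bob [1]
```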
As planned, I’m getting more active in predictive modeling. Anyhow …
1. I still believe most of what I said in a July, 2013 predictive modeling catch-all post. However, I haven’t heard as much subsequently about Ayasdi as I had expected to.
2. The most controversial part of that post was probably the claim:
I think the predictive modeling state of the art has become:
- Cluster in some way.
- Model separately on each cluster.
A few notes on that claim:
- It is always possible, formally speaking, to go with a single model instead.
- A lot of people think accuracy, ease-of-use, or both are better served by a true single-model approach.
- Conversely, if you have a single model that’s pretty good, it’s natural to look at the subset of the data for which it works poorly and examine that first. Voila! You’ve just done a kind of clustering.
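For concreteness, here is a minimal sketch of the cluster-then-model recipe, using scikit-learn and synthetic data. The tooling and all parameters are my choices for illustration; the claim above isn’t tied to any particular library.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(600, 2))
# Synthetic target whose behavior differs by region of feature space,
# the situation in which per-cluster models can beat one global model:
y = np.where(X[:, 0] > 0, 2.0 * X[:, 1] + 1.0, -1.5 * X[:, 1])
y += rng.normal(0, 0.1, size=600)

# Step 1: cluster in some way.
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Step 2: model separately on each cluster.
for c in np.unique(clusters):
    mask = clusters == c
    model = LinearRegression().fit(X[mask], y[mask])
    print("cluster %d: n=%3d, R^2=%.3f"
          % (c, mask.sum(), model.score(X[mask], y[mask])))
```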
3. Nutonian is now a client. I just had my first meeting with them this week. To a first approximation, they’re somewhat like KXEN (sophisticated math, non-linear models, ease of modeling, quasi-automagic feature selection), but with differences that start: Read more