Log analysis

Discussion of how data warehousing and analytic technologies are applied to logfile analysis. Related subjects include:

The use of analytic technologies to study web and network event data

September 17, 2015

Rocana’s world

For starters:

My client Rocana is the renamed ScalingData, where Rocana is meant to signify ROot Cause ANAlysis.
Rocana was founded by Omer Trajman, who I’ve referenced numerous times in the past, and who I gather is a former boss of …
… cofounder Eric Sammer.
Rocana recently told me it had 35 people.
Rocana has a very small number of quite large customers.

Rocana portrays itself as offering next-generation IT operations monitoring software. As you might expect, this has two main use cases:

Actual operations — figuring out exactly what isn’t working, ASAP.
Security.

Rocana’s differentiation claims boil down to fast and accurate anomaly detection on large amounts of log data, including but not limited to:

The sort of network data you’d generally think of — “everything” except packet-inspection stuff.
Firewall output.
Database server logs.
Point-of-sale data (at a retailer).
“Application data”, whatever that means. (Edit: See Tom Yates’ clarifying comment below.)
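To make "anomaly detection on log data" concrete at its very simplest, here is a minimal sketch. It is not Rocana's method, which is not described here. It flags time buckets whose event count sits far from the median of the series, using the median absolute deviation so that one spike does not distort the baseline. The function name, the sample counts and the threshold are all assumptions made up for illustration.

```python
from statistics import median

def flag_anomalies(counts, threshold=5.0):
    """Return indexes of buckets whose count deviates strongly from the median.

    Uses the median absolute deviation (MAD), which a single spike cannot
    distort as easily as it would a mean and standard deviation.
    """
    mid = median(counts)
    mad = median([abs(c - mid) for c in counts])
    if mad == 0:
        # All typical buckets are identical; any different value stands out.
        return [i for i, c in enumerate(counts) if c != mid]
    return [i for i, c in enumerate(counts) if abs(c - mid) / mad > threshold]

# Hypothetical error-log lines per minute from one firewall.
errors_per_minute = [12, 15, 11, 14, 13, 240, 12, 16, 13, 14]
anomalous_minutes = flag_anomalies(errors_per_minute)
```

With the sample data only the sixth minute (index 5) is flagged. A production system would presumably need seasonality handling and per-source baselines, which this sketch ignores.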

Categories: Business intelligence, Hadoop, Kafka and Confluent, Log analysis, Market share and customer counts, Petabyte-scale data management, Predictive modeling and advanced analytics, Pricing, Rocana, Splunk, Web analytics

1 Comment

August 3, 2015

Data messes

A lot of what I hear and talk about boils down to “data is a mess”. Below is a very partial list of examples.

To a first approximation, one would expect operational data to be rather clean. After all, it drives and/or records business transactions. So if something goes awry, the result can be lost money, disappointed customers, or worse, and those are outcomes to be strenuously avoided. Up to a point, that’s indeed true, at least at businesses large enough to be properly automated. (Unlike, for example — 🙂 — mine.)

Even so, operational data has some canonical problems. First, it could be inaccurate; somebody can just misspell or otherwise botch an entry. Further, there are multiple ways data can be unreachable, typically because it’s:

Inconsistent, in which case humans might not know how to look it up and database JOINs might fail.
Unintegrated, in which case one application might not be able to use data that another happily maintains. (This is the classic data silo problem.)
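The JOIN failure mentioned above is easy to reproduce. The following is a small sketch using Python's built-in sqlite3 module; the table names, the sample rows and the `normalize` cleanup rule are hypothetical, chosen only to show one customer keyed two different ways.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (name TEXT, region TEXT);
    CREATE TABLE orders (customer_name TEXT, amount REAL);
    INSERT INTO customers VALUES ('Acme Corp', 'East');
    INSERT INTO orders VALUES ('ACME Corp.', 100.0), ('Acme Corp', 50.0);
""")

# A plain equality JOIN silently drops the order keyed as 'ACME Corp.'.
strict = conn.execute(
    "SELECT SUM(o.amount) FROM customers c "
    "JOIN orders o ON o.customer_name = c.name"
).fetchone()[0]

def normalize(name):
    """Hypothetical cleanup rule: lowercase, strip trailing periods and spaces."""
    return name.lower().rstrip(". ")

# Register the Python function so SQL can call it.
conn.create_function("normalize", 1, normalize)

# Joining on the normalized key recovers both orders.
cleaned = conn.execute(
    "SELECT SUM(o.amount) FROM customers c "
    "JOIN orders o ON normalize(o.customer_name) = normalize(c.name)"
).fetchone()[0]
```

Here `strict` totals 50.0 while `cleaned` totals 150.0. Nothing errors out in the strict case; the revenue just quietly goes missing, which is what makes this kind of mess hard to spot.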

Inconsistency can take multiple forms, including: Read more

Categories: Business intelligence, ClearStory Data, Data integration and middleware, Data warehousing, Databricks, Spark and BDAS, Derived data, EAI, EII, ETL, ELT, ETLT, Gooddata, Greenplum, Hadoop, Log analysis, Streaming and complex event processing (CEP), Web analytics

11 Comments

May 26, 2015

IT-centric notes on the future of health care

It’s difficult to project the rate of IT change in health care, because:

Health care is suffused with technology — IT, medical device and biotech alike — and hence has the potential for rapid change. However, it is also the case that …
… health care is heavily bureaucratic, political and regulated.

Timing aside, it is clear that health care change will be drastic. The IT part of that starts with vastly comprehensive electronic health records, which will be accessible (in part or whole as the case may be) by patients, care givers, care payers and researchers alike. I expect elements of such records to include:

The human-generated part of what’s in ordinary paper health records today, but across a patient’s entire lifetime. This of course includes notes created by doctors and other care-givers.
Large amounts of machine-generated data, including:
- The results of clinical tests. Continued innovation can be expected in testing, for reasons that include:
  - Most tests exploit electronic technology. Progress in electronics is intense.
  - Biomedical research is itself intense.
  - In particular, most research technologies (for example gene sequencing) can be made cheap enough over time to be affordable clinically.
- The output of consumer health-monitoring devices — e.g. Fitbit and its successors. The buzzword here is “quantified self”, but what it boils down to is that every moment of our lives will be measured and recorded.

The vastly greater amounts of data cited above will allow for greatly changed analytics.
Read more

Categories: Business intelligence, Cassandra, Data integration and middleware, Hadoop, HBase, Health care, Log analysis, MongoDB, NoSQL, Predictive modeling and advanced analytics, RDF and graphs, Splunk, Streaming and complex event processing (CEP), Surveillance and privacy, Text

6 Comments

May 13, 2015

Notes on analytic technology, May 13, 2015

1. There are multiple ways in which analytics is inherently modular. For example:

Business intelligence tools can reasonably be viewed as application development tools. But the “applications” may be developed one report at a time.
The point of a predictive modeling exercise may be to develop a single scoring function that is then integrated into a pre-existing operational application.
Conversely, a recommendation-driven website may be developed a few pages — and hence also a few recommendations — at a time.

Also, analytics is inherently iterative.

Everything I just called “modular” can reasonably be called “iterative” as well.
So can any work process of the nature “OK, we got an insight. Let’s pursue it and get more accuracy.”

If I’m right that analytics is or at least should be modular and iterative, it’s easy to see why people hate multi-year data warehouse creation projects. Perhaps it’s also easy to see why I like the idea of schema-on-need.

2. In 2011, I wrote, in the context of agile predictive analytics, that

… the “business analyst” role should be expanded beyond BI and planning to include lightweight predictive analytics as well.

I gather that a similar point is at the heart of Gartner’s new term citizen data scientist. I am told that the term resonates with at least some enterprises. Read more

Categories: Business intelligence, Data warehousing, Datameer, Hadoop, Log analysis, Oracle, Platfora, Predictive modeling and advanced analytics, SAS Institute, Software as a Service (SaaS), Tableau Software, Web analytics

2 Comments

March 5, 2015

Cask and CDAP

For starters:

Continuuity toured in 2012 and touted its “app server for Hadoop” technology.
Continuuity recently changed its name to Cask and went open source.
Cask’s product is now called CDAP (Cask Data Application Platform). It’s still basically an app server for Hadoop and other “big data” — ouch do I hate that phrase — data stores.
Cask and Cloudera partnered.
I got a more technical Cask briefing this week.

Also:

App servers are a notoriously amorphous technology. The focus of how they’re used can change greatly every couple of years.
Partly for that reason, I was unimpressed by Continuuity’s original hype-filled positioning.

So far as I can tell:

Cask’s current focus is to orchestrate job flows, with lots of data mappings.
This is supposed to provide lots of developer benefits, for fairly obvious reasons. Those are pitched in terms of an integration story, more in a “free you from the mess of a many-part stack” sense than strictly in terms of data integration.
CDAP already has a GUI to monitor what’s going on. A GUI to specify workflows is coming very soon.
CDAP doesn’t consume a lot of cycles itself, and hence isn’t a real risk for unpleasant overhead, if “overhead” is narrowly defined. Rather, performance drags could come from …
… sub-optimal choices in data mapping, database design or workflow composition.

Categories: Application servers, Cloudera, Data integration and middleware, Databricks, Spark and BDAS, EAI, EII, ETL, ELT, ETLT, Emulation, transparency, portability, Hadoop, Health care, Kafka and Confluent, Log analysis, Market share and customer counts, Schema on need, Software as a Service (SaaS), Splunk, Streaming and complex event processing (CEP), Telecommunications, WibiData

5 Comments

February 22, 2015

Data models

7-10 years ago, I repeatedly argued the viewpoints:

Relational DBMS were the right choice in most cases.
Multiple kinds of relational DBMS were needed, optimized for different kinds of use case.
There were a variety of specialized use cases in which non-relational data models were best.

Since then, however:

Hadoop has flourished.
NoSQL has flourished.
Graph DBMS have matured somewhat.
Much of the action has shifted to machine-generated data, of which there are many kinds.

So it’s probably best to revisit all that in a somewhat organized way.

Categories: Cassandra, Cloud computing, Data models and architecture, Data warehouse appliances, Data warehousing, Database diversity, Hadoop, In-memory DBMS, Log analysis, Mid-range, MongoDB, NoSQL, OLTP, RDF and graphs, Splunk, Structured documents

8 Comments

December 31, 2014

Notes on machine-generated data, year-end 2014

Most IT innovation these days is focused on machine-generated data (sometimes just called “machine data”), rather than human-generated. So as I find myself in the mood for another survey post, I can’t think of any better idea for a unifying theme.

1. There are many kinds of machine-generated data. Important categories include:

Web, network and other IT logs.
Game and mobile app event data.
CDRs (telecom Call Detail Records).
“Phone-home” data from large numbers of identical electronic products (for example set-top boxes).
Sensor network output (for example from a pipeline or other utility network).
Vehicle telemetry.
Health care data, in hospitals.
Digital health data from consumer devices.
Images from public-safety camera networks.
Stock tickers (if you regard them as being machine-generated, which I do).

That’s far from a complete list, but if you think about those categories you’ll probably capture most of the issues surrounding other kinds of machine-generated data as well.

2. Technology for better information and analysis is also technology for privacy intrusion. Public awareness of privacy issues is focused in a few areas, mainly: Read more

Categories: Ayasdi, Business intelligence, Data models and architecture, Databricks, Spark and BDAS, Games and virtual worlds, Health care, Investment research and trading, Kafka and Confluent, Log analysis, Memory-centric data management, NoSQL, Predictive modeling and advanced analytics, Splunk, Surveillance and privacy, Telecommunications, Web analytics

11 Comments

December 12, 2014

Notes and links, December 12, 2014

1. A couple years ago I wrote skeptically about integrating predictive modeling and business intelligence. I’m less skeptical now.

For starters:

The predictive experimentation I wrote about over Thanksgiving calls naturally for some BI/dashboarding to monitor how it’s going.
If you think about Nutonian’s pitch, it can be approximated as “Root-cause analysis so easy a business analyst can do it.” That could be interesting to jump to after BI has turned up anomalies. And it should be pretty easy to whip up a UI for choosing a data set and objective function to model on, since those are both things that the BI tool would know how to get to anyway.

I’ve also heard a couple of ideas about how predictive modeling can support BI. One is via my client Omer Trajman, whose startup ScalingData is still semi-stealthy, but says they’re “working at the intersection of big data and IT operations”. The idea goes something like this:

Suppose we have lots of logs about lots of things.* Machine learning can help:
- Notice what’s an anomaly.
- Group* together things that seem to be experiencing similar anomalies.
That can inform a BI-plus interface for a human to figure out what is happening.

Makes sense to me. (Edit: ScalingData subsequently launched, under the name Rocana.)

* The word “cluster” could have been used here in a couple of different ways, so I decided to avoid it altogether.
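The grouping step can be sketched in a few lines. This is only an illustration of the idea, not a description of ScalingData's technology: it assumes anomalies have already been detected per host and time window, and it groups hosts only when their anomalous windows match exactly. The host names and window indexes are invented.

```python
from collections import defaultdict

def group_by_anomaly_signature(anomalies_by_host):
    """Group hosts whose anomalies fall in exactly the same time windows.

    anomalies_by_host maps a host name to the set of window indexes that
    were flagged as anomalous for it. Hosts with no anomalies are skipped.
    """
    groups = defaultdict(list)
    for host, windows in sorted(anomalies_by_host.items()):
        if windows:
            groups[frozenset(windows)].append(host)
    return sorted(groups.values())

# Hypothetical detector output: host -> anomalous time windows.
anomalies = {
    "web-1": {4, 5},
    "web-2": {4, 5},
    "db-1": {9},
    "cache-1": set(),
}
groups = group_by_anomaly_signature(anomalies)
```

With this input the two web hosts land in one group and the database host in another. Exact matching is the crudest possible rule; a real system would more plausibly use some similarity measure over the anomaly patterns.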

Finally, I’m hearing a variety of “smart ETL/data preparation” and “we recommend what columns you should join” stories. I don’t know how much machine learning there’s been in those to date, but it’s usually at least on the roadmap to make the systems (yet) smarter in the future. The end benefit is usually to facilitate BI.

2. Discussion of graph DBMS can get confusing. For example: Read more

Categories: Business intelligence, Greenplum, Hadoop, Hortonworks, Log analysis, Neo Technology and Neo4j, Nutonian, Predictive modeling and advanced analytics, RDF and graphs, WibiData

5 Comments

October 26, 2014

Datameer at the time of Datameer 5.0

Datameer checked in, having recently announced general availability of Datameer 5.0. So far as I understood, Datameer is still clearly in the investigative analytics business, in that:

Datameer does business intelligence, but not at human real-time speeds. Datameer query durations are sometimes sub-minute, but surely not sub-second.
Datameer also does lightweight predictive analytics/machine learning — k-means clustering, decision trees, and so on.

Key aspects include:

Datameer runs straight against Hadoop.
Like many other analytic offerings, Datameer is meant to be “self-service”, for line-of-business business analysts, and includes some “data preparation”. Datameer also has had some data profiling since Datameer 4.0.
The main way of interacting with Datameer seems to be visual analytic programming. However, I gather Datameer has evolved somewhat away from its original spreadsheet metaphor.
Datameer’s primitives resemble those you’d find in SQL (e.g. JOINs, GROUPBYs). More precisely, that would be SQL with a sessionization extension; e.g., there’s a function called GROUPBYGAP.
Datameer lets you write derived data back into Hadoop.
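The exact semantics of GROUPBYGAP are not spelled out here, so the following is only a sketch of the general technique the name suggests: gap-based sessionization, in which a new session starts whenever the time since the previous event exceeds some threshold. The function name, the click times and the 30-minute cutoff are assumptions.

```python
def sessionize(timestamps, max_gap):
    """Split event times into sessions wherever the gap exceeds max_gap."""
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= max_gap:
            # Close enough to the previous event: same session.
            sessions[-1].append(t)
        else:
            # First event, or the gap was too long: start a new session.
            sessions.append([t])
    return sessions

# Hypothetical click times for one user, in seconds since the first click.
clicks = [0, 40, 95, 4000, 4010, 9000]
sessions = sessionize(clicks, max_gap=1800)
```

The six clicks fall into three sessions. Expressing this in plain SQL usually takes window functions and some care, which is presumably why a dedicated primitive is attractive.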

Categories: Business intelligence, Databricks, Spark and BDAS, Datameer, Hadoop, Log analysis, Market share and customer counts, Predictive modeling and advanced analytics, Web analytics

7 Comments

October 10, 2014

Notes on predictive modeling, October 10, 2014

As planned, I’m getting more active in predictive modeling. Anyhow …

1. I still believe most of what I said in a July, 2013 predictive modeling catch-all post. However, I haven’t heard as much subsequently about Ayasdi as I had expected to.

2. The most controversial part of that post was probably the claim:

I think the predictive modeling state of the art has become:

Cluster in some way.

Model separately on each cluster.

In particular:

It is always possible to instead go with a single model formally.
A lot of people think accuracy, ease-of-use, or both are better served by a true single-model approach.
Conversely, if you have a single model that’s pretty good, it’s natural to look at the subset of the data for which it works poorly and examine that first. Voila! You’ve just done a kind of clustering.
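A toy version of "cluster, then model separately on each cluster" might look like the sketch below. Everything in it is an assumption for illustration: the data are synthetic, the clustering is a crude one-dimensional two-means split on x, and the per-cluster models are ordinary least-squares lines. It then compares the squared error against a single line fitted to all the data.

```python
from statistics import mean

def fit_line(points):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in points) / sxx
    return a, my - a * mx

def squared_error(points, model):
    """Sum of squared residuals of the points under the line (a, b)."""
    a, b = model
    return sum((y - (a * x + b)) ** 2 for x, y in points)

def two_means_split(points, rounds=10):
    """Crude 1-D k-means (k=2) on x; returns the low and high clusters."""
    xs = [x for x, _ in points]
    low_c, high_c = min(xs), max(xs)
    for _ in range(rounds):
        low = [p for p in points if abs(p[0] - low_c) <= abs(p[0] - high_c)]
        high = [p for p in points if abs(p[0] - low_c) > abs(p[0] - high_c)]
        low_c = mean([x for x, _ in low])
        high_c = mean([x for x, _ in high])
    return low, high

# Synthetic data with two regimes: y = 2x for small x, y = 50 - x for large x.
data = [(x, 2 * x) for x in range(1, 6)] + [(x, 50 - x) for x in range(21, 26)]

single_error = squared_error(data, fit_line(data))

low, high = two_means_split(data)
cluster_error = (squared_error(low, fit_line(low))
                 + squared_error(high, fit_line(high)))
```

On this deliberately favorable data the per-cluster lines fit essentially perfectly, while the single line leaves a sizable error. Real data is rarely this obliging, which is part of why the claim is controversial.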

3. Nutonian is now a client. I just had my first meeting with them this week. To a first approximation, they’re somewhat like KXEN (sophisticated math, non-linear models, ease of modeling, quasi-automagic feature selection), but with differences that start: Read more

Categories: Ayasdi, Databricks, Spark and BDAS, Log analysis, Nutonian, Predictive modeling and advanced analytics, Revolution Analytics, Scientific research, Web analytics

9 Comments
