Most IT innovation these days is focused on machine-generated data (sometimes just called “machine data”), rather than on human-generated data. So, finding myself in the mood for another survey post, I can’t think of a better unifying theme.
1. There are many kinds of machine-generated data. Important categories include:
- Web, network and other IT logs.
- Game and mobile app event data.
- CDRs (telecom Call Detail Records).
- “Phone-home” data from large numbers of identical electronic products (for example set-top boxes).
- Sensor network output (for example from a pipeline or other utility network).
- Vehicle telemetry.
- Health care data in hospitals.
- Digital health data from consumer devices.
- Images from public-safety camera networks.
- Stock tickers (if you regard them as being machine-generated, which I do).
That’s far from a complete list, but if you think about those categories you’ll probably capture most of the issues surrounding other kinds of machine-generated data as well.
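To make a couple of those categories concrete, here’s a toy sketch of what individual records might look like. The field names are illustrative guesses on my part, not any carrier’s or vendor’s actual schema:

```python
from dataclasses import dataclass

# Illustrative record shapes only; real CDR and sensor schemas vary widely.

@dataclass
class CallDetailRecord:          # telecom CDR
    caller: str                  # originating number
    callee: str                  # terminating number
    start_time: float            # Unix epoch seconds
    duration_seconds: int
    cell_tower_id: str

@dataclass
class SensorReading:             # e.g., a pipeline pressure sensor
    sensor_id: str
    timestamp: float             # Unix epoch seconds
    pressure_psi: float

cdr = CallDetailRecord("15551230001", "15551230002", 1_400_000_000.0, 73, "TWR-0418")
reading = SensorReading("PIPE-007", 1_400_000_060.0, 642.5)
print(cdr, reading, sep="\n")
```

Note how flat and repetitive these records are; that observation is picked up in point 3 below.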
2. Technology for better information and analysis is also technology for privacy intrusion. Public awareness of privacy issues is focused on a few areas, mainly:
- Government snooping on the contents of communications.
- Communication traffic analysis.
- Photos and videos (airport scanners, public cameras, etc.).
- Commercial ad targeting.
- Traditional medical records.
Other areas, however, continue to be overlooked, with the two biggies in my opinion being:
- The potential to apply marketing-like psychographic analysis in other areas, such as hiring decisions or criminal justice.
- The ability to track people’s movements in great detail, which will increase greatly yet again if and when the market for consumer digital health matures, as some think it soon will.
3. The natural database structures for machine-generated data vary wildly. Weblog data structure is often remarkably complex. Log data from complex organizations (e.g. IT shops or hospitals) might comprise many streams, each with a different (even if individually simple) organization. But in the majority of my example categories, record structure is very simple and repeatable. Thus, there are many kinds of machine-generated data that can, at least in principle, be handled well by a relational DBMS …
4. … at least to some extent. In a further complication, much machine-generated data arrives as a kind of time series. Many (but not all) time series call for a strong commitment to event-series styles of analytics. Event series analytics are a challenge for relational DBMS, but Vertica and others have tried to step up with various kinds of temporal predicates or datatypes. Event series are also a challenge for business intelligence vendors, and a potentially significant driver for competitive rebalancing in the BI market.
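To make points 3 and 4 concrete, below is a self-contained toy using Python’s bundled sqlite3. The table, column names, and 300-second session gap are all my own illustrative choices, and the window functions require SQLite 3.25 or later (which recent Python builds bundle). It stores simple machine-generated records relationally, then does a basic event-series computation, sessionization, via the LAG window function:

```python
import sqlite3

# In-memory database; schema and names are illustrative, not any product's.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE events (
        device_id TEXT NOT NULL,
        ts        REAL NOT NULL   -- Unix epoch seconds
    )
""")
rows = [
    ("box-1", 100.0), ("box-1", 130.0), ("box-1", 1000.0),  # big gap -> new session
    ("box-2", 50.0),  ("box-2", 55.0),
]
con.executemany("INSERT INTO events VALUES (?, ?)", rows)

# Sessionize: a gap of more than 300 seconds starts a new session.
# LAG's third argument is a default for the first row of each partition.
query = """
    SELECT device_id, ts,
           SUM(is_new) OVER (PARTITION BY device_id ORDER BY ts) AS session_id
    FROM (
        SELECT device_id, ts,
               CASE WHEN ts - LAG(ts, 1, -1e9) OVER (
                        PARTITION BY device_id ORDER BY ts) > 300
                    THEN 1 ELSE 0 END AS is_new
        FROM events
    ) AS gaps
"""
for device_id, ts, session_id in con.execute(query):
    print(device_id, ts, session_id)
```

Simple event-series questions fit plain SQL like this; the harder sequence-pattern questions are where the specialized temporal support mentioned above comes in.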
5. Even apart from event series, I wish I understood more about business intelligence for non-tabular data. I plan to fix that.
6. Streaming and memory-centric processing are closely related subjects. What I wrote recently about them for Hadoop still applies: Spark, Kafka, etc. are still the base streaming case going forward; Storm is still around as an alternative; Tachyon, or something like it, will change the game somewhat. But not all streaming machine-generated data needs to land in Hadoop at all. As noted above, relational data stores (especially memory-centric ones) can suffice. So can NoSQL. So can Splunk.
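For a sense of what memory-centric stream processing boils down to, here’s a generic toy, deliberately free of Spark/Kafka/Storm specifics, that counts events in tumbling one-minute windows over a simulated feed:

```python
from collections import Counter

WINDOW_SECONDS = 60  # tumbling one-minute windows; an arbitrary choice

def windowed_counts(stream):
    """Consume (timestamp, key) events in time order; yield per-window counts.

    Everything is held in memory and emitted as each window closes, which is
    the basic memory-centric streaming pattern (real systems add partitioning,
    fault tolerance, out-of-order handling, and so on).
    """
    window_start, counts = None, Counter()
    for ts, key in stream:
        bucket = ts - (ts % WINDOW_SECONDS)
        if window_start is None:
            window_start = bucket
        elif bucket != window_start:
            yield window_start, dict(counts)
            window_start, counts = bucket, Counter()
        counts[key] += 1
    if window_start is not None:
        yield window_start, dict(counts)

# Simulated feed of (epoch_seconds, event_type) pairs.
feed = [(0, "click"), (10, "click"), (59, "view"), (61, "click"), (130, "view")]
for start, counts in windowed_counts(feed):
    print(f"window starting {start}: {counts}")
```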
Not all these considerations are important in all use cases. For one thing, latency requirements vary greatly. For example:
- High-frequency trading is an extreme race; microseconds matter.
- Internet interaction applications increasingly require data freshness through the last click or other user action. Computational latency requirements can go down to single-digit milliseconds. Real-time ad auctions have a race aspect that may drive latency lower yet.
- Minute-plus response times can be fine for individual remote systems; sometimes they ping home even less frequently than that.
There’s also still plenty of true batch mode, but — and I say this as part of a conversation that’s been underway for over 40 years — interactive computing is preferable whenever feasible.
7. My views about predictive analytics are still somewhat confused. For starters:
- The math and technology of predictive modeling both still seem pretty simple …
- … but sometimes achieve mind-blowing results even so.
- There’s a lot of recent innovation in predictive modeling, but adoption of the innovative stuff is still fairly tepid.
- Adoption of the simple stuff is strong in certain market sectors, especially ones connected to customer understanding, such as marketing or anti-fraud. (A minimal sketch of that simple stuff follows this list.)
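Concretely, here’s what the “simple stuff” can look like: a logistic-regression scorer trained via scikit-learn, loosely in the spirit of anti-fraud transaction scoring. Everything here, the features, class balance, and metric, is an illustrative assumption on synthetic data, not anything from a real deployment:

```python
# Synthetic stand-in for a simple anti-fraud scoring model; the features and
# labels are generated data, not anything real.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=10_000, n_features=20, n_informative=5,
    weights=[0.98],          # rare positive class, as with fraud
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]   # fraud-likelihood scores
print(f"test AUC: {roc_auc_score(y_test, scores):.3f}")
```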
So I’ll mainly just link to some of my past posts on the subject, and otherwise leave discussion of predictive analytics to another day.
- WibiData has some innovative ideas in predictive experimentation.
- Nutonian has some innovative ideas in non-linear modeling for pattern detection/root-cause analysis.
- It’s still at the anecdotal level, but there have been interesting ideas in the rapid retraining of models.
- Ayasdi reminded us that there’s room for innovation in clustering.
- My Thanksgiving round-up post points to a lot of my prior comments on predictive modeling.
Finally, back in 2011 I tried to broadly categorize analytics use cases. Based on that and also on some points I just raised above, I’d say that a ripe area for breakthroughs is problem and anomaly detection and diagnosis, specifically for machines and physical installations, rather than in the marketing/fraud/credit-score areas that are already going strong. That’s an old discipline; the concept of statistical process control dates back to before World War II. Perhaps breakthroughs are already underway; the Conviva retraining example linked above is certainly imaginative. But I’d like to see a lot more in the area.
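For reference, the heart of classic statistical process control fits in a few lines; here’s the familiar three-sigma control-chart rule over illustrative numbers:

```python
import statistics

def three_sigma_anomalies(baseline, observations):
    """Classic Shewhart-style control rule: flag any observation more than
    three standard deviations from the baseline mean."""
    mean = statistics.fmean(baseline)
    sigma = statistics.stdev(baseline)
    return [(i, x) for i, x in enumerate(observations)
            if abs(x - mean) > 3 * sigma]

# Illustrative numbers: a stable baseline, then a live signal with one spike.
baseline = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9]
live = [10.0, 10.1, 9.9, 13.5, 10.0]
print(three_sigma_anomalies(baseline, live))   # flags the 13.5 reading
```

The opportunity is in everything around that rule: adaptive baselines, rapid retraining, and correlation across many streams at once.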
Even more important, of course, could be some kind of revolution in predictive modeling for medicine.