December 31, 2014

Notes on machine-generated data, year-end 2014

Most IT innovation these days is focused on machine-generated data (sometimes just called “machine data”), rather than human-generated. So as I find myself in the mood for another survey post, I can’t think of any better idea for a unifying theme.

1. There are many kinds of machine-generated data. Important categories include:

That’s far from a complete list, but if you think about those categories you’ll probably capture most of the issues surrounding other kinds of machine-generated data as well.

2. Technology for better information and analysis is also technology for privacy intrusion. Public awareness of privacy issues is focused in a few areas, mainly:

Other areas, however, continue to be overlooked, with the two biggies in my opinion being:

My core arguments about privacy and surveillance seem as valid as ever.

3. The natural database structures for machine-generated data vary wildly. Weblog data structure is often remarkably complex. Log data from complex organizations (e.g. IT shops or hospitals) might comprise many streams, each with a different (even if individually simple) organization. But in the majority of my example categories, record structure is very simple and repeatable. Thus, there are many kinds of machine-generated data that can, at least in principle, be handled well by a relational DBMS …

4. … at least to some extent. In a further complication, much machine-generated data arrives as a kind of time series. Many (but not all) time series call for a strong commitment to event-series styles of analytics. Event series analytics are a challenge for relational DBMS, but Vertica and others have tried to step up with various kinds of temporal predicates or datatypes. Event series are also a challenge for business intelligence vendors, and a potentially significant driver for competitive rebalancing in the BI market.
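To make the difficulty concrete: a typical event-series operation is grouping events into sessions separated by idle gaps. Session membership depends on the ordering of neighboring rows rather than on values within a single row, which is why plain relational operations handle it awkwardly. What follows is a minimal sketch in Python, not any vendor's implementation; the function name `sessionize` and the 30-minute threshold are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

def sessionize(timestamps, max_gap=timedelta(minutes=30)):
    """Group timestamps into sessions; a new session starts whenever the gap
    since the previous event exceeds max_gap (an assumed threshold)."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= max_gap:
            # Close enough to the previous event: same session.
            sessions[-1].append(ts)
        else:
            # First event, or the gap was too long: start a new session.
            sessions.append([ts])
    return sessions

# Hypothetical event timestamps for a single user or device.
events = [
    datetime(2014, 12, 31, 9, 0),
    datetime(2014, 12, 31, 9, 10),
    datetime(2014, 12, 31, 11, 0),
    datetime(2014, 12, 31, 11, 5),
    datetime(2014, 12, 31, 11, 20),
]
sessions = sessionize(events)
print([len(s) for s in sessions])
```

The point of the sketch is only that the grouping key (the session) is derived from the sequence itself, which is the kind of logic that temporal predicates and event-series extensions try to bring into the DBMS.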

5. Event series even aside, I wish I understood more about business intelligence for non-tabular data. I plan to fix that.

6. Streaming and memory-centric processing are closely related subjects. What I wrote recently about them for Hadoop still applies: Spark, Kafka, etc. are still the base streaming case going forward; Storm is still around as an alternative; Tachyon or something like it will change the game somewhat. But not all streaming machine-generated data needs to land in Hadoop at all. As noted above, relational data stores (especially memory-centric ones) can suffice. So can NoSQL. So can Splunk.

Not all these considerations are important in all use cases. For one thing, latency requirements vary greatly. For example:

There’s also still plenty of true batch mode, but — and I say this as part of a conversation that’s been underway for over 40 years — interactive computing is preferable whenever feasible.

7. My views about predictive analytics are still somewhat confused. For starters:

So I’ll mainly just link to some of my past posts on the subject, and otherwise leave discussion of predictive analytics to another day.

Finally, back in 2011 I tried to broadly categorize analytics use cases. Based on that and also on some points I just raised above, I’d say that a ripe area for breakthroughs is problem and anomaly detection and diagnosis, specifically for machines and physical installations, rather than in the marketing/fraud/credit score areas that are already going strong. That’s an old discipline; the concept of statistical process control dates back before World War II. Perhaps such breakthroughs are already underway; the Conviva retraining example linked above is certainly imaginative. But I’d like to see a lot more in the area.
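For readers unfamiliar with statistical process control, the basic idea can be sketched in a few lines: estimate a center line and spread from readings taken while the process is known to be healthy, then flag later readings that fall outside control limits. The sketch below assumes the classic "mean plus or minus three standard deviations" rule; the function names and the sensor readings are made up for illustration.

```python
from statistics import mean, stdev

def control_limits(baseline, k=3.0):
    """Return (lower, upper) control limits: baseline mean +/- k sample
    standard deviations. k=3 is the conventional choice, assumed here."""
    center = mean(baseline)
    spread = stdev(baseline)
    return center - k * spread, center + k * spread

def out_of_control(readings, limits):
    """Return (index, value) pairs for readings outside the limits."""
    lower, upper = limits
    return [(i, x) for i, x in enumerate(readings) if x < lower or x > upper]

# Hypothetical sensor readings from a period of normal operation.
baseline = [20.1, 19.9, 20.0, 20.2, 19.8, 20.0, 20.1, 19.9]
limits = control_limits(baseline)

# Hypothetical new readings to check against those limits.
new_readings = [20.0, 20.1, 23.5, 19.9]
flagged = out_of_control(new_readings, limits)
print(flagged)
```

Detection of this kind is the easy part. Diagnosis, i.e. working out why a machine drifted out of control, is where more sophisticated analytics would have to come in.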

Even more important, of course, could be some kind of revolution in predictive modeling for medicine.


11 Responses to “Notes on machine-generated data, year-end 2014”

  1. clive boulton on January 1st, 2015 5:31 am

    On business intelligence for non-tabular data: Josh Patterson’s grip on transforming time series data is like Don Knuth’s grip on algorithms.

  2. David Gruzman on January 1st, 2015 7:44 am

    I would add two more Hadoop alternatives where machine-generated data can be stored:
    1. Amazon S3, as a place where data is very cheap to store yet still has some access capability.
    2. Elastic Search, which can be viewed as a Splunk alternative.

  3. Verslo Valdymo Sistemos on January 2nd, 2015 3:22 am


    It is a very educational article. It gave me a lot of new and interesting information and thoughts.
    Thank you.

    Best regards,

    Verslo Valdymo Sistemos

  4. Asaf Birenzvieg on January 4th, 2015 7:43 am

    What would you say are the most mature solutions today for pattern detection/root-cause analysis, especially for machine-generated data from IT logs and clickstreams?

  5. Growth in machine-generated data | DBMS 2 : DataBase Management System Services on January 30th, 2015 2:32 pm

    […] recent survey of machine-generated data topics started with a list of many different kinds of the […]

  6. Data models | DBMS 2 : DataBase Management System Services on February 22nd, 2015 10:08 pm

    […] Much of the action has shifted to machine-generated data, of which there are many kinds. […]

  7. Health care tech | DBMS 2 : DataBase Management System Services on May 26th, 2015 1:02 am

    […] amounts of machine-generated data, […]

  8. league of legends bot changes on June 8th, 2015 10:36 pm

    league of legends bot changes

    Notes on machine-generated data, year-end 2014 | DBMS 2 : DataBase Management System Services

  9. Soft robots, Part 2 — implications | DBMS 2 : DataBase Management System Services on June 23rd, 2015 6:07 am

    […] One other point about data flows — suppose you have two kinds of machines that can do a task, one of which is flexible, the other rigid. The flexible one will naturally have much more variance in what happens from one instance of the task to the next one. That’s just another way in which soft robots will induce greater quantities of machine-generated data. […]

  10. Enterprise application software — vertical and departmental markets | Software Memories on November 11th, 2015 5:30 am

    […] I think this group could become much more important in the age of machine-generated data. […]

  11. Oracle as the new IBM — has a long decline started? | DBMS 2 : DataBase Management System Services on January 2nd, 2016 6:18 am

    […] interactions (here I’m drawing the (trans)action/interaction distinction) or even more purely machine-generated data (“Internet of Things”). The Oracle RDBMS has few advantages in those […]
