- Much of the growth in analytic data volumes will come in the form of machine-generated data.
- Unlike human-generated data, machine-generated data will grow at Moore’s Law kinds of speeds.
- Thus, unlike human-generated data, which I advocate keeping pretty much in all its detail, machine-generated data will continue to be in large part thrown away.
Recently and somewhat belatedly, I added an obvious point — if we don’t keep all or even most of our machine-generated data, then what we keep is likely to be in some way massaged, extracted, or derived. The purpose of this post is to address a second oversight — giving a hopefully clear definition of what I actually mean by “machine-generated data.”
In classical human-generated data, what’s recorded is the direct result of human choices. Somebody buys something, makes an inquiry about it, fills an order from inventory, makes a payment in return for the object, makes a bank deposit to have funds for the next purchase, or promotes a manager who’s been particularly successful at selling stuff. Database updates ensue. Computers memorialize these human actions more quickly and cheaply than humans carry them out. Plenty of difficulties can occur with that kind of automation — applications are commonly too inflexible or confusing — but keeping up with data volumes is generally the least of the problems.
To a first approximation, machine-generated data is data that is not human-generated. I.e.,
Provisional definition: Machine-generated data is data that was produced entirely by machines OR data that is more about observing humans than recording their choices.
(That is definitely an inclusive OR.) Suggestions for slicker wording will be gratefully received — but in making them, please try not to run afoul of Monash’s First Law of Commercial Semantics.
Let’s elucidate this definition by means of examples. Some cases of machine-generated data are fairly straightforward. Two of the posts linked above feature the list:
- Computer, network, and other equipment logs
- Satellite and similar telemetry (whether for espionage or science)
- Location data such as RFID chip readings, GPS system output, etc.
- Temperature and other environmental sensor readings
- Sensor readings from factories, pipelines, etc.
- Output from many kinds of medical device, in hospitals and (increasingly) homes alike
Only the first of those items is problematic. Otherwise, these are essentially cases of machine data all the way down.
So let’s consider some of the leading hybrid cases. Web logs mix together a wide variety of data, including:
- Things the user types in.
- Clicks the user makes.
- Other indicators of the user’s attention.
- Records of what was on the page when the user made these choices.
- Large amounts of purely technical web server and network information.
Parsing these into reliable records of human activity — e.g. event extraction or sessionization — is an important computational task, and a precursor to almost any kind of analysis. Thus, raw records of human choices aren’t the essence of the database. Also, the network log part is typically at least 5x the size of the pure web log. Putting that together, I’d say the whole thing feels largely like a machine-generated data challenge, but admittedly it’s in a bit of a gray area.
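To make the sessionization idea concrete, here is a minimal sketch. It assumes a simple inactivity-timeout rule — a new session starts whenever a user has been idle for more than 30 minutes, a common convention rather than anything from this post — and toy `(user, timestamp)` events standing in for parsed log lines:

```python
from datetime import datetime, timedelta

# Assumed convention: a gap of more than 30 minutes starts a new session.
SESSION_TIMEOUT = timedelta(minutes=30)

def sessionize(events):
    """Group (user_id, timestamp) events into per-user sessions.

    A user's event stream is split into a new session whenever the gap
    since that user's previous event exceeds SESSION_TIMEOUT.
    Returns a dict: user_id -> list of sessions (each a list of timestamps).
    """
    sessions = {}
    last_seen = {}
    for user, ts in sorted(events, key=lambda e: e[1]):  # process in time order
        if user not in last_seen or ts - last_seen[user] > SESSION_TIMEOUT:
            sessions.setdefault(user, []).append([])  # open a fresh session
        sessions[user][-1].append(ts)
        last_seen[user] = ts
    return sessions

# Hypothetical events, standing in for records extracted from a web log.
events = [
    ("alice", datetime(2011, 1, 1, 9, 0)),
    ("alice", datetime(2011, 1, 1, 9, 10)),
    ("alice", datetime(2011, 1, 1, 11, 0)),  # >30-minute gap: new session
    ("bob",   datetime(2011, 1, 1, 9, 5)),
]
result = sessionize(events)
```

Real pipelines do far more (user identification across IPs and cookies, bot filtering, event extraction from URLs), but the timeout split above is the core of the transformation from raw log lines into records of human activity.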
Call detail records (CDRs) initially feel machine-generated too, but it may be a bit misleading to view them as such. Half a kilobyte of data (a typical length) for a several-minute human activity is not a whole lot. Obviously, if lots of network routing data gets attached — or if some intelligence agency parses the call’s contents — it could be a different matter. But for now I’m inclined to leave CDRs, along with, say, in-store point-of-sale (POS) data, as a category of particularly large human-generated data sets.
Social media and gaming records seem more like weblogs than CDRs — products of human choices so casual that they might as well be machine-generated. Obviously I’m not referring to WordPress authoring here, but rather to users who click or tap through a dizzying array of choices at ever higher speeds, with ever more log-style data created as byproducts of every user action.
And finally, there’s a different kind of edge case. Many stock trades are human-generated in the usual way. Even so, trade volume these days is dominated either by purely algorithmic trades, or else trades in which an algorithm turns one human decision into a flurry of individual trades. So I think stock trades can be fairly counted as machine-generated data. But I may reverse my opinion if rate-limiting regulations serve to limit or reduce their algorithmic aspect.
If you’ve noticed ways in which my definition of “machine-generated data” is less than ideal, please be so kind as to recall one thing — no product category definition can ever be perfect.