- Much of the growth in analytic data volumes will come in the form of machine-generated data.
- Unlike human-generated data, machine-generated data will grow at Moore’s Law kinds of speeds.
- Thus, unlike human-generated data, which I advocate keeping pretty much in all its detail, machine-generated data will continue to be in large part thrown away.
Recently and somewhat belatedly, I added an obvious point — if we don’t keep all or even most of our machine-generated data, then what we keep is likely to be in some way massaged, extracted, or derived. The purpose of this post is to address a second oversight — giving a hopefully clear definition of what I actually mean by “machine-generated data.”
In classical human-generated data, what’s recorded is the direct result of human choices. Somebody buys something, makes an inquiry about it, fills an order from inventory, makes a payment in return for the object, makes a bank deposit to have funds for the next purchase, or promotes a manager who’s been particularly successful at selling stuff. Database updates ensue. Computers memorialize these human actions more quickly and cheaply than humans carry them out. Plenty of difficulties can occur with that kind of automation — applications are commonly too inflexible or confusing — but keeping up with data volumes is generally the least of the problems.
To a first approximation, machine-generated data is data that is not human-generated. I.e.,
Provisional definition: Machine-generated data is data that was produced entirely by machines OR data that is more about observing humans than recording their choices.
(That is definitely an inclusive OR.) Suggestions for slicker wording will be gratefully received — but in making them, please try not to run afoul of Monash’s First Law of Commercial Semantics.
Let’s elucidate this definition by means of examples. Some cases of machine-generated data are fairly straightforward. Two of the posts linked above feature the list:
- Computer, network, and other equipment logs
- Satellite and similar telemetry (whether for espionage or science)
- Location data such as RFID chip readings, GPS system output, etc.
- Temperature and other environmental sensor readings
- Sensor readings from factories, pipelines, etc.
- Output from many kinds of medical device, in hospitals and (increasingly) homes alike
Only the first of those items is problematic. Otherwise, these are essentially cases of machine data all the way down.
So let’s consider some of the leading hybrid cases. Web logs mix together a wide variety of data, including:
- Things the user types in.
- Clicks the user makes.
- Other indicators of the user’s attention.
- Records of what was on the page when the user made these choices.
- Large amounts of purely technical web server and network information.
Parsing these into reliable records of human activity — e.g. event extraction or sessionization — is an important computational task, and a precursor to almost any kind of analysis. Thus, raw records of human choices aren’t the essence of the database. Also, the network log part is typically at least 5x the size of the pure web log. Putting that together, I’d say the whole thing feels largely like a machine-generated data challenge, but admittedly it’s in a bit of a gray area.
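To make the sessionization idea concrete, here is a minimal sketch. It assumes a simple inactivity-timeout rule — a new session starts whenever a user has been idle for more than 30 minutes, a common convention rather than anything from this post — and toy `(user, timestamp)` events standing in for parsed log lines:

```python
from datetime import datetime, timedelta

# Assumed convention: a gap of more than 30 minutes starts a new session.
SESSION_TIMEOUT = timedelta(minutes=30)

def sessionize(events):
    """Group (user_id, timestamp) events into per-user sessions.

    A user's event stream is split into a new session whenever the gap
    since that user's previous event exceeds SESSION_TIMEOUT.
    Returns a dict: user_id -> list of sessions (each a list of timestamps).
    """
    sessions = {}
    last_seen = {}
    for user, ts in sorted(events, key=lambda e: e[1]):  # process in time order
        if user not in last_seen or ts - last_seen[user] > SESSION_TIMEOUT:
            sessions.setdefault(user, []).append([])  # open a fresh session
        sessions[user][-1].append(ts)
        last_seen[user] = ts
    return sessions

# Hypothetical events, standing in for records extracted from a web log.
events = [
    ("alice", datetime(2011, 1, 1, 9, 0)),
    ("alice", datetime(2011, 1, 1, 9, 10)),
    ("alice", datetime(2011, 1, 1, 11, 0)),  # >30-minute gap: new session
    ("bob",   datetime(2011, 1, 1, 9, 5)),
]
result = sessionize(events)
```

Real pipelines do far more (user identification across IPs and cookies, bot filtering, event extraction from URLs), but the timeout split above is the core of the transformation from raw log lines into records of human activity.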
Call detail records (CDRs) initially feel machine-generated too, but it may be a bit misleading to view them as such. Half a kilobyte of data (a typical length) for a several-minute human activity is not a whole lot. Obviously, if lots of network routing data gets attached — or if some intelligence agency parses the call’s contents — it could be a different matter. But for now I’m inclined to leave CDRs, along with, say, in-store point-of-sale (POS) data, as a category of particularly large human-generated data sets.
Social media and gaming records seem more like weblogs than CDRs — products of human choices so casual that they might as well be machine-generated. Obviously I’m not referring to WordPress authoring here, but rather to users who click or tap through a dizzying array of choices at ever higher speeds, with ever more log-style data created as byproducts of every user action.
And finally, there’s a different kind of edge case. Many stock trades are human-generated in the usual way. Even so, trade volume these days is dominated either by purely algorithmic trades, or else trades in which an algorithm turns one human decision into a flurry of individual trades. So I think stock trades can be fairly counted as machine-generated data. But I may reverse my opinion if rate-limiting regulations serve to limit or reduce their algorithmic aspect.
If you’ve noticed ways in which my definition of “machine-generated data” is less than ideal, please be so kind as to recall one thing — no product category definition can ever be perfect.