I refer often to machine-generated data, which is commonly generated inexpensively and in log-like formats, and is often best aggregated in a big bit bucket before you try to do much analysis on it. The term has caught on, to the point that perhaps it’s time to distinguish more carefully among different kinds of machine-generated data. In particular, I think it may be useful to distinguish between:
- Log-stream machine-generated data, when what you’re looking at — at least initially — is the entire output of verbose logging systems.
- Remote machine-generated data.
Here’s what I’m thinking of for the second category. I rather frequently hear of cases in which data is generated by large numbers of remote machines, which occasionally send messages home. For example:
- I heard yesterday about a case with tens of millions of machines, phoning home every 5 minutes, and another case with tens of thousands calling in every 5 seconds, both of them sending data initially to MySQL. (Application details weren’t given.)
- I heard not long ago about a set-top box case that the vendor hoped would also grow to tens of millions of machines, which I guessed might send a small number of messages per hour each.
- I also heard recently about a remote security monitoring case whose data was destined for (probably) Netezza, although in that case I’m not sure about the “occasionally” aspect of the communication.
- The last time I visited Splunk, I got the sense that energy-sensor use cases (especially in the electric grid) had finally emerged. I believe these sensors are periodic message senders — they wake up, take their temperature (figuratively or literally as the case may be), send a message, snooze, repeat.
- I would guess that the energy use cases Infobright talked about in 2009 were of a similar kind.
- An April 2010 comment on the post linked above talks about many kinds of sensor data.
- Back in 2007, Coral8 talked of a truck phone-home use case (on-board GPS data and also, e.g., refrigeration level, sending messages once a minute or so). Truviso seemed to have one similar deal before one of its frequent changes in strategic direction, and not coincidentally cites UPS as an investor.
- In principle, there are a lot of RFID use cases out there, even if I rarely seem to hear of any. (That would be a shorter “phone call” home than most of the other examples, of course, but might be otherwise technically similar.)
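To get a feel for what those phone-home rates imply at the collection point, here is a back-of-the-envelope sketch. The fleet sizes are assumptions: it takes the low end of "tens of millions" (10 million) and "tens of thousands" (10,000), and it assumes arrivals are spread evenly over time, which real fleets may not manage.

```python
def messages_per_second(machines: int, interval_seconds: float) -> float:
    """Average arrival rate if every machine sends one message per interval."""
    return machines / interval_seconds


# Assumed low-end fleet sizes; the source gives only orders of magnitude.
large_fleet = messages_per_second(10_000_000, 5 * 60)  # every 5 minutes
small_fleet = messages_per_second(10_000, 5)           # every 5 seconds

print(f"large fleet: {large_fleet:,.0f} messages/second")
print(f"small fleet: {small_fleet:,.0f} messages/second")
```

Under those assumptions the first case averages roughly 33,000 messages per second and the second about 2,000, so the slower-reporting fleet is actually the heavier load.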
Many technologies can be used to collect and manage remote machine-generated data, but a few common points are worth noting.
- If a device takes the trouble to send a message across a wide-area network, that message may be somewhat more valuable than the average piece of log-vomit. Perhaps such information doesn’t need to be stored in the cheapest possible way.
- Similarly, a message that is sent occasionally over time, or upon a specified event, may be more structured than a random log entry. Perhaps such data is suitable for sending straight to a relational database.
- If there’s no central place the data originates, there may also be no favored place for the data to end up. It may make great sense to collect and analyze remote machine-generated data in the cloud. (Exceptions may of course arise if you want to use the data in connection with other information, and you hence want to bring it to that other information’s location.)
- In a number of use cases, the whole point is to identify anomalies, and respond to them rapidly. I.e., remote machine-generated data use cases commonly raise challenges in low-latency integration of short-request and analytic processing.
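As one illustration of the anomaly-detection point, here is a minimal sketch that checks each incoming message against a short rolling history for the same device. Everything in it is hypothetical: the `AnomalyDetector` class, the window size, the threshold, and the "truck-1" readings are illustrative choices, not anything described by the vendors mentioned above. A real deployment would also have to keep this state somewhere that serves both the short-request path and later analytics.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev


class AnomalyDetector:
    """Flags a reading that deviates sharply from a device's recent history."""

    def __init__(self, window: int = 20, threshold: float = 3.0, min_history: int = 5):
        self.threshold = threshold      # allowed deviation, in standard deviations
        self.min_history = min_history  # readings needed before judging anything
        # One bounded history per device; old readings fall off automatically.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, device_id: str, value: float) -> bool:
        """Record a reading and return True if it looks anomalous."""
        past = self.history[device_id]
        is_anomaly = False
        if len(past) >= self.min_history:
            mu = mean(past)
            sigma = pstdev(past)
            # A perfectly flat history (sigma == 0) is never flagged here.
            if sigma > 0 and abs(value - mu) > self.threshold * sigma:
                is_anomaly = True
        past.append(value)
        return is_anomaly


detector = AnomalyDetector()
# Hypothetical refrigeration readings from one truck, ending in a spike.
readings = [4.0, 4.1, 3.9, 4.0, 4.1, 3.9, 9.0]
flags = [detector.observe("truck-1", r) for r in readings]
print(flags)
```

Only the final reading is flagged. The check runs at message arrival time, which is the short-request side of the problem; the stored history is what an analytic query would later want to scan.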