Not long ago I pointed out that much future Big Data growth will be in the area of machine-generated data, examples of which include:
- Computer, network, and other equipment logs
- Satellite and similar telemetry (whether for espionage or science)
- Location data such as RFID chip readings, GPS output, etc.
- Temperature and other environmental sensor readings
- Sensor readings from factories, pipelines, etc.
- Output from many kinds of medical devices, in hospitals and (increasingly) homes alike
The core idea here is that human-generated data can grow only as fast as human data-generating activities allow it to, but machine-generated data is limited only by capital budgets and Moore’s Law. So machines’ ability to generate data is growing a lot faster than humans’.
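To put rough numbers on that argument, here is a minimal sketch comparing the two growth patterns. The figures are illustrative assumptions, not measurements: human-generated data is assumed to grow 5% a year, and machine-generated data is assumed to track the usual reading of Moore's Law as a doubling every two years. The names `project`, `HUMAN_ANNUAL_FACTOR`, and `MACHINE_ANNUAL_FACTOR` are hypothetical.

```python
def project(start, annual_factor, years):
    """Return the projected volume at the end of each year, compounding annually."""
    return [start * annual_factor ** year for year in range(years + 1)]


# Assumption: human data-generating activity grows about 5% per year.
HUMAN_ANNUAL_FACTOR = 1.05
# Assumption: machine output doubles every two years, i.e. grows by sqrt(2) per year.
MACHINE_ANNUAL_FACTOR = 2 ** 0.5

YEARS = 10
human = project(1.0, HUMAN_ANNUAL_FACTOR, YEARS)
machine = project(1.0, MACHINE_ANNUAL_FACTOR, YEARS)

# Print relative volumes (year 0 = 1.0) every second year.
for year in range(0, YEARS + 1, 2):
    print(f"year {year:2d}: human x{human[year]:.2f}, machine x{machine[year]:.2f}")
```

Under these assumed rates, a decade takes the human-generated series to roughly 1.6 times its starting size and the machine-generated series to about 32 times. The exact rates matter much less than the shape of the gap.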
Up to this point, I think there’s broad agreement, at least on the part of anybody who’s thought about it this way for very long. But that still leaves open questions as to which kinds of machine-generated data will matter first. The big five that matter right now are:
- Web logs (partially machine-generated, but tied to human actions)
- Call detail records (CDRs — ditto)
- Financial instrument trades (some purely machine-generated, some human-based)
- Network event logs (commonly associated with web logs)
- Telemetry collected by the government (especially for intelligence purposes)
A large fraction of all the 100 TB+ or petabyte+ data warehouse activity I know of falls into those areas.
Following close behind are:
- Online game data. Since late last year, online game companies have come up over and over again as an important category of data warehousing/analytics users. Like most of the categories above, gaming actually features a hybrid of human- and machine-generated data.
- Genetic research data, although I don’t know to what extent the investment in data gathering is concentrated among the few obvious big pharmaceutical companies. Other health care data (research or clinical) will come along too, but doesn’t seem to be there yet.
Until recently I would have added:
- Energy exploration, energy production, energy refining, and/or utility network data
But while those areas seemed poised to get hot last year, I haven’t heard much about them in the past few months, with a few exceptions:
- Accenture’s observation that new smart grids will generate up to eight orders of magnitude more data than old dumb grids do
- The recent article about the Terralliance fiasco (new kinds of oil exploration analytics, going beyond seismological data)
- Lots of concern about security flaws in utility smart grids
Finally, I’ve been assuming that a big area going forward is location data, especially personal movement data. The data volumes involved could be similar to or even greater than those of CDRs. But the privacy concerns there are obviously immense. (Of course, in the case of Foursquare, this sort of overlaps with freely-shared game data.)
If you want to make all this more tangible in your mind, one area to look for ideas is in the huge amount of news about various kinds of innovative sensors. Sources include:
- Somebody named Landon Cox, who maintains a couple of feeds of sensor news.
- A Twitter feed, apparently associated with a Sensor Expo.
- Another Twitter feed, this one from Sun Labs. (I have no idea what Oracle is or isn’t doing with the Sun SPOT project that the feed links to.)
- Yet another Twitter feed.