April 8, 2010

Examples of machine-generated data

Not long ago I pointed out that much future Big Data growth will be in the area of machine-generated data, examples of which include:

Computer, network, and other equipment logs
Satellite and similar telemetry (whether for espionage or science)
Location data such as RFID chip readings, GPS system output, etc.
Temperature and other environmental sensor readings
Sensor readings from factories, pipelines, etc.
Output from many kinds of medical device, in hospitals and (increasingly) homes alike

The core idea here is that human-generated data can grow only as fast as human data-generating activities allow it to, but machine-generated data is limited only by capital budgets and Moore’s Law. So machines’ ability to generate data is growing a lot faster than humans’.

Up to this point, I think there’s broad agreement, at least on the part of anybody who’s thought about it this way for very long. But that still leaves open questions as to which kinds of machine-generated data will matter first. The big five that matter right now are:

Web logs (partially machine-generated, but tied to human actions)
Call detail records (CDRs — ditto)
Financial instrument trades (some purely machine-generated, some human-based)
Network event logs (commonly associated with web logs)
Telemetry collected by the government (especially for intelligence purposes)

A large fraction of all the 100 TB+ or petabyte+ data warehouse activity I know of falls into those areas.

Following along quickly are:

Online game data. Since late last year, online game companies have come up over and over again as an important category of data warehousing/analytics users. Like most of the categories above, the gaming area actually features a hybrid between human- and machine-generated data.
Genetic research data, although I don’t know to what extent the investment in data gathering is concentrated among the few obvious big pharmaceutical companies. Other health care data (research or clinical) will come along too, but doesn’t seem to be there yet.

Until recently I would have added:

Energy exploration, energy production, energy refining, and/or utility network data

But while those areas seemed poised to get hot last year, I haven’t heard much about them the past few months, with a few exceptions:

Accenture’s observation that new smart grids will generate up to eight orders of magnitude more data than old dumb grids do
The recent article about the Terralliance fiasco (new kinds of oil exploration analytics, going beyond seismological data)
Lots of concern about security flaws in utility smart grids.

Finally, I’ve been assuming that a big area going forward is location data, especially personal movement data. The data volumes involved could be similar to or even greater than those of CDRs. But privacy concerns with that are obviously immense. (Of course, in the case of Foursquare, this sort of overlaps with freely-shared game data.)

If you want to make all this more tangible in your mind, one area to look for ideas is in the huge amount of news about various kinds of innovative sensors. Sources include:

Somebody named Landon Cox, who maintains a couple feeds of sensor news.
A Twitter feed, apparently associated with a Sensor Expo.
Another Twitter feed, this one from Sun Labs. (I have no idea what Oracle is or isn’t doing with the Sun SPOT project that links to.)
Yet another Twitter feed.

Categories: Analytic technologies, Data warehousing, Games and virtual worlds, Investment research and trading, Log analysis, Oracle, Telecommunications, Web analytics

Subscribe to our complete feed!

Comments

28 Responses to “Examples of machine-generated data”

J. Andrew Rogers on April 8th, 2010 7:32 pm

Machine-generated text-and-number data is widely used and collected en masse but sensor data has additional problems.

I’ve worked with a number of non-military sensor-based data sources, both terrestrial and satellite. Tens of petabytes of working set and tens of terabytes of new ingest per day for single sources is becoming normal. Managing this data is a serious problem since conventional spatio-temporal indexing of the type that would often be useful doesn’t scale; big file systems and something only slightly better than brute-force search are the norm. If you add in latent network-connected data sources that have secondary value outside of their primary non-aggregated purpose, I would guess new sensor data being systematically generated annually by machines is in the low exabyte range and growing fast, most of which is discarded.

In my own conversations with organizations that have valuable latent sensor data sets, there is a lot of interest in aggregating and analyzing this data but no cost-effective way of doing so. These are the myriad machine-generated data sources that no one ever hears about. Software like Hadoop doesn’t solve this particular problem very well yet. Once the benefit exceeds the cost of dealing with this data and turning it into an analyzable form I expect we will “discover” sensor-based data sets popping up like weeds as organizations attempt to monetize them.
Curt Monash on April 8th, 2010 9:58 pm

Andrew,

Would SciDB address some of the issues you’re talking about?
J. Andrew Rogers on April 9th, 2010 1:46 am

SciDB looks like a big improvement, it would appear to provide a pretty rich toolset for basic data handling. At a minimum, it looks like it can do most of the first-pass processing of the input data.

The single most common bottleneck (in my past experience) beside raw computational power was indexing the features pulled out of the raw data. R-tree-like data structures at that scale have really poor ingest and update rate for the dynamic polygonal features that were the usable end product. It doesn’t look like this is addressed. Point clouds scale out very nicely for basic read-write but are compute intensive for contextual/analytical queries relative to polygons and there is a modest practical upper bound on the number of records for conventional polygon handling. It is a tradeoff.
Curt Monash on April 9th, 2010 3:53 am

Andrew,

I’m not sure that there’s any reasonable solution to polygons other than “Scan a lot of data and do what you have to do.” I.e., while R-trees and the like don’t sufficiently help w/ polygon-style analysis, they also don’t particularly get in way. The same would likely be true of SciDB.

One thing SciDB is designed for is to store “cooked” post-processed results in line w/ the raw data. That could actually help with the kind of problems you’re raising.

At least in theory …
Jim Saunders on April 9th, 2010 7:18 am

The data issue we are having is related to this as well. Looking to expand a current 4 TB Oracle RAC DB hosting various aviation data. Growth rates of 50/100 ? TB year of radar, GPS, terminal, weather, and other data feeds. Our analysts want to be able to take any flight and perform end to end analysis, model all airspace — whatever else they can think of. We have a small hadoop cluster as a test to see if this can meet our needs. Right now we are willing to experiment with almost any software to find a solution as the current Oracle just is not performing as the customer expects.
Georges Bory on April 22nd, 2010 12:29 pm

We all knew the time would come when computing power would generate more data than can be physically handled by humans. Working in the treasury and capital markets world, it’s no surprise to see that the data generated by financial instrument trades makes your top 5 of machine-generated data that ‘matters’. However, it’s worth noting that computer-generated data comes not only from algo trading but also from the production of simulation scenarios, which are produced by large grid computing infrastructures, so traders can compute Value at Risk, Sensitivities and Counterparty Risk. Simulations of this nature will be relied upon more and more as financial institutions desperately try to avoid a repetition of the recent events in the financial markets. Ultimately, the challenge lies not only in managing this explosion of computer-generated data but also in having the ability to effectively analyse and act on it.
Curt Monash on April 22nd, 2010 2:15 pm

George,

To what extent are those simulations ever STORED?

I understand that simulations are huge in various parts of high-performance computing, but I’ve never gotten more than throwaway comments helping me understand whether actually storing the output of the simulations is a big deal.

Thanks,

CAM
Georges Bory on April 23rd, 2010 12:54 pm

Hi Curt,

It is true that most of the data generated during simulation process for computing indicators such as Value at Risk (VaR), Potential Exposure etc used to be thrown away once key statistics where computed ( loss quartile, closed form VaR, MPE etc).

However, there is now, after the financial crisis, a requirement ( business and regulatory) to go beyond a single number to understand and manage risk. Furthermore, one needs to monitor changes in risk, and be able able to look back in the past, drill thru down to individual scenarios or trades. For all those reasons, the results of those simulations are now stored and need to be analyzed. You might want to look at this benchmark, it describes a realistic volume scenario for an international bank.

http://www.quartetfs.com/benchmark.php

I hope this helps,

Georges Bory
Accelerometer Pumpkin Launching on May 24th, 2010 5:51 pm

[…] Examples of machine-generated data | DBMS2 — DataBase Management System Services […]
Warren Davidson on June 14th, 2010 10:51 am

We’ve been involved with numerous military programs collects large volumes of sensor data, and the volumes are certainly increasing all the time. The big issue for some of these folks is the big gains in hardware to collect higher quality data.
Collecting large amounts of data for our customers probably isn’t as big a problem as to how to figure out what’s important (or how to make decisions on the data you get). What is increasingly important is how to fuse multiple sources of data so you can increase the value of that data. Petabytes of data is no longer an issue for many data stores, its the complexity of the data that will be the real challenge over the next few years.
Big Data is Watching You! | DBMS2 -- DataBase Management System Services on August 11th, 2010 8:57 am

[…] may change some day, with the growing importance of machine-generated data, and of big-data science in particular. But I think it’s a fair […]
Big Data is Watching You – The Intelligent Enterprise Blog on September 20th, 2010 4:54 am

[…] those four categories is the first one. That may change some day, with the growing importance of machine-generated data, and of big-data science in particular. But I think it’s a fair assessment at the present, […]
Evolving definitions and technology categories for 2011 | DBMS 2 : DataBase Management System Services on December 31st, 2010 3:53 am

[…] Not many terms I coin gets marketing traction, but machine-generated data has grown some legs. Clients (Infobright, Cloudera) and non-clients alike have adopted it. I need […]
Examples and definition of machine-generated data | DBMS 2 : DataBase Management System Services on March 1st, 2011 2:48 am

[…] posts made last December, January, and April, I […]
promotionvoucher.co.uk on April 15th, 2015 5:06 am

promotionvoucher.co.uk

Examples of machine-generated data | DBMS 2 : DataBase Management System Services
ใครสร้าง Big Data? มนุษย์ หรือ Machine - Data Science Thailand on June 2nd, 2016 7:27 pm

[…] ข้อมูลอื่นๆที่น่าสนใจ เช่น online game data, online research data (reference) […]
Data generation – An idea (Part 1) | sidharthabardoloye on November 30th, 2017 9:15 am

[…] Example: http://www.dbms2.com/2010/04/08/machine-generated-data-example/ […]
said vs stated on July 4th, 2022 9:04 am

Thanks for finally talking about > Examples of machine-generated data | DBMS 2 :
DataBase Management System Services said vs stated
recommends a link on February 8th, 2023 2:08 pm

This is the perfect blog for everyone who hopes to find out
about this topic. You understand recommends a link whole lot its almost tough to
argue with you (not that I really will need to…HaHa).
You certainly put a fresh spin on a subject that has
been written about for many years. Excellent stuff, just excellent!
http://theodoreutunas.mee.nu/?entry=3448885
watches shoes on March 1st, 2023 5:21 pm

watches shoes

blog topic
dedicated server protection on March 12th, 2023 7:45 pm

dedicated server protection

Examples of machine-generated data | DBMS 2 : DataBase Management System Services
Prince Wickham on August 3rd, 2023 1:40 pm

You said it adequately..

my webpage https://radio4000.com/stephenhendel
Clark Dundas on August 21st, 2023 4:07 pm

Really a lot of very good tips.
judi slot deposit Pulsa tanpa potongan on September 13th, 2023 3:05 am

Finally, the as soon as-chiseled Fitbit Flex has begun to how its age subsequent to its younger,
more agile brother, the Fitbit Drive.

Alsoo vieit my website :: judi slot deposit Pulsa tanpa potongan
Rudra Elevators Raipur on October 9th, 2023 12:14 am

Why users still use to read news papers when in this technological globe all
is available on net?
doodle jump on March 6th, 2024 11:49 pm

First off I would like to say awesome blog! I had a quick question that I’d like to ask if you do not mind.

I was curious to know how you center yourself and clear your mind before
writing. I’ve had a hard time clearing my mind in getting my ideas out there.

I truly do take pleasure in writing however it just seems
like the first 10 to 15 minutes are wasted just trying to figure out how to begin. Any ideas or hints?
Kudos!
driving directions on October 31st, 2024 5:39 am

I have the utmost confidence that you are well aware of the huge impact that you have had on both my personal life and my professional life. Regarding you, I have a great deal of admiration, and I am thankful for the fact that such admiration exists.
Kristan on January 25th, 2025 3:42 am

The Townshend Revenue Act have been two tax legal guidelines passed by Parliament in 1767; they had been proposed by Charles Townshend, Chancellor of the Exchequer.

My web site: https://Lil.so/AaChR

Leave a Reply

Search our blogs and white papers

Monash Research blogs

DBMS 2 covers database management, analytics, and related technologies.
Text Technologies covers text mining, search, and social software.
Strategic Messaging analyzes marketing and messaging strategy.
The Monash Report examines technology and public policy issues.
Software Memories recounts the history of the software industry.

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.

Links
- Monash Research
- White Papers
Admin
- Log in

Examples of machine-generated data

Comments

Search our blogs and white papers

Monash Research blogs

User consulting

Vendor advisory

Monash Research highlights

Recent posts

Categories

Date archives

Admin