April 8, 2010

Examples of machine-generated data

Not long ago I pointed out that much future Big Data growth will be in the area of machine-generated data, examples of which include:

The core idea here is that human-generated data can grow only as fast as human data-generating activities allow it to, but machine-generated data is limited only by capital budgets and Moore’s Law.  So machines’ ability to generate data is growing a lot faster than humans’.
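
To make that concrete, here is a back-of-envelope sketch; the fleet size, sampling rate, and record size are purely illustrative assumptions, not figures from anyone's deployment:

    # Back-of-envelope sketch: daily output of a hypothetical sensor fleet.
    # All numbers are illustrative assumptions, not measured figures.
    sensors = 10_000            # devices in the field
    readings_per_second = 10    # sampling rate per device
    bytes_per_reading = 100     # timestamp + device id + measurements

    bytes_per_day = sensors * readings_per_second * bytes_per_reading * 86_400
    print(f"{bytes_per_day / 1e12:.2f} TB/day")   # ~0.86 TB/day from one modest fleet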

Up to this point, I think there’s broad agreement, at least on the part of anybody who’s thought about it this way for very long. But that still leaves open questions as to which kinds of machine-generated data will matter first. The big five that matter right now are:

A large fraction of all the 100 TB+ or petabyte+ data warehouse activity I know of falls into those areas.

Following along quickly are:

Until recently I would have added:

But while those areas seemed poised to get hot last year, I haven’t heard much about them the past few months, with a few exceptions:

Finally, I’ve been assuming that a big area going forward is location data, especially personal movement data. The data volumes involved could be similar to or even greater than those of call detail records (CDRs). But the privacy concerns there are obviously immense. (Of course, in the case of Foursquare, this sort of overlaps with freely-shared game data.)

If you want to make all this more tangible in your mind, one area to look for ideas is in the huge amount of news about various kinds of innovative sensors. Sources include:

Comments

14 Responses to “Examples of machine-generated data”

  1. J. Andrew Rogers on April 8th, 2010 7:32 pm

    Machine-generated text-and-number data is widely used and collected en masse, but sensor data has additional problems.

    I’ve worked with a number of non-military sensor-based data sources, both terrestrial and satellite. Tens of petabytes of working set, and tens of terabytes of new ingest per day, for single sources are becoming normal. Managing this data is a serious problem, since conventional spatio-temporal indexing of the type that would often be useful doesn’t scale; big file systems and something only slightly better than brute-force search are the norm. If you add in latent network-connected data sources that have secondary value outside of their primary non-aggregated purpose, I would guess the new sensor data systematically generated by machines each year is in the low exabyte range and growing fast, most of which is discarded.
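
    As a rough illustration of the “slightly better than brute force” pattern (a generic sketch in Python, not the actual systems described; the record layout, cell size, and time window are assumptions), coarse space-time bucketing lets a query skip most of the archive without maintaining a full spatio-temporal index:

        # Minimal sketch: coarse space-time buckets, so a query scans only the
        # buckets that could match instead of the whole archive.
        # Record layout, cell size, and time window are illustrative assumptions.
        from collections import defaultdict

        CELL_DEG = 1.0        # spatial cell size in degrees
        WINDOW_SEC = 3600     # temporal bucket size in seconds

        def bucket_key(lat, lon, ts):
            return (int(lat // CELL_DEG), int(lon // CELL_DEG), int(ts // WINDOW_SEC))

        buckets = defaultdict(list)

        def ingest(record):                    # record = (lat, lon, ts, payload)
            lat, lon, ts, _ = record
            buckets[bucket_key(lat, lon, ts)].append(record)

        def query(lat_range, lon_range, t_range):
            # Enumerate candidate buckets, then brute-force filter within them.
            hits = []
            for (la, lo, tb), recs in buckets.items():
                if (lat_range[0] // CELL_DEG <= la <= lat_range[1] // CELL_DEG and
                        lon_range[0] // CELL_DEG <= lo <= lon_range[1] // CELL_DEG and
                        t_range[0] // WINDOW_SEC <= tb <= t_range[1] // WINDOW_SEC):
                    hits.extend(r for r in recs
                                if lat_range[0] <= r[0] <= lat_range[1]
                                and lon_range[0] <= r[1] <= lon_range[1]
                                and t_range[0] <= r[2] <= t_range[1])
            return hits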

    In my own conversations with organizations that have valuable latent sensor data sets, there is a lot of interest in aggregating and analyzing this data but no cost-effective way of doing so. These are the myriad machine-generated data sources that no one ever hears about. Software like Hadoop doesn’t solve this particular problem very well yet. Once the benefit exceeds the cost of dealing with this data and turning it into an analyzable form I expect we will “discover” sensor-based data sets popping up like weeds as organizations attempt to monetize them.

  2. Curt Monash on April 8th, 2010 9:58 pm

    Andrew,

    Would SciDB address some of the issues you’re talking about?

  3. J. Andrew Rogers on April 9th, 2010 1:46 am

    SciDB looks like a big improvement; it appears to provide a pretty rich toolset for basic data handling. At a minimum, it looks like it can do most of the first-pass processing of the input data.

    The single most common bottleneck (in my past experience), besides raw computational power, was indexing the features pulled out of the raw data. R-tree-like data structures at that scale have really poor ingest and update rates for the dynamic polygonal features that were the usable end product, and it doesn’t look like this is addressed. Point clouds scale out very nicely for basic read-write, but they are compute-intensive for contextual/analytical queries relative to polygons, and there is a modest practical upper bound on the number of records for conventional polygon handling. It is a tradeoff.
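
    A toy illustration of that tradeoff, in which points can be appended with essentially no indexing cost but every contextual query pays for a geometry test per candidate point (a generic sketch, not SciDB or any production system):

        # Toy sketch of the point-cloud tradeoff: ingest is a cheap append,
        # but a contextual query runs a point-in-polygon test per point.
        # The layout is illustrative; the ray-casting test is the standard one.
        points = []                          # append-only point cloud: (x, y, payload)

        def ingest(x, y, payload):
            points.append((x, y, payload))   # O(1); no index maintenance on write

        def point_in_polygon(x, y, poly):
            # Classic ray-casting test; poly is a list of (x, y) vertices.
            inside = False
            n = len(poly)
            for i in range(n):
                x1, y1 = poly[i]
                x2, y2 = poly[(i + 1) % n]
                if (y1 > y) != (y2 > y):
                    x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
                    if x < x_cross:
                        inside = not inside
            return inside

        def query(poly):
            # Compute-heavy read: every stored point pays for a geometry test.
            return [p for p in points if point_in_polygon(p[0], p[1], poly)]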

  4. Curt Monash on April 9th, 2010 3:53 am

    Andrew,

    I’m not sure that there’s any reasonable solution to polygons other than “Scan a lot of data and do what you have to do.” I.e., while R-trees and the like don’t sufficiently help w/ polygon-style analysis, they also don’t particularly get in the way. The same would likely be true of SciDB.

    One thing SciDB is designed for is to store “cooked” post-processed results in line w/ the raw data. That could actually help with the kind of problems you’re raising.

    At least in theory …
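
    A minimal sketch of the “cooked results in line with the raw data” idea (generic Python for illustration, not SciDB’s actual array model): derived features sit next to the raw chunk they came from, so later analysis can drill from a feature straight back to its source samples.

        # Sketch: keep "cooked" derived results alongside the raw data they came
        # from, so an analyst can drill from a feature back to the source samples.
        # Generic structure for illustration; not SciDB's actual storage model.
        from dataclasses import dataclass, field

        @dataclass
        class Chunk:
            raw: list                                      # raw sensor samples
            features: dict = field(default_factory=dict)   # derived ("cooked") results

        def cook(chunk):
            # First-pass processing: derive summary features and store them in line.
            chunk.features["mean"] = sum(chunk.raw) / len(chunk.raw)
            chunk.features["peak"] = max(chunk.raw)

        archive = [Chunk(raw=[1.0, 4.0, 2.5]), Chunk(raw=[0.2, 0.1, 9.9])]
        for c in archive:
            cook(c)

        # Query the cooked features, then drill back to the raw samples behind a hit.
        hits = [c for c in archive if c.features["peak"] > 5.0]
        print(hits[0].raw)    # the raw samples that produced the flagged feature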

  5. Jim Saunders on April 9th, 2010 7:18 am

    The data issue we are having is related to this as well. We’re looking to expand a current 4 TB Oracle RAC database hosting various aviation data, with growth rates of perhaps 50-100 TB a year of radar, GPS, terminal, weather, and other data feeds. Our analysts want to be able to take any flight and perform end-to-end analysis, model all airspace, and whatever else they can think of. We have a small Hadoop cluster as a test to see if this can meet our needs. Right now we are willing to experiment with almost any software to find a solution, as the current Oracle setup just is not performing as the customer expects.
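
    A minimal sketch of the kind of end-to-end flight analysis described above, grouping position reports by flight ID and sorting each group by time to reconstruct a track (the record format and field values are hypothetical, and this toy stands in for what a Hadoop job would do at scale):

        # Toy stand-in for a Hadoop-style job: group position reports by flight ID,
        # then sort each flight's reports by time to get an end-to-end track.
        # The (flight_id, ts, lat, lon, alt) record format is a hypothetical example.
        from collections import defaultdict

        reports = [
            ("UA123", 1000, 38.9, -77.0, 12000),
            ("UA123", 1060, 39.1, -76.8, 15000),
            ("DL456", 1010, 33.6, -84.4,  8000),
        ]

        tracks = defaultdict(list)                 # "map" phase: key by flight ID
        for flight_id, ts, lat, lon, alt in reports:
            tracks[flight_id].append((ts, lat, lon, alt))

        for flight_id, points in tracks.items():   # "reduce" phase: order each track
            points.sort()
            print(flight_id, points[0], "->", points[-1])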

  6. Georges Bory on April 22nd, 2010 12:29 pm

    We all knew the time would come when computing power would generate more data than can be physically handled by humans. Working in the treasury and capital markets world, it’s no surprise to see that the data generated by financial instrument trades makes your top 5 of machine-generated data that ‘matters’. However, it’s worth noting that computer-generated data comes not only from algo trading but also from the production of simulation scenarios, which are produced by large grid computing infrastructures, so traders can compute Value at Risk, Sensitivities and Counterparty Risk. Simulations of this nature will be relied upon more and more as financial institutions desperately try to avoid a repetition of the recent events in the financial markets. Ultimately, the challenge lies not only in managing this explosion of computer-generated data but also in having the ability to effectively analyse and act on it.

  7. Curt Monash on April 22nd, 2010 2:15 pm

    George,

    To what extent are those simulations ever STORED?

    I understand that simulations are huge in various parts of high-performance computing, but I’ve never gotten more than throwaway comments helping me understand whether actually storing the output of the simulations is a big deal.

    Thanks,

    CAM

  8. Georges Bory on April 23rd, 2010 12:54 pm

    Hi Curt,

    It is true that most of the data generated during the simulation process for computing indicators such as Value at Risk (VaR), Potential Exposure, etc. used to be thrown away once key statistics were computed (loss quantile, closed-form VaR, MPE, etc.).

    However, after the financial crisis there is now a requirement (business and regulatory) to go beyond a single number to understand and manage risk. Furthermore, one needs to monitor changes in risk, be able to look back in the past, and drill down to individual scenarios or trades. For all those reasons, the results of those simulations are now stored and need to be analyzed. You might want to look at this benchmark; it describes a realistic volume scenario for an international bank.

    http://www.quartetfs.com/benchmark.php

    I hope this helps,

    Georges Bory
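
    As a worked illustration of why storing the scenarios matters, here is a sketch of the generic simulation-VaR idea (the P&L figures are made up, and this is not any particular vendor’s methodology); VaR is a loss quantile of the simulated P&L distribution, and keeping the per-scenario results lets you drill into the scenarios behind it:

        # Illustrative sketch: compute 99% Value at Risk as a quantile of simulated
        # portfolio P&L, then drill back into the stored scenarios behind the tail.
        # The P&L figures are made up; this is the generic simulation-VaR idea,
        # not any particular vendor's methodology.
        import random

        random.seed(0)
        scenarios = [random.gauss(0, 1_000_000) for _ in range(10_000)]  # P&L per scenario

        losses = sorted(-pnl for pnl in scenarios)       # losses, ascending
        var_99 = losses[int(0.99 * len(losses)) - 1]     # 99th percentile loss
        print(f"99% VaR ~ {var_99:,.0f}")

        # Because the scenarios were stored rather than discarded, we can still ask
        # which individual scenarios make up the tail beyond the VaR threshold.
        tail = [pnl for pnl in scenarios if -pnl >= var_99]
        print(len(tail), "scenarios at or beyond the 99% loss threshold")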

  10. Warren Davidson on June 14th, 2010 10:51 am

    We’ve been involved with numerous military programs that collect large volumes of sensor data, and the volumes are certainly increasing all the time. A big factor for some of these folks is the big gains in hardware for collecting higher-quality data.
    Collecting large amounts of data for our customers probably isn’t as big a problem as figuring out what’s important (or how to make decisions on the data you get). What is increasingly important is how to fuse multiple sources of data so you can increase the value of that data. Petabytes of data are no longer an issue for many data stores; it’s the complexity of the data that will be the real challenge over the next few years.
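
    A minimal sketch of that fusion step (generic nearest-timestamp alignment of two feeds; the feed contents and tolerance are illustrative assumptions, not from any specific program):

        # Sketch of simple sensor fusion: align two independently timestamped feeds
        # by nearest timestamp within a tolerance. Feed contents are illustrative.
        import bisect

        radar   = [(1000, "track A"), (1005, "track B"), (1012, "track C")]  # (ts, obs)
        weather = [(998, "clear"), (1006, "rain"), (1030, "fog")]

        def fuse(primary, secondary, tolerance=5):
            sec_ts = [ts for ts, _ in secondary]
            fused = []
            for ts, obs in primary:
                i = bisect.bisect_left(sec_ts, ts)
                # Consider the neighbors on either side and keep the closest one.
                candidates = [j for j in (i - 1, i) if 0 <= j < len(secondary)]
                best = min(candidates, key=lambda j: abs(sec_ts[j] - ts), default=None)
                if best is not None and abs(sec_ts[best] - ts) <= tolerance:
                    fused.append((ts, obs, secondary[best][1]))
                else:
                    fused.append((ts, obs, None))
            return fused

        print(fuse(radar, weather))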
