People often try to draw a distinction between:
- Traditional data of the sort that’s stored in relational databases, aka “structured.”
- Everything else, aka “unstructured” or “semi-structured” or “complex.”
There are plenty of problems with these formulations, not least that the supposedly “unstructured” data is the kind that actually tends to have the most interesting internal structure. But of the many reasons these distinctions don’t work very well, I think the most important is this:
Data shouldn’t be divided into just two categories. Even as a rough-cut approximation, it should be divided into three, namely:
- Human/Tabular data – i.e., human-generated data that fits well into relational tables or arrays
- Human/Nontabular data – i.e., all other data generated by humans
- Machine-generated data
Even that trichotomy is grossly oversimplified, for reasons such as:
- These categories overlap.
- There are kinds of data that get into fuzzy border zones.
- Not all data in each category has all the same properties.
But at least as a starting point, I think this basic categorization has some value.
By human-generated data that fits well into relational tables or arrays, what I really mean is: the input from most conventional kinds of transactions – purchase/sale, inventory/manufacturing, employment status change, etc. This is the core data managed by OLTP relational DBMS everywhere. It is also the core data in analytic relational or MOLAP databases. The vast majority of what we think we know about “database management” applies primarily to data of this kind, in large part because of two fundamental properties of this information:
- It is meaningful to contemplate this data as being 100% accurate and complete (even if that goal is difficult to achieve in the real world).
- This data is precise – i.e., one can check predicates against it and (give or take regrettable data imperfections) get inarguable yes/no answers, as the sketch after this list illustrates.
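To make “precise” concrete, here is a minimal sketch in Python, using the standard library’s sqlite3 module against a hypothetical orders table (the schema and figures are invented purely for illustration). The point is simply that a predicate over this kind of data has exactly one defensible answer:

```python
import sqlite3

# Hypothetical orders table; schema and figures are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("Acme", 120.00), ("Globex", 75.50), ("Acme", 300.25)],
)

# A predicate over tabular data has an inarguable yes/no answer:
# either an order over $200 exists or it does not.
row = conn.execute(
    "SELECT EXISTS(SELECT 1 FROM orders WHERE amount > 200)"
).fetchone()
print("Any order over $200?", bool(row[0]))  # True
```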
For most enterprises, this is the most important data they have. It was created as a result of expensive business activities. It deals directly with money, employees, physical goods, and the rest of the things that make an enterprise go. It can be fruitfully analyzed in ever more ways, which is why it should never be thrown out or even entirely relegated to tape, now that data warehouse software, hardware, and storage have become so cheap. (“Disk is the new tape.”) And because of the importance of both preserving and accessing it, it should often be stored in multiple copies – OLTP, data warehouse, data mart, in-memory analytics, near-line quasi-archive, MOLAP cubes (if you must) and so on, plus of course replicas for high throughput and availability.
But humans generate many other kinds of data as well, especially in a form directly suitable for communication – text (in many formats), documents (text or otherwise), pictures, videos, etc. Traditional relational databases are a poor home for this kind of data because:
- This data often deals with opinions or aesthetic judgments – there is little concept of perfect accuracy.
- Similarly, there is little concept of perfect completeness.
- There’s also little concept of perfectly, unarguably accurate query results – different people will have different opinions as to what comprises good results for a search.
- Queries don’t lend themselves to binary answers; rather, documents can have differing degrees of relevance, as the sketch after this list illustrates.
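Here is a minimal Python sketch of that last point. The scoring function (bare term overlap) and the documents are hypothetical stand-ins for what real search engines do with far more sophistication; the point is only that results come back with degrees of match rather than yes/no answers:

```python
def relevance(query: str, document: str) -> float:
    """Fraction of query terms that appear in the document (toy measure)."""
    q_terms = set(query.lower().split())
    d_terms = set(document.lower().split())
    return len(q_terms & d_terms) / len(q_terms)

# Hypothetical documents, purely for illustration.
docs = [
    "quarterly revenue grew on strong widget sales",
    "the widget factory tour starts at noon",
    "annual report on revenue and expenses",
]

query = "widget revenue growth"
for doc in docs:
    # Each document gets a score between 0 and 1 rather than a binary
    # match/no-match, and reasonable people can still disagree about
    # whether the resulting ranking is "right."
    print(f"{relevance(query, doc):.2f}  {doc}")
```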
Systems for managing this kind of data are much less advanced than relational database managers. Nobody knows how to get all the information out of a text document, or how to query all of it even if they could, and the story is even worse for non-text examples. The systems that give the best query results aren’t necessarily the same ones that have the best database administration features. Basically, this area is still a mess, and it’s a mess that consumes a huge fraction of all the data storage products sold today.
But give or take questions of storage efficiency and deduplication, if humans created that kind of data, they put a lot of effort into it, so it’s worth keeping. Besides, compliance regulations commonly mandate that we do so – except, perhaps, when they mandate that we throw it away.
Machine-generated data is a whole other can of worms. Paradigmatic examples of what I mean by machine-generated data include:
- Computer, network, and other equipment logs
- Satellite and similar telemetry (whether for espionage or science)
- Location data such as RFID chip readings, GPS system output, etc.
- Temperature and other environmental sensor readings
- Sensor readings from factories, pipelines, etc.
- Output from many kinds of medical device, in hospitals and (increasingly) homes alike
Unlike human-generated data, whose growth is constrained by macro factors such as population and total level of economic activity, machine-generated data will continue to grow as fast as Moore’s Law lets it. That fact has two profound consequences:
- It is unrealistic to hope ever to keep most or all machine-generated data, whereas I think that’s exactly what should and will happen with human-generated data. (The back-of-envelope arithmetic after this list suggests why.)
- Before long, most data (by volume) will be machine-generated.
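To see the shape of the problem, consider a back-of-envelope calculation. Every figure below is an assumption chosen only for illustration; even a modest hypothetical sensor fleet piles up terabytes per year, and Moore’s Law keeps pushing all three inputs upward:

```python
# All figures are assumptions, chosen only to show the arithmetic.
SENSORS = 100_000       # assumed fleet size
READINGS_PER_SEC = 10   # assumed sampling rate per sensor
BYTES_PER_READING = 50  # assumed record size, metadata included

bytes_per_day = SENSORS * READINGS_PER_SEC * BYTES_PER_READING * 86_400
terabytes_per_year = bytes_per_day * 365 / 1e12

print(f"{bytes_per_day / 1e9:.0f} GB/day")    # 4320 GB/day
print(f"{terabytes_per_year:,.0f} TB/year")   # 1,577 TB/year
```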
And so it is not really an exaggeration to say that machine-generated data is the future of data management.
I’d like to close this long post by immediately pointing out some of the flaws in this simple trichotomy. One obvious gray area lies in hybrid human/machine-generated data, three big examples of which are:
- Web clickstreams
- Call detail records (CDRs)
- Stock trades
In all three cases, we are quickly getting to the point where this data is preserved in its entirety (even if the network event data associated with the web logs is reduced before storage). And in each case it fits pretty well into RDBMS, although Hadoop has a role to play as well. So pretending it’s purely human-generated probably isn’t all that misleading.
Another gray area lies in text that gets linguistically processed – i.e. via text-mining tools – with the output placed into a relational database. I don’t immediately see a workaround for that flaw in my labeling scheme. So let’s just say no taxonomy is perfect.*
*Come to think of it, that’s one of the problems holding back text-mining technology.
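For the curious, here is a minimal Python sketch of that gray-area pipeline: text-mining output landing in a relational table. The “extractor” is a trivial hypothetical stand-in (a lookup against a small lexicon of company names); real text-mining tools are of course far more sophisticated:

```python
import sqlite3

# Hypothetical entity lexicon; a real extractor would be far smarter.
KNOWN_COMPANIES = {"Vertica", "Oracle", "IBM"}

def extract_entities(text: str) -> list[str]:
    """Return known company names mentioned in the text (toy extractor)."""
    return [w.strip(".,;") for w in text.split() if w.strip(".,;") in KNOWN_COMPANIES]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mentions (doc_id INTEGER, company TEXT)")

# A made-up document, purely for illustration.
documents = {1: "Vertica was acquired; IBM and Oracle were mentioned as suitors."}
for doc_id, text in documents.items():
    for company in extract_entities(text):
        conn.execute("INSERT INTO mentions VALUES (?, ?)", (doc_id, company))

# Fuzzy human-generated text is now queryable as precise tabular data,
# which is exactly why this case blurs the taxonomy.
for row in conn.execute("SELECT company, COUNT(*) FROM mentions GROUP BY company"):
    print(row)
```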
But the biggest oversimplification stems from this:
As Mike Stonebraker* and I argued a couple of years ago, I really think that database management technologies should be divided into 10+ categories.
*Note: The links to Stonebraker’s own posts will be broken until Vertica’s webmaster gets his/her act together. But you can find them under other URLs via web search.