People often try to draw a distinction between:
- Traditional data of the sort that’s stored in relational databases, aka “structured.”
- Everything else, aka “unstructured” or “semi-structured” or “complex.”
There are plenty of problems with these formulations, not least that the supposedly “unstructured” data is the kind that actually tends to have the most interesting internal structure. But of the many reasons these distinctions don’t work very well, I think the most important is this:
Data shouldn’t be divided into just two categories. Even as a rough-cut approximation, it should be divided into three, namely:
- Human/Tabular data – i.e., human-generated data that fits well into relational tables or arrays
- Human/Nontabular data – i.e., all other data generated by humans
- Machine-generated data
Even that trichotomy is grossly oversimplified, for reasons such as:
- These categories overlap.
- There are kinds of data that get into fuzzy border zones.
- Not all data in each category has all the same properties.
But at least as a starting point, I think this basic categorization has some value.
By human-generated data that fits well into relational tables or arrays, what I really mean is: the input from most conventional kinds of transactions – purchase/sale, inventory/manufacturing, employment status change, etc. This is the core data managed by OLTP relational DBMS everywhere. It is also the core data in analytic relational or MOLAP databases. The vast majority of what we think we know about “database management” applies primarily to data of this kind, in large part because of two fundamental properties of this information:
- It is meaningful to contemplate this data as being 100% accurate and complete (even if that goal is difficult to achieve in the real world).
- This data is precise – i.e., one can check predicates against it and (give or take regrettable data imperfections) get inarguable yes/no answers, as the sketch after this list illustrates.
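To make “precise” concrete, here is a minimal sketch in Python, using the standard library’s sqlite3 module against a hypothetical orders table (the schema and figures are invented purely for illustration). The point is simply that a predicate over this kind of data has exactly one defensible answer:

```python
import sqlite3

# Hypothetical orders table; schema and figures are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("Acme", 120.00), ("Globex", 75.50), ("Acme", 300.25)],
)

# A predicate over tabular data has an inarguable yes/no answer:
# either an order over $200 exists or it does not.
row = conn.execute(
    "SELECT EXISTS(SELECT 1 FROM orders WHERE amount > 200)"
).fetchone()
print("Any order over $200?", bool(row[0]))  # True
```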
For most enterprises, this is the most important data they have. It was created as a result of expensive business activities. It deals directly with money, employees, physical goods, and the rest of the things that make an enterprise go. It can be fruitfully analyzed in ever more ways, which is why it should never be thrown out or even entirely relegated to tape, now that data warehouse software, hardware, and storage have become so cheap. (“Disk is the new tape.”) And because of the importance of both preserving and accessing it, it should often be stored in multiple copies – OLTP, data warehouse, data mart, in-memory analytics, near-line quasi-archive, MOLAP cubes (if you must) and so on, plus of course replicas for high throughput and availability.
But humans generate many other kinds of data as well, especially in a form directly suitable for communication – text (in many formats), documents (text or otherwise), pictures, videos, etc. Traditional relational databases are a poor home for this kind of data because:
- This data often deals with opinions or aesthetic judgments – there is little concept of perfect accuracy.
- Similarly, there is little concept of perfect completeness.
- There’s also little concept of perfectly, unarguably accurate query results – different people will have different opinions as to what comprises good results for a search.
- Queries don’t lend themselves to binary answers; rather, documents can have differing degrees of relevance, as the sketch after this list illustrates.
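Here is a minimal Python sketch of that last point. The scoring function (bare term overlap) and the documents are hypothetical stand-ins for what real search engines do with far more sophistication; the point is only that results come back with degrees of match rather than yes/no answers:

```python
def relevance(query: str, document: str) -> float:
    """Fraction of query terms that appear in the document (toy measure)."""
    q_terms = set(query.lower().split())
    d_terms = set(document.lower().split())
    return len(q_terms & d_terms) / len(q_terms)

# Hypothetical documents, purely for illustration.
docs = [
    "quarterly revenue grew on strong widget sales",
    "the widget factory tour starts at noon",
    "annual report on revenue and expenses",
]

query = "widget revenue growth"
for doc in docs:
    # Each document gets a score between 0 and 1 rather than a binary
    # match/no-match, and reasonable people can still disagree about
    # whether the resulting ranking is "right."
    print(f"{relevance(query, doc):.2f}  {doc}")
```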
Systems for managing this kind of data are much less advanced than relational database managers. Nobody knows how to get all the information out of a text document, or how to query all of it even if they could, and the story is even worse for non-text examples. The systems that give the best query results aren’t necessarily the same ones that have the best database administration features. Basically, this area is still a mess, and it’s a mess that consumes a huge fraction of all the data storage products sold today.
But give or take questions of storage efficiency and deduplication, if humans created that kind of data, they put a lot of effort into it, so it’s worth keeping. Besides, compliance regulations commonly mandate that we do so – except, perhaps, when they mandate that we throw it away.
Machine-generated data is a whole other can of worms. Paradigmatic examples of what I mean by machine-generated data include:
- Computer, network, and other equipment logs
- Satellite and similar telemetry (whether for espionage or science)
- Location data such as RFID chip readings, GPS system output, etc.
- Temperature and other environmental sensor readings
- Sensor readings from factories, pipelines, etc.
- Output from many kinds of medical device, in hospitals and (increasingly) homes alike
Unlike human-generated data, whose growth is constrained by macro factors such as population and total level of economic activity, machine-generated data will continue to grow as fast as Moore’s Law lets it. That fact has two profound consequences:
- It is unrealistic to hope ever to keep most or all machine-generated data, whereas I think that’s exactly what should and will happen with human-generated data. (The back-of-envelope arithmetic after this list suggests why.)
- Before long, most data (by volume) will be machine-generated.
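To see the shape of the problem, consider a back-of-envelope calculation. Every figure below is an assumption chosen only for illustration; even a modest hypothetical sensor fleet piles up terabytes per year, and Moore’s Law keeps pushing all three inputs upward:

```python
# All figures are assumptions, chosen only to show the arithmetic.
SENSORS = 100_000       # assumed fleet size
READINGS_PER_SEC = 10   # assumed sampling rate per sensor
BYTES_PER_READING = 50  # assumed record size, metadata included

bytes_per_day = SENSORS * READINGS_PER_SEC * BYTES_PER_READING * 86_400
terabytes_per_year = bytes_per_day * 365 / 1e12

print(f"{bytes_per_day / 1e9:.0f} GB/day")    # 4320 GB/day
print(f"{terabytes_per_year:,.0f} TB/year")   # 1,577 TB/year
```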
And so it is not really an exaggeration to say that machine-generated data is the future of data management.
I’d like to close this long post by immediately pointing out some of the flaws in this simple trichotomy. One obvious gray area lies in hybrid human/machine-generated data, three big examples of which are:
- Web clickstreams
- Call detail records (CDRs)
- Stock trades
In all three cases, we are quickly getting to the point where this data is preserved in its entirety (even if the network event data associated with the web logs is reduced before storage). And in each case it fits pretty well into RDBMS, although Hadoop has a role to play as well. So pretending it’s purely human-generated probably isn’t all that misleading.
Another gray area lies in text that gets linguistically processed – i.e. via text-mining tools – with the output placed into a relational database. I don’t immediately see a workaround for that flaw in my labeling scheme. So let’s just say no taxonomy is perfect.*
*Come to think of it, that’s one of the problems holding back text-mining technology.
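For the curious, here is a minimal Python sketch of that gray-area pipeline: text-mining output landing in a relational table. The “extractor” is a trivial hypothetical stand-in (a lookup against a small lexicon of company names); real text-mining tools are of course far more sophisticated:

```python
import sqlite3

# Hypothetical entity lexicon; a real extractor would be far smarter.
KNOWN_COMPANIES = {"Vertica", "Oracle", "IBM"}

def extract_entities(text: str) -> list[str]:
    """Return known company names mentioned in the text (toy extractor)."""
    return [w.strip(".,;") for w in text.split() if w.strip(".,;") in KNOWN_COMPANIES]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mentions (doc_id INTEGER, company TEXT)")

# A made-up document, purely for illustration.
documents = {1: "Vertica was acquired; IBM and Oracle were mentioned as suitors."}
for doc_id, text in documents.items():
    for company in extract_entities(text):
        conn.execute("INSERT INTO mentions VALUES (?, ?)", (doc_id, company))

# Fuzzy human-generated text is now queryable as precise tabular data,
# which is exactly why this case blurs the taxonomy.
for row in conn.execute("SELECT company, COUNT(*) FROM mentions GROUP BY company"):
    print(row)
```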
But the biggest oversimplification stems from this:
As Mike Stonebraker* and I argued a couple of years ago, I really think that database management technologies should be divided into 10+ categories.
*Note: The links to Stonebraker’s own posts will be broken until Vertica’s webmaster gets his/her act together. But you can find them under other URLs via web search.