We hear much these days about unstructured or semi-structured (as opposed to) structured data. Those are misnomers, however, for at least two reasons. First, it’s not really the data that people think is un-, semi-, or fully structured; it’s databases.* Relational databases are highly structured, but the data within them is unstructured — just lists of numbers or character strings, whose only significance derives from the structure that the database imposes.
*Here I’m using the term “database” literally, rather than as a concise synonym for “database management system”. But see below.
Second, a more accurate distinction is not whether a database has one structure or none – it’s whether a database has one structure or many. The easiest way to see this is for databases that have clearly-defined schemas. A relational database has one schema (even if it is just the union of various unrelated sub-schemas); an XML database, however, can have as many schemas as it contains documents.
One small terminological problem is easily handled, namely that people don’t talk about true databases very often, at least when they’re discussing generalities; rather, they talk about data and DBMS.* So let’s talk of DBMS being “structured” singly or multiply or whatever, just as the databases they’re designed to manage are.
*And they refer to the DBMS as “databases,” because they don’t have much other use for the word.
All that said — I think that single vs. multiple database structures isn’t a bright-line binary distinction; rather, it’s a spectrum. For example:
- IMS is the most structured DBMS of all. The data in an IMS database is in a hierarchy, and that’s that.
- CODASYL and other kinds of what used to be called network DBMS (before the word got so overloaded) — e.g. RDB, IDMS, or TOTAL — are/were almost as structured as IMS.
- Relational databases were invented because their structure was more flexible than that of linked-list databases. The whole point of relational DBMS is that you can view the data in a multitude of ways. Still, I see classical relational databases as being toward the single-structure end of the spectrum. (I say “classical” because Oracle and DB2 actually can manage combinations of XML, text, and traditional relational tables, if you choose.)
- A multivalue DBMS is a little more multi-structured than a relational one, because of how a field can be filled one or multiple times.
- eBay Singularity (as implemented on Teradata gear) has, in essence, two structures (that I know of). One structure is just the relational schema. The other is the structure you would get if each kind of name-value pair truly had its own column.
- A Splunk collection of log data can reasonably be said to have a different structure for every type or source of log. It further can be said to have multiple structures in somewhat the same way that eBay Singularity does.
- So-called document stores can be very multi-structured. MongoDB, Couchbase, et al. let you have a different structure for every document, if you choose. The same goes for XML-based MarkLogic.
- HBase and Cassandra are also very multi-structured. Theoretically, each record gets to decide which column sets it does or doesn’t fit into.
As a general rule — the more structures a database can have at once, the easier it is to change those structures, even on the fly (e.g., by inserting yet another bit of self-describing data). Thus, I sometimes use the term polystructured instead of multi-structured or multistructured. Thoughts as to which term I should choose going forward would be much appreciated.
As for an actual definition — well, here’s something I drafted 3 1/2 years ago but never published:
These problems with the relational paradigm are big enough to be worth coining a word for – polystructured. Polystructured data is data with structure that:
- Can be exploited to provide most of the benefits of a highly structured database (e.g., a tabular/relational one) …
- … but cannot be described in the concise, consistent form such highly structured systems require.
Specifically, we’ll call a database “polystructured” if it is characterized by at least two of the following:
- Data suitable for being queried by simple predicate-based matching (e.g., equality to certain values, falling with in ranges, etc.)
- (Other) data suitable for being queried by more complex matching (e.g., text search relevancy rankings)
- Subsets that are more neatly structured than the whole.
Equivalently, we’ll just say that polystructured data is data that has considerable structure, but whose structure is in some important way unpredictable.
NoSQL document or “column” stores would satisfy #1 and #3, as would Splunk. MarkLogic would satisfy all three criteria. #1 + #2 is sort of like what happens when text queries are allowed to go against (groups of) relational columns … and the vagueness with which I’m saying that makes me suspect that at least the unbolded/first definition doesn’t really fly.
Finally, here’s what led up to those definitions (the whole thing is from the introduction to a never-completed white paper). Please forgive any anachronisms in it. A number of the points in it have also been addressed in posts here; e.g.,
- In December, 2005 I expounded on the mismatch between text data and the relational model.
- In June, 2010 I elucidated the variety of data that could go into an individual’s marketing-oriented profile.
- In February, 2008 I predicted that flexible-schema DBMS would gain share.
The case for polystructured data
Traditional computer databases amount to sets of records. There usually are a limited number of record formats, which each instance of a particular format containing parallel kinds of information. Business transactions, web page visits, instrument readings– whatever the nature of the information, application designers stick it into the simplest structure they think makes sense.
These records are arranged into a variety of data structures.
- Log files are widely used, especially to track web site visits, in other networking uses, and for other kinds of instrument readings.
- Computer user administration is commonly in LDAP (Lightweight Directory Access Protocol) format.
- There are still a lot of installations of legacy “linked-list” DBMS (DataBase Management Systems) such as IBM’s IMS.
- Some decision support applications use data in multidimensional arrays.
Even so, most new business applications are written over relational DBMS, in the well-known rows-and-tables paradigm.
There are good reasons for the dominance of the relational model and of rows and tables. (Strictly speaking, “relational” equates neither to “rows and tables” nor to “SQL”, but in practice the three concepts are closely linked.) In particular:
- Data integrity is (fairly) easy to ensure.
- From some standpoints, relational databases are flexible; you can construct almost any kind of query, without having to do any kind of database reorganization (except perhaps for performance).
- SQL programmers are easy to find.
- There’s simply been much more engineering effort invested in making good relational DBMS than in any other kind.
But the relational database paradigm also has some major drawbacks. Three of the big ones are:
- Queries must have strictly match/fail answers; there’s no natural way for a relational DBMS to handle “somewhat relevant” hits.
- Relational databases can get cumbersome when large fractions of the potential data happen to be missing. (Hence the decades-long debates about the problems with NULL values.)
- While you have good flexibility in querying against any particular data structure, you do have to predefine your structure before you start accepting input.
The last point is why you wind up with all those NULL values in the first place; if a kind of information can be in any record in a set, the database is set up to assume that its present in all of them. Or if you normalize your database so highly as to avert missing values, then you wind up with a huge number of tables, making queries (and updates) complicated from both the programmer’s and the machine’s standpoint.
Text apps suffer from RDBMS’ inelegant handing of relevancy. What’s more, documents can have almost unlimited internal structures, in three senses:
- They can have chapters, sections, subsections, sidebars, footnotes, and so on, in any combination.
- Semantic references can link words, phrases, sentences, and paragraphs in a near-infinite number of ways.
- Documents can explicitly contain fielded data, such as numbers, addresses, dates, or geo-encodings.
Another group of apps that suffer from RDBMS’ limitations are in the area of personalization and similar fine-grained marketing analysis. Analysis of web clicks throws away most kinds of path information. Analysis of written or verbal communication isn’t well-integrated with that of fielded data. Different customers and prospects give different kinds of contact information, and are “touched” by different marketing initiatives; current systems do a poor job of integrating all that scattered information.