My recent argument that the common terms “unstructured data” and “semi-structured data” are misnomers, and that a word like “multi-” or “poly-structured”* would be better, seems to have been well received. But which is it, “multi-” or “poly-”?
*Everybody seems to like “poly-structured” better when it has a hyphen in it — including me.
The big difference between the two is that “multi-” just means there are multiple structures, while “poly-” further means that the structures are subject to change. Upon reflection, I think the “subject to change” part is essential, so poly-structured it is.
The definitions I’m proposing are:
- A database is poly-structured to the extent that its structure is apt to be changed in the ordinary course of query, update, or programming.
- Data is poly-structured to the extent that it is best represented in a poly-structured database.
- A DBMS is poly-structured to the extent that it is oriented to managing poly-structured databases.
- There are many different degrees of being poly-structured; that’s why I used the phrase “to the extent that”, instead of a simple “if”.
- And as always, no technology categorization is ever precise.
Examples of poly-structure include:
- XML or JSON documents/objects describe themselves. Add a new one whose structure differs from the others already in the database and, presto, you have changed the overall structure. Thus:
  - XML and JSON data is apt to be poly-structured.
  - XML and JSON databases are apt to be poly-structured.
  - MarkLogic, MongoDB, et al. are poly-structured DBMS.
- A text document is inherently poly-structured. Some queries might look at it as a bag of words; others might group the words via stemming and synonyms; others might actually exploit the document’s grammatical structure. Text search engines are poly-structured because they support all those kinds of queries.
- A single log file can be somewhat poly-structured, in that different views of it might extract different kinds of name-value pairs, or different temporal relationships.
- A database that seamlessly includes a variety of log files, each with its own structure(s), is quite poly-structured.
- A classic relational database is not very poly-structured, because DDL (Data Definition Language) isn’t really in “the ordinary course” of programming or update.
- However, views add a bit of poly-structure to relational databases that is not present in, say, IMS databases.
- An object-oriented DBMS is highly poly-structured, as is Workday’s internal data store.
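
To make the JSON and log-file examples above a little more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions, not how MarkLogic, MongoDB, or any other product actually works: the sample documents, the field names, the log line format, and the helper names (`observed_structure`, `as_name_value_pairs`, `as_timestamp`) are all hypothetical. The first half shows an ordinary insert changing the overall structure of a collection; the second shows two different views extracted from one log line.

```python
import json
import re


def observed_structure(docs):
    """Return the set of dotted field paths present anywhere in the collection."""
    paths = set()

    def walk(value, prefix):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, f"{prefix}.{key}" if prefix else key)
        elif isinstance(value, list):
            for child in value:
                walk(child, prefix + "[]")
        else:
            # A scalar leaf: record the path that led to it.
            paths.add(prefix)

    for doc in docs:
        walk(doc, "")
    return paths


# Hypothetical collection: two self-describing documents with the same shape.
collection = [
    json.loads('{"name": "Ann", "email": "ann@example.com"}'),
    json.loads('{"name": "Bob", "email": "bob@example.com"}'),
]
before = observed_structure(collection)

# An ordinary insert, with no DDL step: the new document just has a different shape.
collection.append(
    json.loads('{"name": "Cy", "phones": [{"type": "cell", "number": "555-0100"}]}')
)
after = observed_structure(collection)

# Paths that exist only because of that one insert.
new_paths = sorted(after - before)

# Hypothetical log line; real log formats vary widely.
LOG_LINE = "2011-05-03T10:15:00 user=ann action=login status=ok"


def as_name_value_pairs(line):
    """One view of the line: whatever key=value pairs it happens to contain."""
    return dict(re.findall(r"(\w+)=(\S+)", line))


def as_timestamp(line):
    """Another view of the same line: just its leading timestamp."""
    return line.split(" ", 1)[0]


print(new_paths)
print(as_name_value_pairs(LOG_LINE))
print(as_timestamp(LOG_LINE))
```

The point of the sketch is that nothing resembling a schema change was ever issued. The structure of the collection shifted as a side effect of a routine update, and the structure seen in the log line depended on which query looked at it.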
So what do you think? Do these definitions work?