May 17, 2011

Terminology: poly-structured data, databases, and DBMS

My recent argument that the common terms “unstructured data” and “semi-structured data” are misnomers, and that a word like “multi-” or “poly-structured”* would be better, seems to have been well-received. But which is it — “multi-” or “poly-”?

*Everybody seems to like “poly-structured” better when it has a hyphen in it — including me. :)

The big difference between the two is that “multi-” just means there are multiple structures, while “poly-” further means that the structures are subject to change. Upon reflection, I think the “subject to change” part is essential, so poly-structured it is.

The definitions I’m proposing are:

Please note:

Examples of poly-structure include:

So what do you think? Do these definitions work?

Comments

20 Responses to “Terminology: poly-structured data, databases, and DBMS”

  1. Morgan Goeller on May 17th, 2011 9:11 am

    I would also consider technologies like data federation, virtualization, and abstraction in your definition.

    Technologies like Composite Software, IBM Infosphere, and others can really change the dynamics of your data ecosphere.

  2. Curt Monash on May 17th, 2011 9:36 am

    Morgan,

    Federation and abstraction certainly fit into what I called “DBMS2” in the middle of the last decade. But I’m not sure that they tie closely into the definitions in this particular post.

  3. unholyguy on May 17th, 2011 9:37 am

    I like this definition overall.

    I think there needs to be more emphasis in there somewhere around change over time. XML, JSON, etc. can maintain multiple different versions of the schema simultaneously and still be accessed in the same way. So it is not always a “change it” scenario.
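
    A minimal sketch of that point, assuming two hypothetical versions of a customer record (the field names and the `phones` helper are invented for illustration): both versions sit side by side and are read through one accessor, with no migration step.

```python
import json

# Hypothetical data: two schema versions of the same kind of record, stored together.
raw_records = [
    '{"schema_version": 1, "name": "Ada Lovelace", "phone": "555-0100"}',
    '{"schema_version": 2, "name": "Alan Turing", "phones": ["555-0101", "555-0102"]}',
]


def phones(record):
    """Return a list of phone numbers, whichever schema version the record uses."""
    if "phones" in record:   # version 2 shape: a list of numbers
        return list(record["phones"])
    if "phone" in record:    # version 1 shape: a single number
        return [record["phone"]]
    return []                # neither field present


records = [json.loads(r) for r in raw_records]
all_phones = [phones(r) for r in records]
print(all_phones)
```

    The caller never asks which version it is looking at; the structure varies, but the access path does not.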

  4. Curt Monash on May 17th, 2011 4:41 pm

    unholyguy,

    I’m not sure what distinction you’re drawing.

    If a database has a bunch of different structures, the structures aren’t entirely physical; there’s something “virtual” about them. So you can have multiple structures at once, different ones of which are invoked/emphasized in different operations.

    What I’ve struggled a bit to capture is how this is deeper than the flexibility of the traditional relational model.

  5. Dave Duggal on May 18th, 2011 6:31 am

    ‘Subject to change’ is a much more powerful concept than just ‘many’.

    How about ‘dynamic-structure’, as the structure is virtual and configurable at run-time (which is what we do)? Alternatively, ‘polymorphic-structure’. “Poly” by itself doesn’t really connote ‘change’.

  6. Curt Monash on May 18th, 2011 6:34 am

    Dave,

    I find the term “polymorphic” a bit pretentious in most of its uses. And it’s easy to oversell flexibility anyhow. When things are flexible, often any one user will do one thing with it and not change all that much going forward (thus using only a small fraction of its power).

  7. Dave Duggal on May 18th, 2011 6:58 am

    Understood, but technically “polymorphic” is a better fit for what you are describing, and it’s a term fairly well understood by the software development / computer science community writ large.

    I support your decomposition effort so I’m not trying to nitpick.

    In our case, dynamic structure is a function of the system method, it’s canonical so the flexibility is inherent – every interaction benefits from late-binding for situational awareness.

    Ultimately, data serves the business. Static structures constrain variance, preclude context and are the enemy of agility.

  8. Curt Monash on May 18th, 2011 7:54 am

    I think the word “polymorphic” has become rather — as it were — overloaded. Hence my reluctance to pile yet more duties on it.

    Also, shorter terms are better than longer ones.

  9. Dave Duggal on May 18th, 2011 9:43 am

    Makes sense. The thinking is more important than the term, and the term works. Great series of posts.

  10. Duncan Irving on May 19th, 2011 1:04 am

    Curt – I like the general thrust of your thoughts and it is an interesting debate. The concept of a dynamically changing framework in data structures or in the DBMS is appropriate and there is scope for more rigour in the naming convention.
    My 2p’th:
    Poly- and Multi- have the same meaning, being the Greek and Latin prefixes for “many”.
    Polymorphic means “many forms” and as Dave points out, is well used in the programming community but doesn’t carry the sense of time-dependence or time-variance.
    To express the concept of change, or indeed the potential for change, through time, might I suggest “mutable”? Mutable objects and mutable constructs such as arrays are already well-understood by programmers and there is no ambiguity around the nomenclature either.

  11. Curt Monash on May 19th, 2011 1:16 am

    Duncan,

    I like the idea of “mutably structured” or something like that. My only objections, really, are:

    1. I’m looking to save syllables. (Phonemes, really, but to a first approximation it’s reasonable to say “syllables”.)
    2. “Mutably structured” sounds like it should be in a movie starring Sigourney Weaver.

  12. Traditional databases will eventually wind up in RAM | DBMS 2 : DataBase Management System Services on May 24th, 2011 1:41 am

    [...] post, which may differ slightly from those in my more recent posts about machine-generated data and poly-structured databases. But one general idea is hard to [...]

  13. The Data Blog: Aster Data Blog » Blog Archive » Multi-structured Data: Platform Capabilities Required for Big Data Analytics on June 13th, 2011 9:57 am

    [...] been a topic of discussion lately with IDC Research (where we first saw the term referenced) and other industry analysts. It is also the upcoming topic of a webcast we’re doing with the IDC on June [...]

  14. Hadapt update | DBMS 2 : DataBase Management System Services on July 6th, 2011 6:48 pm

    [...] Hadapt use cases are centered around keeping machine-generated or other poly-structured data in Hadoop, and extracting, enhancing, or otherwise deriving some of it to live in the [...]

  15. Charlie Reitzel on September 14th, 2011 9:12 am

    Strong vs. Loose Types – Data Quality

    In programming, a similar distinction is drawn between strongly typed and loosely typed languages. As most programmers know, C++ is the textbook “strongly typed” language. Smalltalk is the archetypal “loosely typed” language, where methods can be attached to individual objects, dynamically at runtime. Look, Ma, no classes!

    Perl is another loosely typed language, albeit with a decent class system. Java is generally thought of as having strong types, but with reflection and various byte code manipulation techniques, you can actually approach Smalltalk-like flexibility.

    Anyway, I think the analogy to database schemas is clear enough. How does the DBMS control the schema? You can see a continuum, with Codd & Date on one end and a bag o’ tags on the other.

    In most cases, the DBMS is not the limiting factor. Rather, as others have alluded, the logic that consumes the data will impose constraints on what “fields” must be present, the domain of the values, etc. In analytics, I think they call this “data quality”.
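
    As a rough sketch of constraints living in the consuming logic instead of the DBMS, assume a hypothetical report that needs certain fields and value domains (the rule names and fields below are invented for illustration, not taken from any product):

```python
# Hypothetical rules imposed by the consumer of the data, not by the database.
REQUIRED_FIELDS = {"customer_id", "amount"}
ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}


def quality_problems(record):
    """List the ways a loosely structured record fails this consumer's expectations."""
    problems = []

    # Field-presence constraint: the consumer needs these keys to exist.
    for field in sorted(REQUIRED_FIELDS - set(record)):
        problems.append(f"missing field: {field}")

    # Domain constraint: amount, when present, must be a non-negative number.
    amount = record.get("amount")
    if amount is not None and (not isinstance(amount, (int, float)) or amount < 0):
        problems.append("amount outside domain: must be a non-negative number")

    # Domain constraint: currency defaults to USD and must be a known code.
    currency = record.get("currency", "USD")
    if currency not in ALLOWED_CURRENCIES:
        problems.append(f"currency outside domain: {currency}")

    return problems


print(quality_problems({"amount": -5, "currency": "XYZ"}))
```

    The store accepts either record happily; it is the consumer that decides which ones are usable.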

  16. Michael Brackett on September 14th, 2011 2:46 pm

    The discussion is great, because ‘unstructured’ is the wrong term. It’s part of the lexical challenge in data resource management, and is promoted by people pumping the words without really knowing what they are saying.

    However, all the discussion is very physically oriented toward databases and database processing. How about stepping outside the database environment and looking at an organization’s total data resource? Then look at the possible ways that data are structured.

    I’ve done just that and prefer to use the terms ‘highly-structured’ to replace ‘semi-structured’ and ‘super-structured’ to replace ‘unstructured’. The terms seem to work quite well.

    The terms poly- and multi- imply that something could have many different structures or exist in many different forms. That’s not quite true. The point is that the structure is more intricate, more detailed, more inter-twined, etc., than typical tabular data that can be accessed by SQL.

  17. Another category of derived data : DBMS 2 : DataBase Management System Services on April 24th, 2012 12:00 am

    [...] surprisingly important kind of derived data is metadata, especially for large, poly-structured data sets. For example, CERN has vast quantities of experiment sensor data, stored as files; just [...]

  18. Glassbeam instantiates a lot of trends | DBMS 2 : DataBase Management System Services on October 30th, 2013 10:37 am

    [...] has an analytic technology stack focused on poly-structured machine-generated [...]

  19. Hadoop/RDBMS integration: Aster SQL-H and Hadapt | DBMS 2 : DataBase Management System Services on January 31st, 2014 9:07 am

    [...] right there you have an argument for flexible investigative or iterative analytics, over multi-structured (and relational) data. And if you think about how to combine information from all those data [...]

  20. Hadapt Version 2 | DBMS 2 : DataBase Management System Services on January 31st, 2014 9:07 am

    [...] multi-structured data into [...]
