July 31, 2011

Terminology: Dynamic- vs. fixed-schema databases

E. F. “Ted” Codd taught the computing world that databases should have fixed logical schemas (which protect the user from having to know about physical database organization). But he may not have been as universally correct as he thought. Cases I’ve noted in which fixed schemas may be problematic include:

And if marketing profile analysis is ever done correctly, that will be a huge example for the list.

So what do we call those DBMS — for example NoSQL, object-oriented, or XML-based systems — that bake the schema into the applications or the records themselves? In the MongoDB post I went with “schemaless,” but I wasn’t really comfortable with that, so I took the discussion to Twitter. Comments from Vlad Didenko (in particular), Ryan Prociuk, Merv Adrian, and Roland Bouman favored the idea that schemas in such systems are changeable or late-bound, rather than entirely absent. I quickly agreed.

The discussion wasn’t entirely serious; wise-ass comments were contributed by at least Merv, Neil Raden, Yiorgos Adamopoulos, and myself.

I like that approach for the same reason I favor saying that databases are poly- or multi-structured (rather than un- or semi-):  Every database has structure, the only question being when that structure is determined. I wouldn’t precisely equate “poly-structured” to “has a late-bound schema”; for example, I’d say that mucking with the DDL (Data Description Language) of a relational database shows that it’s a little bit poly-structured, even though it’s not at all late-bound. But the concepts are definitely related.

So what actual wording should we use here? The only alternative I see to fixed schema is “static”, and that feels like it has too much of a connotation of “unchangeable”. The simplest word I can think of for changeable/late-bound/whatever is dynamic schema; that choice also has the virtue of some traction, as per the Vlad Didenko tweet linked above. Casual googling is also supportive of “fixed” and “dynamic”, at least over whatever alternatives I came up with. So those are my choices.
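The fixed/dynamic distinction can be sketched in a few lines of Python; sqlite3 stands in for a fixed-schema DBMS, and a list of dicts for a dynamic-schema one (an illustrative toy, not any particular product):

```python
import sqlite3

# Fixed schema: the structure is declared up front, before any data arrives.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada')")

# A record with an unanticipated extra attribute is rejected outright.
try:
    conn.execute("INSERT INTO users VALUES (2, 'Bob', 'bob@example.com')")
except sqlite3.OperationalError as e:
    print("fixed schema rejected it:", e)

# Dynamic schema: each record carries (or implies) its own structure,
# so the set of attributes can vary record by record and over time.
users = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Bob", "email": "bob@example.com"},  # new attribute, no DDL
]
print([u.get("email") for u in users])  # [None, 'bob@example.com']
```

The point is when the structure is bound: before any data is written on the fixed side, versus per record on the dynamic side.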

For actual definitions, I’ll say:

Suit yourself as to what you say about relational DBMS when they also have a bit of XML, text, or whatever support.

By these definitions:

What do you think? Do these definitions work for you?

Comments

31 Responses to “Terminology: Dynamic- vs. fixed-schema databases”

  1. The Ted Codd guarantee | DBMS 2 : DataBase Management System Services on July 31st, 2011 6:04 pm

    [...] my next post, I’ll return to the subject of why fixed schemas might not always be such a good idea after all. Categories: Data models and architecture, IBM and DB2, MOLAP, NoSQL  Subscribe to our [...]

  2. Mark Perrett on August 1st, 2011 2:45 am

    Not to get hung up on language but fixed versus variable (rather than dynamic) sounds quite appealing to me.

  3. Curt Monash on August 1st, 2011 4:16 am

    Mark,

    I’m embarrassed to say I didn’t think of “variable”. That said — “dynamic” has the connotation that the variation happens THROUGH the actions of a program, and that’s an important distinction.

  4. Daniel Weinreb on August 1st, 2011 7:17 am

    “Changeable” and “late-bound” both sound right to me.

  5. Curt Monash on August 1st, 2011 7:30 am

    Dan,

    Is there a way in which they’re better than “dynamic”?

  6. Roland Bouman on August 1st, 2011 7:32 am

    Hi Curt!

    thanks for the mention – frankly I contributed nothing to the discussion, I just liked your terminology :)

    There’s one related pair of notions that I picked up on the freebase mailing list and that I think is helpful in this context: “data-first” and “schema-last”.

    As you may know, freebase (www.freebase.com) is powered by Metaweb, a proprietary graph database. The unit of storage is a node, which is rather like an RDF subject-predicate-object triple, except that it is a 4-tuple, because the “object” part can be a scalar, a reference to another node, or a combination of both.

    In theory the database allows you to write whatever nodes you like, at any time. So if you have a collection of arbitrary facts and you express them in nodes you can always store them: “data-first”.

    However, freebase does have a well-defined schema: on the top level, there is a “domain” which groups a collection of “types”, which is itself a collection of “properties”. Each node can belong to one or more types. Properties themselves have an expected type, which defines whether it can hold a scalar value, a reference to another node, or both. The property also defines cardinality, so a “value” can be an array.

    The metadata (domains, types, properties, etc.) are stored inside the same database, also as nodes, and can be linked afterwards to other nodes to record schema information: “schema-last”.

    This “data-first, schema-last” approach allows people to record data as soon as it is available to them, and defers the proper characterization of the data to a later stage. In theory you can even have conflicting schemas tagging the same data, reflecting the fact that different people may interpret it differently.

    The nice thing about the “data-first”, “schema-last” terminology is that its reversal accurately describes what you have in a typical RDBMS: those are “schema-first, data-last”, because without a schema you literally have nothing to target the data to.
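Roland's data-first/schema-last flow can be sketched as a toy 4-tuple store (loosely modeled on his description, not on Metaweb's actual implementation; all node names here are invented):

```python
# A toy "data-first, schema-last" store: every fact is a 4-tuple of
# (subject, predicate, scalar_value, node_reference).
store = []

def assert_fact(subject, predicate, value=None, ref=None):
    store.append((subject, predicate, value, ref))

# Data first: record facts before any schema exists.
assert_fact("n1", "name", value="Miles Davis")
assert_fact("n1", "recorded", ref="n2")
assert_fact("n2", "title", value="Kind of Blue")

# Schema last: the schema is just more nodes and facts,
# linked to existing data afterwards.
assert_fact("t_musician", "is_type", value=True)
assert_fact("n1", "has_type", ref="t_musician")

def of_type(type_node):
    """Find all subjects that were later tagged with the given type node."""
    return [s for (s, p, v, r) in store if p == "has_type" and r == type_node]

print(of_type("t_musician"))  # ['n1']
```

A schema change here is just another append — which is also the nub of Scott Meyer's comment below about the cost of schema change.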

  7. Dave Duggal on August 1st, 2011 11:06 am

    Hi Curt,

    Great post. I’m glad to see you methodically working through and classifying the new capabilities enabled by emerging technologies.

    Our Ideate Framework uses a web-style architecture for an EnterpriseWeb. We use it as a graph information model where all relationships are logical and all interactions are fully dynamic.

    We have been using the term ‘Dynamic Schemas’ to describe the adaptive qualities of our business entities and their linked relations.

    Committing to relatively static entity definitions and ERDs is a form of premature optimization. It presumes that if you spend enough time up front you can build perfect definitions, but this is simply not true – in practice it results in technical debt, experienced as exceptions, inefficient change management, and the rise of shadow systems.

    It’s important to note that dynamic doesn’t mean ‘always changing’; rather, it means the latent capacity for change. It’s a requirement for enterprise agility.

    Really appreciate you surfacing these notions!

    Dave

  8. Scott Meyer on August 1st, 2011 8:12 pm

    “before the program is written” seems a bit vague.

    In a world that includes “ALTER TABLE”, it seems like a better distinction is how hard it is to change the schema.

    In a traditional SQL implementation where the table translates more-or-less directly into physical row structures, a schema change is a big hairy fire drill in which the database has to touch all the physical data.

    In something like Freebase which you would characterize as “dynamic schema”, a schema change amounts to appending the graph primitives necessary to describe the change. Basically instantaneous.

    So I’d suggest that a better distinction would be based on the effect of a schema change on the physical data. Basically, either you have a layer of indirection there (dynamic schema), or you don’t (static schema).

    Of databases that have indirection, I think that there’s a further distinction between those that have some manipulable representation of schema which relates to physical storage, and those which do not. If your “schema” amounts to whatever JSON keys are present in your blobs of JSON data then querying becomes much more akin to search. In the first case, the indirection is physical (like a pointer), in the second it is symbolic (a dictionary lookup).
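Scott's "symbolic indirection" case can be illustrated with a toy example: when the schema is just whatever JSON keys happen to be present, both schema discovery and querying become per-record dictionary probes, closer to search than to a scan over fixed rows (record contents here are invented):

```python
# Records with schema-as-JSON-keys: no two are guaranteed to agree.
records = [
    {"user": "ada", "clicks": 3},
    {"user": "bob"},                 # no 'clicks' key at all
    {"user": "eve", "clicks": "7"},  # same key, different type
]

# The "schema" is only observable after the fact, by inspecting the data.
observed_schema = sorted({k for r in records for k in r})
print(observed_schema)  # ['clicks', 'user']

# A query must probe each record symbolically and coerce types late.
total = sum(int(r["clicks"]) for r in records if "clicks" in r)
print(total)  # 10
```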

  9. Gregory Primosch on August 2nd, 2011 3:53 pm

    Great discussion, guys. Keep in mind that there is a considerable amount of leeway, even within an RDBMS, as to how specific you want to get in your schema design. Silverston talks about generic models in his books. As long as you have experienced data modelers, you can come up with a nice generic schema that will allow you to defer schema decisions until you need to materialize them for querying, all without having to make the leap into NoSQL.

  10. Paul Boal on August 2nd, 2011 4:56 pm

    I really like the comparison here between databases and programming languages. Strongly typed language = fixed schema / weakly typed = dynamic schema. It’s important to keep in mind that there are degrees of flexibility in both contexts and tradeoffs in both contexts. And… even in weakly typed programming languages, there are still base types with basic features from which everything else is typically derived.

    I also think that it’s important to remember that there’s a concept in databases that is similar to “interfaces” in programming languages. Interfaces have to provide a certain level of stability so that behavior is predictable. I think that if start to think about databases more in that same way (private logical model / public interface model) it will encourage us to have conversations about design first… and then technology.

  11. Enrico on August 3rd, 2011 11:43 am

    I prefer the approach that defines a variable-schema database as an extension of a fixed-schema one; “variable schema” would include XML and poly/multi-structured data.

    The “schemaless” characterization of a database looks rather naive, since in my understanding it goes against the mathematical theory underlying databases.

  12. Neil Hepburn on August 8th, 2011 10:00 am

    Relatively few application developers are responsible for data analytics (i.e. working in the data warehouse). Allowing schemas to be flexible and “dynamic” effectively just shifts the workload of analyzing the data from the application developers to the ETL developers, who will be forced to re-map and cleanse the data, typically with suboptimal results.

    I’m still surprised at the short shrift “reporting” gets in a world where all business and government decisions are supposed to be “fact-based”.

    Take, for example, Pandora. Imagine if they changed the attributes they classify songs by every year. Say you wanted to listen to some uptempo jazz: you might only hear songs cataloged in 2004, because the following year that attribute was dropped. That would be a poor experience for users. But precisely because consumers would be exposed to the effects of poor data management, Pandora is careful not to muck around with its data like this. In many businesses, however, IT will do so with abandon.

    It’s all about trade-offs. But from where I stand, the cost of application development is only going down, whereas the cost of analytics is only going up.

  13. Curt Monash on August 8th, 2011 3:11 pm

    Neil,

    Analytic schemas are going to change no matter what, because of derived data. Do an analysis; learn something important to derive; add it to the schema for future use.

    Now, that may be more of an issue for investigative analytics and operational analytics than for ordinary BI. But also — BI is in itself application development. And the apps developed probably have or should have shorter shelf lives than typical operational applications.

  14. Stephen Pace on August 25th, 2011 1:22 pm

    Maybe this is just a matter of semantics, but I struggle with your comment “relational databases are fixed-schema” in your split between fixed and dynamic schemas. This seems to imply that relational schemas cannot be dynamic, and I would disagree with that.

    For example, there are a lot of commercial tools out there that use flexible modeling techniques like Entity-Attribute-Value modeling to generate dynamic schemas. To the extent the EAV store is ‘fixed’, if adding an entity to the fixed store results in an automatic downstream change to another relational table, is the table that was changed static or dynamic? At what point does a ‘continuously generated static’ structure meet your definition of ‘dynamic’?

    Is it timing? I could almost agree for the Oracle Designer case. That is, if I define a table, it stores that definition in a fixed EAV structure and when I’m ready, it creates the DDL for me and the table—‘dynamically’. But it quickly becomes a non-trivial problem to change those tables when there are millions or billions of rows sitting there, not to mention the ETL and BI processes hard-coded to those tables at each end. However, if you take a solution like Kalido that takes a high-level graphical model and converts it to metadata that drives all of the downstream processing (stored EAV), and where updates to that model (within the typical framework of dev, test, production) can modify staging, load routines, update reporting schemas, and even update the BI tool semantic layer (e.g. a Business Objects Universe)—all while you are three years into your existing 20TB warehouse–then I think the distinction becomes blurred.

    Taken another way, I think most of us recognize the need to adapt systems to the ever-increasing pace of change, and flexible modeling ideas and tools, coupled with agile development methodologies, have certainly helped with that. However, whether you are Phasic Systems using NoSQL to achieve that flexibility, or Kalido doing it with associative relational, or QlikView using in-memory associative schemas, the true measure is really how long it takes to go from a customer requirement to an end-to-end implementation.

    Maybe I’m reading more into it than there is (because ‘dynamic’ is a sexier word than ‘fixed’), but your terminology seems to imply that relational technology is somehow worse and less flexible than some newer technologies, but given available tools and techniques, I think relational technology can be made to be just as flexible and ‘dynamic’—while at the same time delivering great performance.

  15. Curt Monash on August 25th, 2011 1:47 pm

    Stephen,

    You’re making the case that with the right tools, one can change fixed schemas pretty easily.

    Conversely, one could imagine a tool that defines a fixed schema for a NoSQL programmer.

    I don’t think either one makes the terminological distinction useless.

  16. Oracle NoSQL is unlikely to be a big deal | DBMS 2 : DataBase Management System Services on September 30th, 2011 1:20 pm

    [...] Dynamic schemas. This is the only one of the three that truly depends on NoSQL. [...]

  17. Robert Brewer on December 13th, 2012 5:36 pm

    Late to the party here, but I’ve been using the terms “prescriptive schema” vs. “descriptive schema” to approach the problem. Classical RDBMS tend to be prescriptive (your data must fit into our ideal schema, or we error out). A descriptive schema tells you what the data looks like after the fact (and should give the user tools to coerce it back toward an ideal, if needed).

  18. Attack of the Frankenschemas | DBMS 2 : DataBase Management System Services on January 28th, 2013 9:04 am

    [...] *I now call that dynamic schemas. [...]

  19. The refactoring of everything | DBMS 2 : DataBase Management System Services on July 23rd, 2013 7:16 am

    [...] and dynamic schemas turn that around. For any one application, application design and database design are tightly [...]

  20. Timm Rüger on August 13th, 2013 8:22 am

    Thanks for the insightful discussion! It reminded me of a technique we used to get around the “fixedness” of relational schemas and the cumbersome refactoring of ETL processes in DWHs. We called it the “universal” or “generic” schema: it consisted of only a few tables, or even a single table, with a structure similar to (key:char, attribute:char, value:char). We basically pivoted all tables, both to reduce the number of tables and to be able to introduce, remove, or change attributes or entities without changing too much of the existing ETL code. With it we also had the concept of late typing, i.e. casting to the native type only when required (e.g. to sum up values for an attribute). Of course the flexibility was bought with additional CPU and storage costs. But who cares about CPU and storage these days?
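The generic schema with late typing that Timm describes can be sketched in a few lines (an illustrative toy using sqlite3; the table contents and attribute names are invented):

```python
import sqlite3

# One generic table of (key, attribute, value) strings: new attributes
# or entities are just new rows, never DDL changes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (key TEXT, attribute TEXT, value TEXT)")
rows = [
    ("order1", "customer", "Ada"),
    ("order1", "amount", "19.90"),
    ("order2", "customer", "Bob"),
    ("order2", "amount", "5.10"),
    ("order2", "discount", "0.50"),  # new attribute: no schema change needed
]
conn.executemany("INSERT INTO facts VALUES (?, ?, ?)", rows)

# Late typing: everything is stored as char; cast to a native type
# only at query time, when the attribute's meaning is actually needed.
(total,) = conn.execute(
    "SELECT SUM(CAST(value AS REAL)) FROM facts WHERE attribute = 'amount'"
).fetchone()
print(total)  # 25.0
```

The flexibility is real, but so are the costs Timm mentions: every query pays for the casts, and the single table grows with every pivoted row.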

  21. Curt Monash on August 13th, 2013 11:25 am

    Yikes! Relational purists hate that kind of dodge. :)

  22. Timm Rüger on August 14th, 2013 6:11 am

    Sure, relational purists also hate star schemas and love normalization, which greatly drives up the effort in EDWH implementations (intentionally?). Besides, it is very questionable whether normalization is still more efficient in the days of DBMSs that feature compression or data skipping.

    Look at all the industry models, like Teradata’s FSLDM – they more or less get mapped 1:1 to tables (by the professional services teams of the model vendors). You end up with hundreds if not thousands of tables. Of course, one of the selling points of these models is that you do not need to change them much, as they represent “crystallized” domain knowledge. A large European bank thought the same, and is now approaching thousands of schema changes over the last five years. And most of the changes required a real data migration (which can be rather tricky and time-consuming). That’s a lot of pain, even for classical DWHs that deal only with relational data.

  23. Curt Monash on August 14th, 2013 4:33 pm

    Timm,

    I suspect we’re pretty much on the same side here. :)

  24. Schema-on-need | DBMS 2 : DataBase Management System Services on October 30th, 2013 10:30 am

    [...] Dynamic schemas (July, 2011) Categories: Data models and architecture, Data warehousing, MongoDB and 10gen, PostgreSQL, Schema on need, Structured documents  Subscribe to our complete feed! [...]

  25. DBMS2 revisited | DBMS 2 : DataBase Management System Services on March 23rd, 2014 7:53 am

    [...] NoSQL and dynamic schemas. [...]

  26. What those nested data structures are about | DBMS 2 : DataBase Management System Services on March 28th, 2014 4:32 am

    [...] there’s a strong dynamic schema flavor to these databases. The list of attributes for one web click might be very different in kind [...]

  27. NoSQL vs. NewSQL vs. traditional RDBMS | DBMS 2 : DataBase Management System Services on March 28th, 2014 1:18 pm

    [...] model. Increasingly often, dynamic schemas seem preferable to fixed ones. Internet-tracking nested data structures are just one of the [...]

  28. John Held on April 10th, 2014 8:05 pm

    Curt, I do like ‘dynamic.’ Another term, maybe more useful in analytics-focused applications than elsewhere, is ‘schema-on-read’ versus ‘schema-on-write’.

    I’m not sure whether that concept would generalize well to OLTP-type use cases, but as an analytics guy, ‘schema-on-read’ resonated with me after years of struggling with fixed schemas that presuppose what tomorrow’s analytic questions will look like.
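The schema-on-read idea John describes can be sketched as follows (a toy example; the log format and field names are invented):

```python
# Schema-on-write would parse and validate these records at load time.
# Schema-on-read stores the raw lines and imposes a structure only when
# a query asks a specific question of them.
raw_log = "2014-04-10 login ada\n2014-04-10 click bob extra-field\n"
lines = raw_log.splitlines()

def read_with_schema(line):
    # Today's analytic question only cares about date and event; trailing
    # fields we didn't anticipate at write time are simply ignored.
    date, event, *rest = line.split()
    return {"date": date, "event": event}

events = [read_with_schema(l)["event"] for l in lines]
print(events)  # ['login', 'click']
```

Tomorrow's question can apply a different `read_with_schema` to the same raw lines, which is exactly the appeal for analytics against fixed schemas that presuppose the questions.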

  29. Curt Monash on April 10th, 2014 8:19 pm

    John,

    As you point out, “schema-on-read” is a term that seems applicable mainly to analytics.

    My main use for it these days is as a precursor to my term “schema-on-need”. :) The latter is one of the more interesting recent trends in database architecture.

  30. Confluence: Performance Technologies on May 14th, 2014 1:34 am

    Big Data…

      The technology team is investigating the potenti…

  31. bernard angerer on May 19th, 2014 9:05 am

    Great Discussion (!) as it goes to the core of current software architecture!

    I would also like to point to the work that has been done in the Tuple-Space/JavaSpace community, which has a strong link to dynamic schema management.

    thanks
    best regards
    bernard


    CTO Semantic Business Solutions
    http://www.sbsvalid.info
