Two subjects in one post, because they were too hard to separate from each other
Any sufficiently complex software is developed in modules and subsystems. DBMS are no exception; the core trinity of parser, optimizer/planner, and execution engine merely starts the discussion. But increasingly, database technology is layered in a more fundamental way as well, to the extent that different parts of what would seem to be an integrated DBMS can sometimes be developed by separate vendors.
Major examples of this trend — where by “major” I mean “spanning a lot of different vendors or projects” — include:
- The object/relational, aka universal, extensibility features developed in the 1990s for Oracle, DB2, Informix, Illustra, and Postgres. The most successful extensions probably have been:
- Geospatial indexing via ESRI.
- Full-text indexing, notwithstanding questionable features and performance.
- MySQL storage engines.
- MPP (Massively Parallel Processing) analytic RDBMS relying on single-node PostgreSQL, Ingres, and/or Microsoft SQL Server — e.g. Greenplum (especially early on), Aster (ditto), DATAllegro, DATAllegro’s offspring Microsoft PDW (Parallel Data Warehouse), or Hadapt.
- Splits in which a DBMS has serious processing both in a “database” layer and in a predicate-pushdown “storage” layer — most famously Oracle Exadata, but also MarkLogic, InfiniDB, and others.
- SQL-on-HDFS — Hive, Impala, Stinger, Shark and so on (including Hadapt).
Other examples on my mind include:
- Data manipulation APIs being added to key-value stores such as Couchbase and Aerospike.
- TokuMX, the Tokutek/MongoDB hybrid I just blogged about.
- NuoDB’s willing reliance on third-party key-value stores (or HDFS in the role of one).
- FoundationDB’s strategy, and specifically its acquisition of Akiban.
And there are several others I hope to blog about soon, e.g. current-day PostgreSQL.
In an overlapping trend, DBMS increasingly have multiple data manipulation APIs. Examples include:
- The object/relational DBMS previously mentioned.
- The new DMLs (Data Manipulation Languages) or APIs previously mentioned over key-value stores.
- The SQL interfaces offered for a considerable number of non-SQL systems — Intersystems Cache’, MarkLogic, Hadoop (and thus HBase) and many more.
- Text search interfaces for a variety of DBMS.
- The JSON/MongoDB-compatibility interfaces that are popping up for multiple DBMS, e.g. DB2 or MarkLogic.
- FoundationDB, previously mentioned.
So will these trends take over the DBMS world?
Developing a multi-purpose DBMS is extremely difficult, and even harder if it’s layered.
- Developing any kind of DBMS is very hard.
- Developing a multi-purpose DBMS is harder yet. Try, for example, to imagine a caching and memory-management subsystem that’s optimal for multiple datatypes and DMLs at once.
- Layering carries performance costs. The best-case performance scenario is when you can optimize the flow of data all the way from client-server connection down to persistent storage, and back. Layering interferes with that.
But on the plus side, it can be great to have one DBMS handle multiple kinds of data.
- Almost irrespective of product category, there are obvious benefits to buying, installing and administering one thing that can meet multiple needs.
- Further, there are major use cases for manipulating the same data in different ways. For example:
- Almost any kind of large object is likely to have tabular metadata attached.
- Many kinds of database can, at times, be usefully addressed via full-text search.
- In scenarios where you incrementally derive and enhance data, it’s natural to want to keep everything in the same place. (That also helps with lineage, security and so on.) But derived data may be structured very differently than the raw data it’s based on.
And by the way — the more different functions a DBMS performs, the more they may need to be walled off from each other. In particular, I’ve long argued that it’s a best practice for e-commerce sites to manage access control, transactions, and interaction data in at least two separate databases, and preferably in three. General interaction logs do not need the security or durability that access control and transactions do, and there can be considerable costs to giving them what they don’t need. A classic example is the 2010 Chase fiasco, in which recovery from an Oracle outage was delayed by database clutter that would have fit better into a NoSQL system anyway. Building a single DBMS that refutes my argument would not be easy.
So will these trends succeed? The forgoing caveats notwithstanding, my answers are more Yes than No.
- Layered and multi-purpose DBMS will likely always have performance penalties, but over time the penalties should become small enough to be affordable in most cases.
- Exadata-like tiering in an otherwise integrated system seems like a smart way to avoid the traditional shared-everything vs. shared-nothing tradeoffs. Tiering could also be a good way to combine the ever more numerous kinds of storage — dish, flash, multiple levels of cache, etc.
- Machine-generated data and “content” both call for multi-datatype DBMS. And taken together, those are a large fraction of the future of computing. Consequently …
- … strong support for multiple datatypes and DMLs is a must for “general-purpose” RDBMS. Oracle and IBM been working on that for 20 years already, with mixed success. I doubt they’ll get much further without a thorough rewrite, but rewrites happen; one of these decades they’re apt to get it right.
- The refactoring of everything (July, 2013)
- JSON in DB2 (September, 2013)
- Multi-structured data support in Hadapt (September, 2013)