June 18, 2014

Using multiple data stores

I’m commonly asked to assess vendor claims of the kind:

So I thought it might be useful to quickly review some of the many ways organizations put multiple data stores to work. As usual, my bottom line is:

Horses for courses

It’s now widely accepted that different data managers are better for different use cases, based on distinctions such as:

Vendors are part of this consensus; already in 2005 I observed

For all practical purposes, there are no DBMS vendors left advocating single-server strategies.

Vendor agreement has become even stronger in the interim, as evidenced by Oracle/MySQL, IBM/Netezza, Oracle’s NoSQL dabblings, and various companies’ Hadoop offerings.

Multiple data stores for a single application

We commonly think of one data manager managing one or more databases, each in support of one or more applications. But the other way around works too; it’s normal for a single application to invoke multiple data stores. Indeed, all but the strictest relational bigots would likely agree: 

The spectacular 2010 JP Morgan Chase outage was largely caused, I believe, by disregard of these precepts.

There also are cases in which applications dutifully get all their data via SQL queries, but send those queries to two or more DBMS. Teradata is proud that its systems can support rather transactional queries (for example in call-center use cases), but the same application may read from and write to a true OTLP database as well.

Further, many OLTP (OnLine Transaction Processing) applications do some fraction of their work via inbound or outbound messaging. Many buzzwords can come into play here, including but not limited to:

Finally, every dashboard that combines information from different data stores could be assigned to this category as well.

Multiple storage approaches in a single DBMS

In theory, a single DBMS could operate like two or more different ones glued together. A few functions should or must be centralized, such as administration, and communication with the outside world (connection handling, parsing, etc.). But data storage, query execution and so on could for the most part be performed by rather loosely coupled subsystems. And so you might have the best of both worlds — something that’s multiple data stores in the ways you want that diversity, but a single system in how it fits into your environment.

I discussed this idea last year with cautious optimism, writing:

So will these trends succeed? The forgoing caveats notwithstanding, my answers are more Yes than No.

  • …  multi-purpose DBMS will likely always have performance penalties, but over time the penalties should become small enough to be affordable in most cases.
  • Machine-generated data and “content” both call for multi-datatype DBMS. And taken together, those are a large fraction of the future of computing. Consequently …
  • … strong support for multiple datatypes and DMLs is a must for “general-purpose” RDBMS. Oracle and IBM [have] been working on that for 20 years already, with mixed success. I doubt they’ll get much further without a thorough rewrite, but rewrites happen; one of these decades they’re apt to get it right.

In 2005 I had been more ambivalent, in part because my model was a full 1990s-dream “universal” DBMS:

IBM, Oracle, and Microsoft have all worked out ways to have integrated query parsing and query optimization, while letting storage be more or less separate. More precisely, Oracle actually still sticks everything into one data store (hence the lack of native XML support), but allows near-infinite flexibility in how it is accessed. Microsoft has already had separate servers for tabular data, text, and MOLAP, although like Sybase, it doesn’t have general datatype extensibility that it can expose to customers, or exploit itself to provide a great variety of datatypes. IBM has had Oracle-like extensibility all along, although it hasn’t been quite as aggressive at exploiting it; now it’s introduced a separate-server option for XML.

That covers most of the waterfront, but I’d like to more explicitly acknowledge three trends:

Related links

Comments

4 Responses to “Using multiple data stores”

  1. David Gruzman on June 19th, 2014 4:32 am

    I agree, that especially in the world of big data there is no silver bullet. We have number of “full table scan” engines like Impala, Hive, Presto, etc as well as “index access” engines also called NoSQLs (too much to list). Each one in both families has its own strength and weaknesses.
    In the same time duplicating (or triplicating) data among engines is expensive.
    I think we will get to the multiple engine – single data paradigm, where different engines are looking into data stored in some open format. It is problematic for mutating data, but IMHO it is feasible for read-only data…

  2. Chuck Bear on June 28th, 2014 10:20 am

    As a company man, I can’t help but point out an old work by one of the founders.

    http://cs.brown.edu/~ugur/fits_all.pdf

  3. S Madhu on October 5th, 2015 11:38 am

    We tried something along these lines at Sybase back in the 90s. We were playing around with microkernels and started a proto to split the monolithic rdbms into a set of services. Some services as you pointed out are common, but the idea was that multiple data stors could be accomodated. Performance did suffer but not too badly.
    To compete with Informix, ASE was being enhanced to accomodate text, geospatial and time series data stores but wih a postgres style foreign data wrapper strategy. The desire for a more coherent approach led to this experiment.

    IBM worked with us to see if these services could be run as OS level services on top of the os2 3.0 microkernel. So these services would sit as peers to
    the aix and os2 virtual machines.

    A little too ahead of its time ! Thinking of reviving the approach in open source, kudu has some
    useful components for this approach. Drizzle tried the same approach if I am not mistaken.

    Let me see if I can get some fellow co-conspirators.

  4. Curt Monash on October 5th, 2015 2:16 pm

    Hiya!

    The impression I got from Jacob Stein was that performance wasn’t good at all, in large part due to the way different kinds of data(type) management wanted to use RAM. Was this perhaps a different Sybase project trying to meet the same market challenge?

Leave a Reply




Feed: DBMS (database management system), DW (data warehousing), BI (business intelligence), and analytics technology Subscribe to the Monash Research feed via RSS or email:

Login

Search our blogs and white papers

Monash Research blogs

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.