Object – DBMS 2 : DataBase Management System Services

Multi-model database managers

Curt Monash — Mon, 24 Aug 2015 08:07:00 +0000

I’d say:

Multi-model database management has been around for decades. Marketers who say otherwise are being ridiculous.
Thus, “multi-model”-centric marketing is the last refuge of the incompetent. Vendors who say “We have a great DBMS, and by the way it’s multi-model (now/too)” are being smart. Vendors who say “You need a multi-model DBMS, and that’s the reason you should buy from us” are being pathetic.
Multi-logical-model data management and multi-latency-assumption data management are greatly intertwined.

Before supporting my claims directly, let me note that this is one of those posts that grew out of a Twitter conversation. The first round went:

Merv Adrian: 2 kinds of multimodel from DBMS vendors: multi-model DBMSs and multimodel portfolios. The latter create more complexity, not less.

Me: “Owned by the same vendor” does not imply “well integrated”. Indeed, not a single example is coming to mind.

Merv: We are clearly in violent agreement on that one.

Around the same time I suggested that InterSystems Caché was the last significant object-oriented DBMS, only to get the pushback that they were “multi-model” as well. That led to some reasonable-sounding justification (although the buzzwords of course aren’t from me), namely:

Caché supports #SQL, #NoSQL. Interchange across tables, hierarchical, document storage.

Along the way, I was reminded that some of the marketing claims around “multi-model” are absurd. For example, at the time I am writing this, the Wikipedia article on “multi-model database” claims that “The first multi-model database was OrientDB, created in 2010…” In fact, however, by the definitions used in that article, multi-model DBMS date back to the 1980s, when relational functionality was grafted onto pre-relational systems such as TOTAL and IDMS.

What’s more, since the 1990s, multi-model functionality has been downright common, specifically in major products such as Oracle, DB2 and Informix, not to mention PostgreSQL. (But not so much Microsoft or Sybase.) Indeed, there was significant SQL standards work done around datatype extensions, especially in the contexts of SQL/MM and SQL3.

I tackled this all in 2013, when I argued:

“One database to rule them all” systems aren’t very realistic, but even so, …
… single-model systems will become increasingly obsolete.

Developments since then have been in line with my thoughts. For example, Spark added DataFrames, which promise substantial data model flexibility for Spark use cases, but more mature products have progressed in a more deliberate way.

What’s new in all this is a growing desire to re-integrate short-request and analytic processing — hence Gartner’s new-ish buzzword of HTAP (Hybrid Transactional/Analytic Processing). The more sensible reasons for this trend are:

Operational applications have always needed to accept immediate writes. (Losing data is bad.)
Operational applications have always needed to serve small query result sets based on the freshest data. (If you write something into a database, you might need to immediately retrieve it to finish the business operation.)
It is increasingly common for predictive decisions to be made at similar speeds. (That’s what recommenders and personalizers do.) Ideally, such decisions can be based on fresh and historical data alike.
The long-standing desire for business intelligence to operate on super-fresh data is, increasingly, making sense, as we get ever more stuff to monitor. However …
… most such analysis should look at historical data as well.
Streaming technology is supplying ever more fresh data.

But here’s the catch — the best models for writing data are the worst for reading it, and vice-versa, because you want to write data as a lightly-structured document or log, but read it from a Ted-Codd-approved RDBMS or MOLAP system. And if you don’t have the time to move data among multiple stores, then you want one store to do a decent job of imitating both kinds of architecture. The interesting new developments in multi-model data management will largely be focused on that need.
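To make the write-model/read-model tension concrete, here is a minimal sketch in Python. It assumes an in-memory list standing in for an append-only log and SQLite standing in for the relational read side; the event fields and table name are invented for illustration and describe no particular product.

```python
import json
import sqlite3

# Write side: append each event as a lightly structured JSON line.
# (An in-memory list stands in for an append-only log file.)
event_log = []

def record_event(event: dict) -> None:
    """Accept a write immediately, without checking it against a table schema."""
    event_log.append(json.dumps(event))

record_event({"user_name": "alice", "action": "view", "item": "book", "price": 12.5})
record_event({"user_name": "bob", "action": "buy", "item": "book", "price": 12.5})
# A new field ("coupon") can show up without any schema change on the write side.
record_event({"user_name": "alice", "action": "buy", "item": "lamp", "price": 30.0, "coupon": "X1"})

# Read side: project the log into a fixed-schema table that is easy to query.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user_name TEXT, action TEXT, item TEXT, price REAL)")
for line in event_log:
    e = json.loads(line)
    db.execute(
        "INSERT INTO events VALUES (?, ?, ?, ?)",
        (e["user_name"], e["action"], e["item"], e["price"]),
    )

revenue_by_user = dict(
    db.execute(
        "SELECT user_name, SUM(price) FROM events "
        "WHERE action = 'buy' GROUP BY user_name"
    ).fetchall()
)
```

The copy loop in the middle is exactly the data movement that a single store imitating both architectures would have to hide.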

Related links

The two-policemen joke seems ever more relevant.
My April, 2015 post on indexing technology reminds us that one DBMS can do multiple things.
Back in 2009 integrating OLTP and data warehousing was clearly a bad idea.

Database diversity revisited

Curt Monash — Mon, 09 Jul 2012 00:55:24 +0000

From time to time, I try to step back and build a little taxonomy for the variety in database technology. One effort was 4 1/2 years ago, in a pre-planned exchange with Mike Stonebraker (his side, alas, has since been taken down). A year ago I spelled out eight kinds of analytic database.

The angle I’ll take this time is to say that every sufficiently large enterprise needs to be cognizant of at least 7 kinds of database challenge. General notes on that include:

I’m using the weasel words “database challenge” to evade questions as to what is or isn’t exactly a DBMS.
One “challenge” can call for multiple products and technologies even within a single enterprise, let alone at different ones. For example, in this post the “eight kinds of analytic database” are reduced to just a single category.
Even so, one product or technology may be well-suited to address a couple different kinds of challenges.

The Big Seven database challenges that almost any enterprise faces are:

Persistent OLTP (OnLine Transaction Processing) database management. If you’re an enterprise of any size, you surely have this need. Most commonly, this need is met by a row-based relational DBMS — Oracle, IBM DB2, Microsoft SQL Server, Sybase ASE, MySQL, PostgreSQL, Progress OpenEdge et al. However,

Some SaaS vendors stray pretty far from the standard relational paradigm. See for example my coverage of salesforce.com or Workday, Inc.
Sometimes an object-oriented DBMS does the job (or a graph DBMS or whatever).
Especially in internet applications, sometimes NoSQL does.

Website and network backing. When we look specifically at websites, the situation shifts somewhat. These can combine aspects of:

OLTP. While the OLTP default is RDBMS, various NoSQL systems can be ACID for sufficiently simple transactions.
Content management, which may be best supported by a document-oriented/dynamic schema DBMS. (And by the way, the dynamic schema need can be reflected back into the OLTP parts.)
The tracking of user interactions, something most popular NoSQL systems — MongoDB, Couchbase, Cloudant, Cassandra, HBase et al. — are well-suited for.

What’s more, it can be unwise to combine true OLTP and user interaction tracking in a single relational database. For one horrific example, consider the September, 2010 Chase outage.

Similar considerations can apply for other systems that ingest machine-generated data, e.g. from social games or sensor networks.

In-memory cache or DBMS. It’s increasingly hard to think of a major OLTP system or web property that goes straight to persistent storage, without an intermediate in-memory layer. Or, if you do have one, it’s because you picked your persistent data store primarily for how well it functions when the whole working set is in RAM. I touched on some of those points in a general memory-centric data management survey last April. Beyond that, I need to learn more about caching grids of various kinds.

Analytic support. Whether you’re focused on event monitoring, trend monitoring, or flat-out investigative analytics, there’s a lot of analysis to be done, and a lot of data stores optimized for helping you do it. Those are, of course, a major subject of this blog. Overview posts include:

Eight kinds of analytic database (July, 2011)
Juggling analytic databases (March, 2012)

One point not emphasized in those posts — sometimes you have a really specialized analytic need that gets you looking at a corresponding DBMS, such as a graph store or maybe SciDB.

True document management. People started recording business information in document format over 5000 years ago. They never stopped. If nothing else, enterprises at least need search engines. Or they can manage their documents via systems that have other merits as well; indeed, I’ve sent more than one client in the direction of MarkLogic.

Embedded database management. Enterprises operate many systems that feature internal database management — e.g. email, computer-aided engineering of various kinds, security appliances, or most things that generate logs. Often, you can just forget about the underlying data management, figuring the system supplier has it covered. On the other hand, perhaps you should stop and think — do you want access to that data as part of your general computing environment? If so, then perhaps you should get more involved in managing or extracting it.

And of course, you may be in the business of developing to embedded DBMS yourself. Those can take many forms. Generally, when I write about them, I focus on the kind of DBMS — e.g. in-memory or mid-range — rather than obsessing about whether a particular product happens to be sold more often through OEM rather than direct channels.

Finally, there’s data integration, among your own databases (of which there are many), but also with external ones. I have some catching up to do on the various flavors of classical ETL (Extract/Transform/Load), so I’m talking with vendors again, including Informatica — but not Talend, which seems reluctant to let me talk with somebody technical, and also not the secrecy-obsessed Ab Initio folks. I probably should circle back to SnapLogic, and of course to my neglected clients at Syncsort. As for Hadoop-related data integration, I’m still figuring that out too. Several people I respect seem excited about HCatalog, and I’m pursuing that further.

One opinion I hold in data integration is that it’s increasingly important to stream updates to your analytic data store as soon as they come in, due to the general desire for low-latency analytics. I see this as something that can and sometimes should be done with the same replication technologies used for high availability, disaster recovery, and so on. More advanced ETL capabilities often aren’t needed; instead, ELT suffices.

Overall, I think enterprises could wind up with diversity in data integration rivaling what they have in database management itself. Candidates include:

A cosmic near-DBMS ETL suite, such as Informatica’s or Ab Initio’s. These will likely work well with …
… complex ETL pipelines working through Hadoop.
Replication/streaming.
Something with a composite-application orientation.
The built-in ETL of their favorite business intelligence tools.

Stay tuned for further research.

Akiban update

Curt Monash — Mon, 19 Mar 2012 10:06:52 +0000

I have a bunch of backlogged post subjects in or around short-request processing, based on ongoing conversations with my clients at Akiban, Cloudant, Code Futures (dbShards), DataStax (Cassandra) and others. Let’s start with Akiban. When I posted about Akiban two years ago, it was reasonable to say:

Akiban is in the short-request DBMS business.
MySQL compatibility is one way to access Akiban, but it’s not the whole story.
Akiban’s main point of technical differentiation is to arrange data hierarchically on disk so that many joins are “zero-cost”.
Walking the hierarchy isn’t a great way to get at data for every possible query; Akiban recognizes the need for other access techniques as well.

All of the above are still true. But unsurprisingly, plenty of the supporting details have changed.

Akiban company basics include:

20 employees.
Reasonable amounts of venture capital.
Offices in downtown Boston (in the same office complex as Cloudant).
Enterprise edition product in beta, planned for Q2 release.
Open source edition coming some time after that.
Several users whose detailed stories are in Akiban’s marketing materials (user names currently NDA).

Akiban technical basics start:

The central idea of Akiban is that if you have a hierarchy of tables in a schema, such as Customer-Order-Detail, then the rows of a child table are physically both ordered by and interleaved with the rows of its parent.
A kind of physical composite key called hKey, organized into something called a group index, then takes the place of a conventional index. Akiban says this makes many kinds of joins essentially free.*
Akiban further says that even when you don’t get that benefit, reading and joining via hKeys and the group index is usually at least as fast as it would be to use conventional b-trees. Even so, other data access approaches are on the drawing boards.
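The interleaving idea can be sketched roughly as follows. This is only an illustration, not Akiban’s actual storage format: the tuple-shaped key layout, the `rows` list standing in for on-disk order, and the `subtree` helper are all assumptions made for the example.

```python
from bisect import bisect_left

# Hypothetical hKey layout: a tuple spelling out the path from the root table
# down to the row, e.g. ("C", 1) for customer 1 and ("C", 1, "O", 10) for
# order 10 of customer 1. This is NOT Akiban's real key format.
rows = [
    (("C", 2, "O", 20), {"total": 99}),
    (("C", 1), {"name": "Ann"}),
    (("C", 1, "O", 11), {"total": 15}),
    (("C", 1, "O", 10, "D", 1), {"sku": "pen"}),
    (("C", 2), {"name": "Bo"}),
    (("C", 1, "O", 10), {"total": 40}),
]

# Sorting by key places each child row directly after its parent, which is
# the "ordered by and interleaved with" layout in miniature.
rows.sort(key=lambda r: r[0])
keys = [key for key, _ in rows]

def subtree(prefix):
    """Return the row at `prefix` plus all of its descendants in one range scan."""
    start = bisect_left(keys, prefix)
    found = []
    for key, row in rows[start:]:
        if key[: len(prefix)] != prefix:
            break  # left the contiguous run that shares this prefix
        found.append((key, row))
    return found

# Customer 1 joined with its orders and order details, with no separate index lookups.
customer_1 = subtree(("C", 1))
```

In this toy layout the Customer-Order-Detail join for one customer is a single contiguous read, which is the sense in which such a join could be called nearly free.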

Because Akiban’s great virtue is short-request join performance, its target market starts with online businesses that maintain some kind of customer profile along with transaction or presence data — for example retailers or dating services.

*For marketing purposes, the word “essentially” is usually omitted.

In its initial release, Akiban will be a single-server product, dependent on MySQL. That is:

Akiban has a reasonable amount of MySQL compatibility, but won’t promise to be a full MySQL work-alike in all edge cases.
Akiban (the company) advises you to take your existing, overstressed MySQL application and instruct its load balancer to send the worst “problem queries” to Akiban (the product). Akiban believes there’s a great chance it will execute those queries 10-100X faster.
Akiban (the company) figures that if you have a MySQL performance problem, you already are replicating MySQL data to multiple read slaves. It wants Akiban (the product) to be one more of those slaves.
Technically, Akiban (the core DBMS product) isn’t a MySQL storage engine. Rather, there’s a separate MySQL storage engine that receives the replicated data, and then sends it on to Akiban to be stored in Akiban’s structure.

Akiban’s idea of initially focusing on existing MySQL installations makes sense, because:

That’s where the pain is.
Working through MySQL lets Akiban say “Don’t worry about using new technology from a small company; if something goes wrong you can always use any other flavor of MySQL instead, including from a very large database company in Redwood Shores.”

Looking ahead, Akiban tends to conflate the ideas of:

An open source version of its product.
A version of its product that is a relational DBMS with full fit-and-finish, without relying on MySQL in any way.
A version of its product that people will use for new, greenfield applications.

I think those aspects of Akiban’s strategy could still use some refinement.

As for possible Akiban futures:

Founder/CTO Ori Herrnstadt likes to point out that Akiban’s physical architecture could be a great match for object-oriented programs, and hence Akiban would be well-suited for an object-oriented interface, presumably in the form of ORM (Object-Relational Mapping/er) transparency.
Ori also notes that Akiban’s data organization scheme would lend itself nicely to scale-out. (dbShards bases some of its scale-out on a similar concept.)
The potential second, columnar copy of the data I wrote about 2 years ago, while still a possible direction, is a much lower priority than it seemed then, due to the ability to get good performance in other ways.

One thing we haven’t talked about much is write speed. It would seem challenging for Akiban to achieve append-only/log-structured-merge kinds of write speeds. Update-in-place seems like a more suitable model. To me, that screams “solid-state storage”, but of course the reality is that plenty of high-volume MySQL sites today do update-in-place on cloudy disk-based systems. And a few of them even seem to be using Akiban.

Defining NoSQL

Curt Monash — Mon, 03 Oct 2011 00:32:02 +0000

A reporter tweeted: “Is there a simple plain English definition for NoSQL?” After reminding him of my cynical yet accurate Third Law of Commercial Semantics, I gave it a serious try, and came up with the following. More precisely, I tweeted the bolded parts of what’s below; the rest is commentary added for this post.

NoSQL is most easily defined by what it excludes: SQL, joins, strong analytic alternatives to those, and some forms of database integrity. If you leave all four out, and you have a strong scale-out story, you’re in the NoSQL mainstream.

Thus, I’d say Cassandra, HBase, MongoDB, and Couchbase are prime examples, in no particular order. Riak as well.
I might have phrased that better if I’d used a different word than simply “strong” — but hey, there was a 140-character limit, and he was on deadline.

Using NoSQL can make sense when at least one of two things is paramount: low-cost scale-out or dynamic schemas.

There are some seriously sensible use cases for dynamic schemas.
“Low-cost” generally boils down to:
- Performance.
- Open source free-like-beer.
- Not a lot of database administration.

I’ve generally given object-oriented DBMS vendors and also MarkLogic hard times whenever they consider saying they’re “NoSQL”. Reasons include:

Closed source.
Database administration overhead (even if you get good stuff for incurring that overhead, like MarkLogic’s comprehensive indexing).

Also, NoSQL started out being ACID-unfriendly.

What you give up are the query flexibility and the easily automatic data integrity of SQL-based systems. I should have added something about a mature ecosystem.

In the most recent live example, I influenced a client away from Cassandra and toward scale-out MySQL (dbShards and/or Schooner flavors, most likely). Part of the reason was the ability to do joins, which are useful in their application. Another part is that their development practices obviated any significant benefit from dynamic schemas. But perhaps the most important — or at least resonant — reason of all was that they really, really cared about .NET support.

Oracle NoSQL is unlikely to be a big deal

Curt Monash — Fri, 30 Sep 2011 18:20:53 +0000

Alex Williams noticed that there will be a NoSQL session at Oracle OpenWorld next week, and is wondering whether this will be a big deal. I think it won’t be.

There really are three major points to NoSQL.

Dynamic schemas. This is the only one of the three that truly depends on NoSQL.
Scale-out short-request processing. If you want to scale out efficiently at high request volumes, you’re best off not using all the flexibility SQL/relational DBMS offer. (In particular, you don’t want to do cross-node joins.) Not coincidentally, a number of the best scale-out offerings were built to be NoSQL.
Open source. Doing a relational DBMS is a big project. It seems easier to build NoSQL ones.
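As a rough illustration of the scale-out point, the sketch below routes every request by a single key so that each short request touches exactly one node and no cross-node join is needed. The shard count, the in-memory dictionaries standing in for servers, and the function names are all hypothetical.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size
shards = [dict() for _ in range(NUM_SHARDS)]  # each dict stands in for one node

def shard_for(customer_id: str) -> int:
    """Pick a node from a stable hash of the routing key."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

def put_order(customer_id: str, order_id: int, order: dict) -> None:
    # A customer's orders are stored on the same node as the customer,
    # so reading them back never requires talking to a second node.
    node = shards[shard_for(customer_id)]
    node.setdefault(customer_id, {})[order_id] = order

def orders_for(customer_id: str) -> dict:
    return shards[shard_for(customer_id)].get(customer_id, {})

put_order("cust-42", 1, {"total": 40})
put_order("cust-42", 2, {"total": 15})
```

The price of this design is the flexibility given up: any query that needs rows belonging to different routing keys has to be answered some other way.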

Oracle can address the latter two points as aggressively as it wishes via MySQL. It so happens I would generally recommend MySQL enhanced by dbShards, Schooner, and/or dbShards/Schooner, rather than Oracle-only MySQL … but that’s a detail. In some form or other, Oracle’s MySQL is a huge player in the scale-out, open source, short-request database management market.

So that leaves us with dynamic schemas. Oracle has at least four different sets of technology in that area:

As Workday noticed years ago, MySQL can be used as a functional, basic key-value store.
Oracle also has XML-based Berkeley DB/SleepyCat kicking around.*
The XML extensions to Oracle’s core DBMS could be alleged to have a dynamic schema/NoSQL flavor. (Blech.)
A dynamic schema argument could also be made for object-oriented DBMS technology. While Oracle doesn’t to my knowledge exactly sell that, it does have the Tangosol Coherence line of technology, with a potentially similar programming model.
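For the first of those, here is a minimal sketch of a relational engine used as a basic key-value store. SQLite stands in for MySQL so the example is self-contained, and the table and function names are invented; nothing here is meant to describe how Workday actually does it.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")  # SQLite as a stand-in for MySQL
# One narrow table: an opaque key and a serialized value. The database
# knows nothing about what is inside the value.
db.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT NOT NULL)")

def put(key: str, value: dict) -> None:
    """Store or overwrite the value for a key."""
    db.execute(
        "INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)",
        (key, json.dumps(value)),
    )

def get(key: str):
    """Return the stored value, or None if the key is absent."""
    row = db.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
    return None if row is None else json.loads(row[0])

put("worker:1", {"name": "Ann"})
# The "schema" changes simply because the application wrote a new field.
put("worker:1", {"name": "Ann", "dept": "HR"})
```

All of the structure lives in the application, which is why this counts as a dynamic-schema use of a fixed-schema product.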

If Oracle is now refreshing and rebranding one or more of these as “NoSQL”, there’s no reason to view that as a big deal at all.

*That’s Mike Olson’s former company, if you’re keeping score at home.

The database architecture of salesforce.com, force.com, and database.com

Curt Monash — Thu, 15 Sep 2011 16:09:32 +0000

salesforce.com, force.com, and database.com use exactly the same database infrastructure and architecture. That’s the good news. The bad news is that salesforce.com is somewhat obscure about technical details, for reasons such as:

A long-ago marketing decision to not give infrastructure details, so as to convey a “Don’t worry; we’ll take care of everything” message.
Even so, a long-ago and perhaps now-regretted marketing decision to disclose and even exaggerate salesforce.com’s reliance on Oracle, as part of an early-days attempt to prove salesforce was using enterprise-class technology.
A desire to hide the recipe for salesforce.com’s secret sauce.
Force of habit — I’m not sure salesforce even knows how to tell its technical story with any clarity.

Actually, salesforce.com has moved some kinds of data out of Oracle that used to be stored there. Besides Oracle, salesforce uses at least a file system and a RAM-based data store about which I have no details. Even so, much of salesforce.com’s data is stored in Oracle, in a single instance that salesforce believes may be the largest instance of Oracle in the world.

Salesforce did spell out some of its database story in a 2008 force.com white paper, which is good stuff, but potentially misleading in one important way. The paper tells of a level of abstraction, whereby what the application sees as logical “columns” are stored in a very different schema than one might assume. However, it doesn’t spell out a second level of abstraction, whereby that logical schema also isn’t how the database is actually laid out.

Another flaw in the paper is that it spins “We had to do this, to support multitenancy, so we did.” issues as “Because we’re multitenant, we can do this, while single-tenant systems can’t.” One example is the query optimization step around “user visibility” in Figure 11. Welcome to marketing.

At the first level of abstraction, data seems to be kept mainly in a single wide table, with hundreds of columns. What’s more, many of those are “flex columns”; a flex column can hold data of many different kinds and even datatypes. Notwithstanding the second level of abstraction, I imagine the idea of stuffing different kinds of thing into the same column has something to do with the fact that Oracle’s physical limit on columns falls far short of the number of logical columns salesforce wants to use.

If we imagine that the different kinds of data in a flex column were each in their own column instead, the whole thing might sound like BigTable/Cassandra/HBase-style column-group NoSQL. Thus, much as Workday uses MySQL to simulate a key-value store, salesforce.com can be said to use Oracle to simulate a different kind of NoSQL. In both cases, what’s going on seems to be a kind of object/relational mapping, but with the relational aspect strongly deemphasized. Or, if you take a more relational view, we could say that salesforce.com’s tables are a lot wider than any one user organization’s, because each user sees only its own custom columns (plus the standard ones common to all users).
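A flex-column layout along those lines might be sketched as below. This is an assumption-laden toy, not salesforce.com’s actual schema: the table names, the three `flex` slots, and the per-tenant metadata table are all invented, and SQLite stands in for Oracle.

```python
import sqlite3

SLOTS = ("flex1", "flex2", "flex3")              # generic physical columns
CASTS = {"text": str, "int": int, "float": float}

db = sqlite3.connect(":memory:")
# One wide shared table; every flex column is stored as text.
db.execute("CREATE TABLE data (org_id TEXT, obj_type TEXT, flex1 TEXT, flex2 TEXT, flex3 TEXT)")
# Per-tenant metadata: which logical field lives in which slot, and its datatype.
db.execute("CREATE TABLE fields (org_id TEXT, obj_type TEXT, name TEXT, slot TEXT, dtype TEXT)")

def define_field(org_id, obj_type, name, slot, dtype):
    assert slot in SLOTS and dtype in CASTS
    db.execute("INSERT INTO fields VALUES (?, ?, ?, ?, ?)", (org_id, obj_type, name, slot, dtype))

def field_defs(org_id, obj_type):
    return db.execute(
        "SELECT name, slot, dtype FROM fields WHERE org_id = ? AND obj_type = ?",
        (org_id, obj_type),
    ).fetchall()

def insert_record(org_id, obj_type, values):
    slot_of = {name: slot for name, slot, _ in field_defs(org_id, obj_type)}
    row = dict.fromkeys(SLOTS)  # unused slots stay NULL
    for name, value in values.items():
        row[slot_of[name]] = str(value)
    db.execute(
        "INSERT INTO data VALUES (?, ?, ?, ?, ?)",
        (org_id, obj_type, row["flex1"], row["flex2"], row["flex3"]),
    )

def read_records(org_id, obj_type):
    """Turn physical flex columns back into the tenant's logical, typed fields."""
    defs = field_defs(org_id, obj_type)
    records = []
    cursor = db.execute(
        "SELECT flex1, flex2, flex3 FROM data WHERE org_id = ? AND obj_type = ?",
        (org_id, obj_type),
    )
    for flex1, flex2, flex3 in cursor:
        raw = {"flex1": flex1, "flex2": flex2, "flex3": flex3}
        records.append(
            {name: CASTS[dtype](raw[slot]) for name, slot, dtype in defs if raw[slot] is not None}
        )
    return records

# Two tenants reuse the same physical column for unrelated logical fields.
define_field("acme", "invoice", "amount", "flex1", "float")
define_field("acme", "invoice", "memo", "flex2", "text")
define_field("zen", "ticket", "priority", "flex1", "int")
insert_record("acme", "invoice", {"amount": 19.5, "memo": "rush"})
insert_record("zen", "ticket", {"priority": 2})
```

Here `flex1` holds a float for one tenant and an integer for another, which is the sense in which a flex column can hold data of many kinds and datatypes.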

The second layer of abstraction has a lot to do with multitenancy. If you want to stick data for many different user organizations into the same huge table, then you have to label it in some way to show who is permitted to see or update each part. Logically, this leads to a join, between one table carrying data plus a simple key showing which users/roles are entitled to see it, and a second table showing who actually is that kind of user/has that kind of role. But that join makes a lot of sense to store in a denormalized way, all the more because data is partitioned across the computer cluster in line with which user organization it actually belongs to.
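The logical join and its denormalized form can be sketched as follows. The table names, sharing keys, and users are all hypothetical, SQLite stands in for Oracle, and the sketch ignores partitioning across a cluster entirely.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE records (org_id TEXT, rec_id INTEGER, share_key TEXT, payload TEXT);
CREATE TABLE grants  (share_key TEXT, user_id TEXT);
-- Denormalized form: one row per (record, entitled user).
CREATE TABLE visible (org_id TEXT, user_id TEXT, rec_id INTEGER, payload TEXT);
""")
db.executemany("INSERT INTO records VALUES (?, ?, ?, ?)", [
    ("acme", 1, "acme:sales", "deal A"),
    ("acme", 2, "acme:exec", "forecast"),
    ("zen", 3, "zen:sales", "deal Z"),
])
db.executemany("INSERT INTO grants VALUES (?, ?)", [
    ("acme:sales", "ann"), ("acme:exec", "ann"), ("acme:sales", "bo"), ("zen:sales", "cy"),
])

def visible_via_join(org_id, user_id):
    """Logical model: join the data to the table saying who holds each sharing key."""
    return db.execute(
        "SELECT r.rec_id, r.payload FROM records r "
        "JOIN grants g ON g.share_key = r.share_key "
        "WHERE r.org_id = ? AND g.user_id = ? ORDER BY r.rec_id",
        (org_id, user_id),
    ).fetchall()

# Precompute the join once, so reads become a single-table lookup.
db.execute(
    "INSERT INTO visible "
    "SELECT r.org_id, g.user_id, r.rec_id, r.payload "
    "FROM records r JOIN grants g ON g.share_key = r.share_key"
)

def visible_denormalized(org_id, user_id):
    return db.execute(
        "SELECT rec_id, payload FROM visible "
        "WHERE org_id = ? AND user_id = ? ORDER BY rec_id",
        (org_id, user_id),
    ).fetchall()
```

The trade-off is the usual one for denormalization: reads skip the join, but every change to a grant has to be propagated into the precomputed table.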

Multitenant security isn’t the only reason for this denormalization, but it appears to be the biggest one.

The whole thing is doing 550 million or so transactions per day. salesforce.com thinks that fact should be regarded as evidence that it works.

Terminology: Dynamic- vs. fixed-schema databases

Curt Monash — Sun, 31 Jul 2011 23:02:56 +0000

E. F. “Ted” Codd taught the computing world that databases should have fixed logical schemas (which protect the user from having to know about physical database organization). But he may not have been as universally correct as he thought. Cases I’ve noted in which fixed schemas may be problematic include:

“A bunch of apps in one, similar but not the same” (in my recent post on MongoDB).
Out-of-control product catalogs (ditto).
Analytic use cases in which one keeps enhancing the database with derived data.

And if marketing profile analysis is ever done correctly, that will be a huge example for the list.

So what do we call those DBMS — for example NoSQL, object-oriented, or XML-based systems — that bake the schema into the applications or the records themselves? In the MongoDB post I went with “schemaless,” but I wasn’t really comfortable with that, so I took the discussion to Twitter. Comments from Vlad Didenko (in particular), Ryan Prociuk, Merv Adrian, and Roland Bouman favored the idea that schemas in such systems are changeable or late-bound, rather than entirely absent. I quickly agreed.

The discussion wasn’t entirely serious; wise-ass comments were contributed by at least Merv, Neil Raden, Yiorgos Adamopoulos, and myself.

I like that approach for the same reason I favor saying that databases are poly- or multi-structured (rather than un- or semi-): Every database has structure, the only question being when that structure is determined. I wouldn’t precisely equate “poly-structured” to “has a late-bound schema”; for example, I’d say that mucking with the DDL (Data Description Language) of a relational database shows that it’s a little bit poly-structured, even though it’s not at all late-bound. But the concepts are definitely related.

So what actual wording should we use here? The only alternative I see to fixed schema is “static”, and that feels like it has too much of a connotation of “unchangeable”. The simplest word I can think of for changeable/late-bound/whatever is dynamic schema; that choice also has the virtue of some traction, as per the Vlad Didenko tweet linked above. Casual googling is also supportive of “fixed” and “dynamic”, at least over whatever alternatives I came up with. So those are my choices.

For actual definitions, I’ll say:

A (logical) schema is fixed if it is defined before a program is written, but dynamic if it is defined by the program or data itself.
A database is fixed- or dynamic-schema depending on whether its schemas are fixed or dynamic respectively.
A DBMS is fixed- or dynamic-schema depending on whether databases created in it tend to have fixed or dynamic schemas respectively.
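A small Python sketch may make the first definition easier to picture. It assumes SQLite as the fixed-schema example and a plain list of dictionaries as a stand-in for a dynamic-schema store; the product fields are invented.

```python
import sqlite3

# Fixed schema: the columns are declared before any program writes data.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE product (sku TEXT PRIMARY KEY, name TEXT, price REAL)")
db.execute("INSERT INTO product VALUES (?, ?, ?)", ("p1", "lamp", 30.0))

# The schema can be read from the catalog without looking at a single row.
fixed_columns = [col[1] for col in db.execute("PRAGMA table_info(product)")]

# Dynamic schema: each record carries its own field names, so the effective
# schema is whatever the data written so far says it is.
catalog = [
    {"sku": "p1", "name": "lamp", "price": 30.0},
    {"sku": "p2", "name": "ebook", "price": 9.0, "format": "epub"},
]

def observed_schema(records):
    """Derive the schema from the data itself, i.e. bind it late."""
    fields = set()
    for record in records:
        fields.update(record)
    return sorted(fields)
```

Adding the `format` field required no declaration in the second case, whereas the first would need a DDL change before any program could write it.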

Suit yourself as to what you say about relational DBMS when they also have a bit of XML, text, or whatever support.

By these definitions:

Relational databases are fixed-schema (within the caveat above about XML or text data).
MOLAP databases are fixed-schema.
Pre-relational network and hierarchical DBMS (e.g. IMS) are fixed-schema.
Most other DBMS are dynamic-schema.

What do you think? Do these definitions work for you?

McObject and eXtremeDB

Curt Monash — Fri, 22 Jul 2011 12:32:16 +0000

I talked with McObject yesterday. McObject has two product lines, both of which are something like in-memory DBMS — eXtremeDB, which is the main one, and Perst. McObject has been around since at least 2003, probably has no venture capital, and probably has a very low double-digit number of employees.*

*I could be wrong in those guesses; as small companies go, McObject is unusually prone to secrecy games.

As best I understand:

eXtremeDB is something like an in-memory object-oriented DBMS, designed to be embeddable.
However, much as with Objectivity and other old-school OODBMS, eXtremeDB winds up being more of a toolkit with which to build DBMS than a full DBMS.
eXtremeDB has a few indexing schemes. The main one is good old B-trees. One customer wanted Patricia tries, so they’re in there. (Perhaps not coincidentally, solidDB relies on Patricia tries.) At least one wanted R-trees, so they’re in there too.
eXtremeDB has long had the option of persistent logs.
eXtremeDB newly has a hybrid memory-centric option, in which you can have more data in the database than fits into RAM.
eXtremeDB newly has multi-master two-phase-commit clustering.

My guess three years ago that eXtremeDB might emerge as an alternative to solidDB seems to have been borne out. McObject CEO Steve Graves says that the core of McObject’s business is OEMs, in sectors such as telecom equipment and defense/aerospace. That’s exactly solidDB’s traditional market, except that solidDB, after being acquired by IBM, deemphasized it.

I’ve said before that if I were starting a SaaS effort — and it wasn’t just focused on analytics — I’d look at using a memory-centric OODBMS. Perhaps eXtremeDB is worth looking at in such scenarios.

Forthcoming Oracle appliances

Curt Monash — Fri, 24 Jun 2011 06:44:56 +0000

Edit: I checked with Oracle, and it’s indeed TimesTen that’s supposed to be the basis of this new appliance, as per a comment below. That would be less cool, alas.

Oracle seems to have said on yesterday’s conference call that Oracle OpenWorld (first week in October) will feature appliances based on Tangosol and Hadoop. As I post this, the Seeking Alpha transcript of Oracle’s call is riddled with typos. Bracketed comments below are by me.

Well, we’re planning to add a couple of appliances and announcing them this fall. One appliance, that should surprise you is a large memory addition to Exadata for analytics and memory, so we continue to invest. We thought that would — we’ve been the leader of in-memory database technology ever since we bought Tungsten. [I presume that’s a typo for “Tangosol”. And it sort of denigrates Oracle TimesTen.] And that’s for both for transactions and for preprocessing. We are, as memories become cheaper and larger scale, we’ve changed as much of our algorithms and this in-memory analytics accelerator is going to be, again, coming out and we’ll be announcing it in the fall at Oracle OpenWorld.

That part, especially in connection with the last sentence of the next quote, sounds almost as if Tangosol will be positioned as a kind of memory-centric object-oriented DBMS, albeit with Oracle as its persistence layer. Well, I favor both in-memory and object-oriented DBMS, and especially the intersection of those two categories. So in principle this could be a very cool product. Exploiting that coolness, however, may require one heck of a missionary sell.

In addition, attaching to our Exalogic box, there’s a lot of misunderstanding about what’s a dupe is, and is it a replacement for database. [I presume “a dupe” is a typo for “Hadoop”.] So the dupe is not a replacement for database. It’s an adjunct to the database, which we think, is very, very important. It really is a tool for Java programmers. And we’re the world leader in Java technology and we are building a big data accelerator to attach to our Exalogic box, which comes out also this fall. The big data accelerator includes some of the standard open source heavy software, HTFF, the heavy file system and a number of other pieces, but also some Oracle components that we think can dramatically speed up the entire math-produced process. [I presume that’s a series of typos for “HDFS” and “MapReduce”.] And will be particularly attractive to Java programmers who are the ones, who asked for — aspire to do. There are some interesting applications they do, ETL is one. Log processing is another. [Those last two sentences are more evidence for the theory that this is about Hadoop. Besides, I spoke with somebody who listened to the call.] We’re going to have a lot of those features, functions and prebuilt applications in our big data accelerator. So, Oracle has always followed database technology trends, whether it’s object databases, in-memory databases and kept up with this technology and some, quite often led on innovation.

And that part sounds as if Oracle will announce a Hadoop appliance, positioning it more as a Java software accelerator than a place to store cheap data. Be the positioning as it may, my objections to the idea of a Hadoop appliance still stand, although Amr Awadallah’s counterarguments make sense as well.

When it’s still best to use a relational DBMS

Curt Monash — Sun, 29 May 2011 19:56:37 +0000

There are plenty of viable alternatives to relational database management systems. For short-request processing, both document stores and fully object-oriented DBMS can make sense. Text search engines have an important role to play. E. F. “Ted” Codd himself once suggested that relational DBMS weren’t best for analytics.* Analysis of machine-generated log data doesn’t always have a naturally relational aspect. And I could go on with more examples yet.

*Actually, he didn’t admit that what he was advocating was a different kind of DBMS, namely a MOLAP one — but he was. And he was wrong anyway about the necessity for MOLAP. But let’s overlook those details.

Nonetheless, relational DBMS dominate the market. As I see it, the reasons for relational dominance cluster into four areas (which of course overlap):

Data re-use. Ted Codd’s famed original paper referred to shared data banks for a reason.
The benefits of normalization, which include:
- You only have to do the programming work of writing something once …
- … and you don’t have to do the programming work of keeping multiple versions of the information consistent.
- You only have to do the processing work of writing something once.
- You only have to buy storage to hold each fact once.
Separation of concerns.
- Different people can worry about programming and “database stuff.”
- Indeed, even performance optimization can sometimes be separated from programming (i.e., when all you have to do to get speed is implement the correct indexes).
Maturity and momentum, as reflected in the availability of:
- People.
- A broad variety of mature relational DBMS.
- Vast amounts of packaged software that “talks” SQL.
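
To make the normalization and separation-of-concerns points above a little more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The tables, column names and data are invented for illustration and are not taken from any real system. It shows a fact (a customer's city) being stored and updated exactly once, and an index being added later without any change to the query the application runs.

```python
import sqlite3


def build_demo_db():
    """Create a tiny normalized schema in memory (hypothetical example data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            city TEXT NOT NULL      -- each customer's city is stored once
        );
        CREATE TABLE orders (
            id          INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(id),
            amount      INTEGER NOT NULL
        );
    """)
    conn.execute("INSERT INTO customers VALUES (1, 'Acme', 'Boston')")
    conn.executemany(
        "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
        [(1, 10), (1, 25), (1, 40)],
    )
    return conn


def orders_with_city(conn):
    """The application's query: join orders to the single copy of the city."""
    return conn.execute("""
        SELECT o.id, c.city, o.amount
        FROM orders AS o
        JOIN customers AS c ON c.id = o.customer_id
        ORDER BY o.id
    """).fetchall()


conn = build_demo_db()

# One write corrects the fact for every order that refers to it.
conn.execute("UPDATE customers SET city = 'Chicago' WHERE id = 1")
before = orders_with_city(conn)

# "Database stuff": an index is added later; the query text is untouched.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = orders_with_city(conn)

print(before == after)
```

In a denormalized design that copied the city onto every order row, the same correction would have to touch three rows, and the application would be responsible for keeping them consistent.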

Generally speaking, I find the reasons for sticking with relational technology compelling in cases such as:

You’re building a low-volume, medium-complexity suite of applications that will evolve over time. This is the use case for which relational DBMS were invented, and they’re still great for it.
Your (duplicated) data volumes would be ridiculous if you didn’t do a reasonable amount of normalization. Once you need to normalize, you need to do joins — and if you’re doing joins, you’re in relational territory.
You simply don’t see a cost/benefit advantage to moving away from proven legacy technology. If you’re looking for an off-the-shelf answer to your needs — or if you’re inventorying your own technological shelves — relational-oriented technology has overwhelming share.

For many enterprises, that third point alone should be decisive in a large fraction of cases.

But the advantages of relational technology are less clear when you’re doing serious engineering of path-breaking new applications, where by “serious engineering” I mean:

The problem is big enough that you simply want the best solution, with only loose coupling needed to the rest of your technical environment.
Long-lasting “strategic” or legacy technology is not a great concern; you’re willing to keep “rebuilding the 747 while it’s flying” if that’s what’s necessary to get the best possible result.
You have access to sufficient quantities of sufficiently smart people.

For example:

I recently suggested that innovative SaaS vendors could adopt object-oriented database technology.
Major web applications are rarely very relational. Until recently, the default approach to scaling out web databases was memcached/sharded MySQL, hardly a whole-hearted adoption of relational technology. Now NoSQL DBMS are vigorous competitors.*
Analytic challenges that amount to teasing out signals from streams of data are sometimes handled non-relationally as well, although it’s often nice to be able to do a few joins to mix in information from more relationally-structured data.

Not coincidentally, in a lot of those cases, throwing performance concerns “over the wall” to the database administrator isn’t going to work.

*I do expect the pendulum to swing back a bit as high-performance/highly-scalable MySQL implementations mature, but there are relatively few supporting examples to date.
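
For readers who haven't seen the memcached/sharded MySQL pattern mentioned above, the following is a rough sketch of its shape, not a description of any particular site's implementation. Plain Python dictionaries stand in for both the cache and the MySQL shards, and the class and method names (ShardedStore, shard_for and so on) are hypothetical. The point is only that routing, caching and invalidation all live in application code.

```python
import hashlib


class ShardedStore:
    """Toy stand-in for a cache in front of key-sharded databases.

    Dictionaries replace memcached and the MySQL shards; all names here
    are illustrative assumptions.
    """

    def __init__(self, shard_count):
        self.shards = [dict() for _ in range(shard_count)]
        self.cache = {}
        self.shard_reads = 0  # counts reads that had to go to a shard

    def shard_for(self, key):
        # The application, not the DBMS, decides where a row lives.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, value):
        self.shards[self.shard_for(key)][key] = value
        self.cache.pop(key, None)  # invalidate any stale cached copy

    def get(self, key):
        if key in self.cache:  # cache hit: no shard access at all
            return self.cache[key]
        self.shard_reads += 1
        value = self.shards[self.shard_for(key)].get(key)
        if value is not None:
            self.cache[key] = value  # populate the cache on a miss
        return value


store = ShardedStore(shard_count=4)
store.put("user:42", {"name": "Ada"})
first = store.get("user:42")   # miss: read from the shard, then cached
second = store.get("user:42")  # hit: served from the cache
print(first == second, store.shard_reads)
```

In a design like this each lookup touches a single shard by key, so anything resembling a join across shards would have to be written by the application programmer, and so would any performance tuning of the routing and caching logic.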

To look at it another way, it’s right to be skeptical about relational DBMS when you can defeat all of the reasons to favor them. For example:

Data re-use may not arise when applications are self-contained and rapidly-changing.
Sometimes you don’t need to normalize your data.
It’s not obvious that the relational approach to separation of concerns is the best one. Perhaps you’d be better off with the people who understand a specific application best being responsible for all the decisions connected with it.
As for that maturity and momentum:
- People don’t actually learn much SQL in school.
- Are any of the mature relational DBMS what you really want?
- Is any of that packaged software out there really helpful for your specific problem?

I should probably stop there. But in an appeal to authority, I’ll close instead with a quote from Codd’s own OLAP paper:

IT should never forget that technology is a means to an end, and not an end in itself. Technologies must be evaluated individually in terms of their ability to satisfy the needs of their respective users. IT should never be reluctant to use the most appropriate interface to satisfy users’ requirements. Attempting to force one technology or tool to satisfy a particular need for which another tool is more effective and efficient is like attempting to drive a screw into a wall with a hammer when a screwdriver is at hand: the screw may eventually enter the wall but at what cost?

Related link

My exchange with Mike Stonebraker highlighting our shared advocacy for database diversity