From time to time, I try to step back and build a little taxonomy for the variety in database technology. One effort was 4 1/2 years ago, in a pre-planned exchange with Mike Stonebraker (his side, alas, has since been taken down). A year ago I spelled out eight kinds of analytic database.
The angle I’ll take this time is to say that every sufficiently large enterprise needs to be cognizant of at least 7 kinds of database challenge. General notes on that include:
- I’m using the weasel words “database challenge” to evade questions as to what is or isn’t exactly a DBMS.
- One “challenge” can call for multiple products and technologies even within a single enterprise, let alone at different ones. For example, in this post the “eight kinds of analytic database” are reduced to just a single category.
- Even so, one product or technology may be well-suited to address a couple of different kinds of challenge.
The Big Seven database challenges that almost any enterprise faces are:
Persistent OLTP (OnLine Transaction Processing) database management. If you’re an enterprise of any size, you surely have this need. Most commonly, this need is met by a row-based relational DBMS — Oracle, IBM DB2, Microsoft SQL Server, Sybase ASE, MySQL, PostgreSQL, Progress OpenEdge et al. However,
- Some SaaS vendors stray pretty far from the standard relational paradigm. See for example my coverage of salesforce.com or Workday, Inc.
- Sometimes an object-oriented DBMS does the job (or a graph DBMS or whatever).
- Especially in internet applications, sometimes NoSQL does.
Website and network backing. When we look specifically at websites, the situation shifts somewhat. These can combine aspects of:
- OLTP. While the OLTP default is RDBMS, various NoSQL systems can be ACID for sufficiently simple transactions.
- Content management, which may be best supported by a document-oriented/dynamic schema DBMS. (And by the way, the dynamic schema need can be reflected back into the OLTP parts.)
- The tracking of user interactions, something most popular NoSQL systems — MongoDB, Couchbase, Cloudant, Cassandra, HBase et al. — are well-suited for.
What’s more, it can be unwise to combine true OLTP and user interaction tracking in a single relational database. For one horrific example, consider the September 2010 Chase outage.
Similar considerations can apply for other systems that ingest machine-generated data, e.g. from social games or sensor networks.
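To make the separation concrete, here is a minimal sketch in which orders go to a transactional store while interaction tracking goes to a separate, best-effort one. Everything in it is an assumption for illustration: the `Site` class, the `orders` table, the `events_up` flag, and the use of an in-memory SQLite database and a Python list as stand-ins for a real RDBMS and a real tracking store.

```python
import sqlite3


class Site:
    """Hypothetical website backend with two independent stores."""

    def __init__(self):
        # Stand-in for the OLTP RDBMS.
        self.oltp = sqlite3.connect(":memory:")
        self.oltp.execute(
            "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
        )
        # Stand-in for a separate interaction-tracking store.
        self.events = []
        self.events_up = True   # flip to False to simulate a tracking outage
        self.dropped = 0        # events lost while tracking was down

    def track(self, user, action):
        # Best effort: a tracking failure is counted, never raised.
        if not self.events_up:
            self.dropped += 1
            return False
        self.events.append((user, action))
        return True

    def place_order(self, customer, total):
        # The order is committed transactionally before any tracking is attempted.
        with self.oltp:
            cur = self.oltp.execute(
                "INSERT INTO orders (customer, total) VALUES (?, ?)",
                (customer, total),
            )
        self.track(customer, "order_placed")
        return cur.lastrowid
```

With this split, setting `events_up = False` loses click data but leaves `place_order` working, which is the kind of isolation the paragraph above argues for.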
In-memory cache or DBMS. It’s increasingly hard to think of a major OLTP system or web property that goes straight to persistent storage, without an intermediate in-memory layer. Or, if you do have such a system, it’s because you picked your persistent data store primarily for how well it functions when the whole working set is in RAM. I touched on some of those points in a general memory-centric data management survey last April. Beyond that, I need to learn more about caching grids of various kinds.
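As a rough illustration of what an intermediate in-memory layer does, the sketch below puts a dictionary in front of a key-value table. It is a toy under stated assumptions, not any product's design: the `ReadThroughCache` class, the `kv` table, and the `db_reads` counter are hypothetical names, and an in-memory SQLite database stands in for the persistent store.

```python
import sqlite3


class ReadThroughCache:
    """Toy in-memory layer in front of a persistent store (hypothetical names)."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}     # key -> value held in RAM
        self.db_reads = 0   # number of reads that had to go to the store

    def get(self, key):
        # Serve from RAM when possible.
        if key in self.cache:
            return self.cache[key]
        # Cache miss: read from the store and remember the result.
        self.db_reads += 1
        row = self.conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        value = row[0] if row else None
        if value is not None:
            self.cache[key] = value
        return value

    def put(self, key, value):
        # Write-through: persist first, then update the in-memory copy.
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
                (key, value),
            )
        self.cache[key] = value


def make_store():
    # SQLite here is only a stand-in for whatever persistent DBMS is in use.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)")
    return ReadThroughCache(conn)
```

Repeated `get` calls for the same key touch the store only once. Real caching layers add eviction, expiry, and invalidation across nodes, none of which is attempted here.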
Analytic support. Whether you’re focused on event monitoring, trend monitoring, or flat-out investigative analytics, there’s a lot of analysis to be done, and a lot of data stores optimized for helping you do it. Those are, of course, a major subject of this blog.
True document management. People started recording business information in document format over 5000 years ago. They never stopped. If nothing else, enterprises at least need search engines. Or they can manage their documents via systems that have other merits as well; indeed, I’ve sent more than one client in the direction of MarkLogic.
Embedded database management. Enterprises operate many systems that feature internal database management — e.g. email, computer-aided engineering of various kinds, security appliances, or most things that generate logs. Often, you can just forget about the underlying data management, figuring the system supplier has it covered. On the other hand, perhaps you should stop and think — do you want access to that data as part of your general computing environment? If so, then perhaps you should get more involved in managing or extracting it.
And of course, you may be in the business of developing against an embedded DBMS yourself. Those can take many forms. Generally, when I write about them, I focus on the kind of DBMS (e.g. in-memory or mid-range) rather than obsessing about whether a particular product happens to be sold more often through OEM than through direct channels.
Finally, there’s data integration, among your own databases (of which there are many), but also with external ones. I have some catching up to do on the various flavors of classical ETL (Extract/Transform/Load), so I’m talking with vendors again, including Informatica — but not Talend, which seems reluctant to let me talk with somebody technical, and also not the secrecy-obsessed Ab Initio folks. I probably should circle back to SnapLogic, and of course to my neglected clients at Syncsort. As for Hadoop-related data integration, I’m still figuring that out too. Several people I respect seem excited about HCatalog, and I’m pursuing that further.
One opinion I hold in data integration is that it’s increasingly important to stream updates to your analytic data store as soon as they come in, due to the general desire for low-latency analytics. I see this as something that can and sometimes should be done with the same replication technologies used for high availability, disaster recovery, and so on. More advanced ETL capabilities often aren’t needed; instead, ELT suffices.
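The replication-style approach can be sketched as a change log that the operational side appends to and that the analytic side applies incrementally. This is a simplified illustration rather than how any particular replication product works; `ChangeLog`, `AnalyticReplica`, and the `"upsert"`/`"delete"` operation names are all assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeLog:
    """Ordered list of changes emitted by the operational database (hypothetical)."""
    entries: list = field(default_factory=list)

    def append(self, op, key, row=None):
        self.entries.append((op, key, row))


@dataclass
class AnalyticReplica:
    """Analytic copy that applies log entries it has not yet seen."""
    rows: dict = field(default_factory=dict)
    applied: int = 0   # log position already applied

    def catch_up(self, log):
        # Apply only the entries past the last applied position, in order.
        new = log.entries[self.applied:]
        for op, key, row in new:
            if op == "upsert":
                self.rows[key] = row
            elif op == "delete":
                self.rows.pop(key, None)
        self.applied += len(new)
        return len(new)
```

Calling `catch_up` frequently keeps the analytic copy close to current without any transformation step, which is the sense in which plain replication can stand in for heavier ETL here.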
Overall, I think enterprises could wind up with diversity in data integration rivaling what they have in database management itself. Candidates include:
- A cosmic near-DBMS ETL suite, such as Informatica’s or Ab Initio’s. These will likely work well with …
- … complex ETL pipelines working through Hadoop.
- Something with a composite-application orientation.
- The built-in ETL of their favorite business intelligence tools.
Stay tuned for further research.