“Data integration” can mean many different things, to an extent that’s impeding me from writing about the area. So I’ll start by simply laying out some of the myriad ways that data can be brought to where it is needed, and worry about other subjects later. Yes, this is a massive wall of text, and incomplete even so — but that in itself is my central point.
There are two main paradigms for data integration:
- Movement or replication — you take data from one place and copy it to another.
- Federation — you treat data in multiple different places logically as if it were all in one database.
Data movement and replication typically take one of three forms:
- Logical, transactional, or trigger-based — sending data across the wire every time an update happens, or as the result of a large-result-set query/extract, or in response to a specific request.
- Log-based — like logical replication, but driven by the transaction/update log rather than the core data management mechanism itself, so as to avoid directly overstressing the DBMS.
- Block/file-based — sending chunks of data, and expecting the target system to store them first and only make sense of them afterward.
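To make the contrast with logical replication concrete, here is a minimal sketch of the log-based approach, in Python, with everything (the `LogRecord` shape, the in-memory "log" and target) hypothetical: a replicator tails the transaction log and replays each change into the target, rather than issuing queries or firing triggers against the source DBMS.

```python
# Toy change-data-capture loop: replay transaction-log records into a
# target store, tracking a checkpoint (last applied log sequence number)
# so the source system does no work beyond its normal logging.

from dataclasses import dataclass

@dataclass
class LogRecord:
    lsn: int           # log sequence number
    op: str            # "insert", "update", or "delete"
    key: str
    value: object = None

def apply_log(target: dict, log: list, from_lsn: int) -> int:
    """Replay all log records after from_lsn into the target store.

    Returns the last LSN applied, which a real replicator would persist
    as its checkpoint before the next polling cycle."""
    last = from_lsn
    for rec in log:
        if rec.lsn <= from_lsn:
            continue  # already applied in an earlier cycle
        if rec.op in ("insert", "update"):
            target[rec.key] = rec.value
        elif rec.op == "delete":
            target.pop(rec.key, None)
        last = rec.lsn
    return last
```

The checkpoint is what lets the replicator resume after a failure without reprocessing the whole log, which is much of why log-based replication scales better than re-querying the source.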
Beyond the core functions of movement, replication, and/or federation, there are other concerns closely connected to data integration. These include:
- Transparency and emulation, e.g. via a layer of software that makes data in one format look like it’s in another. (If memory serves, this is the use case for which Larry DeBoever coined the term “middleware.”)
- Cleaning and quality — with new uses of data can come new requirements for accuracy.
- Master, reference, or canonical data — ensuring that different applications and databases agree on a single authoritative version of key entities, such as customers or products.
- Archiving and information preservation — part of keeping data safe is ensuring that there are copies at various physical locations. Another part can be making it logically tamper-proof, or at least highly auditable.
In particular, the following use cases are largely distinct from each other.
Local replication, for high availability and/or read scalability. If you have an important database application — SQL or non-SQL as the case may be — and you want to ensure its availability, you may choose to have a second machine capable of running it, on standby, possessed of a fully and reliably current copy of the data. These needs are usually met by local, synchronous replication, where “synchronous” means that an update is not committed on the first machine until there’s assurance it will also go through on the second one. (Fortunately, there are faster forms of synchronous replication than two-phase commit.)
In some cases, you also need the application to run against several copies of the data, for performance. That can be achieved with similar technology.
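The synchronous-commit property can be sketched in a few lines of Python. This is a deliberately toy primary/standby pair (the classes are hypothetical; real systems ship log records over a network, not whole writes): the point is simply that the primary does not report a commit as successful until the standby has acknowledged the same update.

```python
# Toy synchronous replication: commit on the primary succeeds only
# after the standby acknowledges the update.

class Standby:
    def __init__(self):
        self.data = {}

    def apply(self, key, value) -> bool:
        self.data[key] = value
        return True   # acknowledgement back to the primary

class Primary:
    def __init__(self, standby: Standby):
        self.data = {}
        self.standby = standby

    def commit(self, key, value) -> bool:
        # Synchronous: ship the update and wait for the ack *before*
        # making the write visible locally. If the standby cannot
        # acknowledge, the commit fails (real systems time out or
        # degrade to asynchronous mode instead).
        if not self.standby.apply(key, value):
            return False
        self.data[key] = value
        return True
```

Asynchronous replication would differ only in that `commit` returns immediately and the standby is brought up to date in the background, which is exactly why it tolerates wide-area latencies at the cost of a replication lag.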
Remote replication, most commonly (but not only) for disaster recovery. If you really, really want to ensure application availability, you may also choose to have a live system ready to go at a different physical location. In that case you still want best-realistic-speed for data replication, but it needs to be asynchronous — your updates on the main system can’t afford to wait for acknowledgements to be sent back and forth across a wide-area network.
Synchronously replicated or not, you might want to send work to your high-availability or disaster recovery database copies, for performance, to the extent that you have them anyway. In particular, asynchronous replication is fast enough for almost any analytic use case.
Low-latency replication, to populate an analytic database. It’s increasingly desirable to stream/trickle operational updates straight into an analytic database. But the two database management systems involved are likely to be different. That places additional demands on the replication technology beyond what is needed for replicating like-to-like.
Rebalancing, in a shared-nothing and/or adaptable DBMS. Databases are not always kept in single, homogenous systems. That can create the need to move or copy data from one place to another. An obvious example is when you add, delete, or repair nodes in a shared-nothing system. Another is when data is moved among different tiers of storage (e.g. solid-state vs. hard-disk).
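A quick way to see why adding a node forces data movement is to simulate naive hash placement. Everything in this sketch is hypothetical (the keys, the placement function), but the arithmetic is the point: with simple modulo placement, growing a cluster from 3 to 4 nodes relocates roughly three quarters of the rows, and schemes such as consistent hashing exist largely to shrink that fraction toward 1/n.

```python
# Which rows must move when a shared-nothing cluster grows from
# old_n to new_n nodes, under naive hash-modulo placement?

import hashlib

def node_for(key: str, n_nodes: int) -> int:
    """Deterministically assign a key to a node by hashing it."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n_nodes

def rows_to_move(keys, old_n: int, new_n: int):
    """Return the keys whose home node changes after the cluster grows."""
    return [k for k in keys if node_for(k, old_n) != node_for(k, new_n)]
```

Running `rows_to_move` over a few thousand synthetic keys for a 3-to-4-node expansion moves about 75% of them, which is why rebalancing is a first-class engineering problem rather than an afterthought in shared-nothing systems.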
Cross-database information lifecycle management (ILM). Sometimes you rebalance among different databases, managed by different hardware or software. Even if we assume that off-line storage isn’t involved — “disk is the new tape” — general ILM is a lot more complicated than the single-DBMS kind.
ETL (Extract/Transform/Load). In the simplest cases, ETL takes data from one database and puts it in another, often on a batch basis, sometimes in a trickle/stream. But unlike what we call “replication”, ETL also allows significant changes to data along the way. Even ETL distinguished primarily by performance puts data through complex processing pipelines.
Conceptually, and irrespective of what really is or isn’t going on, it’s probably easier to think of ETL as something you copy data into and then back out of than as simply a set of pipes.
ELT (Extract/Load/Transform). On the other hand, sometimes ETL’s main function is indeed piping, with the transformation happening after the data gets to its new location. ELT can be appealing when (for example) the destination is a cheap, efficient analytic DBMS. Also, a particularly rich form of ELT is possible using Hadoop (or other) MapReduce.
Often, it’s reasonable to say that the “E” and “L” parts of ELT are done via ETL technology, while the “T” is done via something else.
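The ETL/ELT distinction fits in a few lines of Python. This is a toy contrast (the extract source, the cleanup transform, and the list-based "databases" are all hypothetical): in ETL the transform runs in flight, before the load; in ELT the raw rows land first and the transform runs on the destination side.

```python
# Toy contrast of ETL vs. ELT over the same extract and transform.

def extract():
    """Pretend source rows, straight out of an operational system."""
    return [{"name": " Alice ", "amount": "10"},
            {"name": "BOB",     "amount": "32"}]

def transform(row):
    """Cleanup: trim/normalize names, coerce amounts to integers."""
    return {"name": row["name"].strip().title(),
            "amount": int(row["amount"])}

def etl(destination: list):
    # ETL: transform happens in the pipeline, before the load.
    for row in extract():
        destination.append(transform(row))

def elt(staging: list, destination: list):
    # ELT: raw load first; the transform then runs "inside" the
    # destination system, e.g. as SQL against the staging tables.
    staging.extend(extract())
    destination.extend(transform(r) for r in staging)
```

Both paths end with identical clean rows; the difference is where the compute for `transform` lives, which is exactly why ELT appeals when the destination is a cheap, powerful analytic DBMS.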
Data mart spin-out. When he was at eBay, Oliver Ratzesberger made a big deal of the ability to spin out a data mart very quickly. There are two kinds of ways to do this:
- Virtual data mart spin-out, with no physical data movement at all. (This is Oliver’s preferred way.)
- Physical data mart spin-out, based on copying the data. Greenplum, inspired by then-customer eBay, was probably the first to make a big fuss about that.
One way and/or the other, fast data-mart spin-out has become an important — albeit still forward-leaning — feature for analytic DBMS.
Business intelligence tools, querying multiple databases. Often, enterprises have looked to BI to achieve what they see as data integration. It’s pretty straightforward for a BI tool to query multiple relational databases. This is not exactly the same thing as doing ETL to support a BI project, or even as selling BI and ETL more or less bundled together.
Indexing, search, and/or query of external databases, from within a particular data store. Sometimes it’s the data store itself that reaches out to other databases. Text search engines are particularly likely to index information stored in other places — and by the way, in the case of text, the index usually holds a complete copy of the information being indexed. But relational DBMS have occasionally-used “external table” functionality as well.
Different storage engines in the same DBMS. Frequently, the makers of a single database management system find it advantageous to have two or more rather different storage engines under the covers. The base case is that some data gets put in one engine, some in another, and that’s that. But in a few cases, data might move from one engine to another. An example of the latter strategy is Vertica, which features both a write-optimized store (in memory) and a read-optimized store (what we really think of as Vertica).
Bidirectional Hadoop (HDFS)/DBMS connectors. A variant on these two approaches is the bidirectional Hadoop connectors that various DBMS vendors have announced. Details vary, but functionality can include the ability to do DBMS queries that incorporate Hadoop data and/or Hadoop jobs that directly access the DBMS.
Service-Oriented Architecture (SOA), Enterprise Application Integration (EAI), Enterprise Service Bus (ESB), Composite applications … There are a whole lot of concepts that boil down to “letting loosely-coupled applications or databases — usually operational ones — exchange data as needed.” I probably should know more about these areas.
Of course, even that taxonomy understates the complexity of the subject. Most notably, various capabilities are combined in single vendors, products, or projects. For example:
- If you’re going NewSQL or NoSQL, you probably expect answers for local replication, remote replication, and rebalancing. You may want to stream to an analytic DBMS or Hadoop cluster as well.
- ELT and ETL are commonly combined into an ETLT strategy. In particular, ETL vendors are trying to subsume the data transformation capabilities of Hadoop.
I’m sure there’s much here to disagree with, and even more to criticize in the way of omissions. Fire away! Just please recall — no market categorization is ever precise.