Bearing in mind the difficulties in covering big companies and their products, I had a call with IBM about its core ETL technology (Extract/Transform/Load), and have some notes accordingly. It’s pretty reasonable to say that there are and were a Big Three of high-end ETL vendors:
- Ab Initio
- IBM (InfoSphere DataStage)
- Informatica (PowerCenter)
However, IBM fondly thinks there are a Big Two, on the theory that Informatica PowerCenter can’t scale as well as IBM and Ab Initio can, and hence gets knocked out of deals when particularly strong scalability and throughput are required.
IBM wanted to follow up about its stance on Hadoop-based ETL, which may roughly be summarized as:
- Obviously, HDFS (Hadoop Distributed File System) is an increasingly important target for ETL.
- Also, there are some workloads for which ETL and Hadoop-based analytic processing are so interwoven that ETL — or rather ELT/ETLT — should be done on Hadoop.
- For that reason alone, it makes sense to support Hadoop as an execution engine, much as other vendors do and Informatica will.
- However, IBM is in no rush to offer such support, because IBM’s own ETL engine has great parallel performance as it stands.
I.e., IBM is in effect saying “Those other guys have to rush to Hadoop so they can do parallel ETL at all, but we don’t have that problem.” Indeed, IBM says that its users today run ETL jobs with 100s of sub-jobs, which might be equivalent to 10s of MapReduce steps.
Unfortunately, time didn’t permit a detailed discussion of the wonders of IBM’s architecture and technology. But some basics include:
- There’s a coordinator process that breaks a job into a set of sub-jobs/sub-flows, and perhaps compiles it along the way.
- But there’s no head node; data is piped directly from execution node to execution node as needed.
- Data tends to be distributed among nodes in line with the next join key.
- The system tries not to spill intermediate results back to disk. Examples of when that might not quite work out include:
- Cases where different parts of the process finish at significantly different times.
All that makes sense by analogy to how scale-out analytic RDBMS are designed.
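To make the "distribute by the next join key" point concrete, here is a minimal Python sketch of the idea. It is purely illustrative, not IBM's actual engine or API: all names (`partition_by_join_key`, `local_join`, the toy rows) are hypothetical, and the hash-partition-then-join-locally pattern is just the generic scale-out technique the bullets above describe.

```python
from collections import defaultdict

def partition_by_join_key(rows, key, num_nodes):
    """Hash-partition rows so every row sharing a join key lands on the
    same (hypothetical) execution node's partition."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(row[key]) % num_nodes].append(row)
    return partitions

def local_join(left_rows, right_rows, key):
    """Join two co-partitioned row sets on one node. No cross-node
    traffic is needed, because matching keys were routed together."""
    index = defaultdict(list)
    for r in right_rows:
        index[r[key]].append(r)
    return [{**l, **r} for l in left_rows for r in index[l[key]]]

# Toy data: orders to be joined to customers on "cust".
orders = [{"cust": 1, "amt": 10}, {"cust": 2, "amt": 20}, {"cust": 1, "amt": 5}]
custs  = [{"cust": 1, "name": "A"}, {"cust": 2, "name": "B"}]

NUM_NODES = 4
order_parts = partition_by_join_key(orders, "cust", NUM_NODES)
cust_parts  = partition_by_join_key(custs, "cust", NUM_NODES)

# Each node joins only its own partitions; a real engine would run
# these in parallel and pipe results straight to the next stage
# rather than spilling them to disk.
joined = []
for node in range(NUM_NODES):
    joined += local_join(order_parts[node], cust_parts[node], "cust")
```

The design point is the same one the bullets make: if each stage's output is already distributed on the key the next stage will join on, the join is purely node-local, which is exactly why skipping that co-distribution (as naive MapReduce ETL can) costs extra shuffle steps.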
Relevant history and names seem to include:
- IBM InfoSphere Information Server, or something like that.
- Ascential, a data integration company IBM bought some years ago.
- DataStage, Ascential’s main product name.
- Torrent, a company Ascential bought 6-7 years ago, which provided the architecture for what IBM sells in most new ETL deals.
- PX, which I gather is the name of that architecture.
Also, it may or may not be interesting to note that:
- When independent, Ascential was juggling several different ETL engines, due to acquisition or whatever.
- With various company name changes, Ascential more or less spun into and then back out of Informix.
- IBM acquired Informix and then Ascential in two separate and apparently unrelated deals.