October 7, 2012

IBM’s ETL

Bearing in mind the difficulties in covering big companies and their products, I had a call with IBM about its core ETL technology (Extract/Transform/Load), and have some notes accordingly. It’s pretty reasonable to say that there are and were a Big Three of high-end ETL vendors:

Informatica
IBM/Ascential
Ab Initio

However, IBM fondly thinks there are a Big Two, on the theory that Informatica PowerCenter can’t scale as well as IBM and Ab Initio can, and hence gets knocked out of deals when particularly strong scalability and throughput are required.

IBM wanted to follow up about its stance on Hadoop-based ETL, which may roughly be summarized as:

Obviously, HDFS (Hadoop Distributed File System) is an increasingly important target for ETL.
Also, there are some workloads for which ETL and Hadoop-based analytic processing are so interwoven that ETL — or rather ELT/ETLT — should be done on Hadoop.
For that reason alone, it makes sense to support Hadoop as an execution engine, much as other vendors do and Informatica will.
However, IBM is in no rush to offer such support, because IBM’s own ETL engine has great parallel performance as it stands.

I.e., IBM is in effect saying “Those other guys have to rush to Hadoop so they can do parallel ETL at all, but we don’t have that problem.” Indeed, IBM says that its users today run ETL jobs with 100s of sub-jobs, which might be equivalent to 10s of MapReduce steps.

Unfortunately, time didn’t permit a detailed discussion of the wonders of IBM’s architecture and technology. But some basics include:

There’s a coordinator process that breaks and perhaps compiles a job into a set of sub-jobs/sub-flows.
But there’s no head node; data is piped directly from execution node to execution node as needed.
Data tends to be distributed among nodes in line with the next join key.
The system tries not to spill intermediate results back to disk. Examples of when that might not quite work out include:
- Sorts.
- Cases where different parts of the process finish at significantly different times.

All that makes sense by analogy to how scale-out analytic RDBMS are designed.
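
To make the "distribute by the next join key" idea concrete, here is a minimal Python sketch. It is not IBM's code or API; the function names, the CRC32 hash, and the sample data are all assumptions chosen for illustration. It only shows the general scale-out technique: hash both inputs on the join key so that matching rows land on the same node, and then each node can join its own slice with no central head node involved.

```python
from collections import defaultdict
from zlib import crc32


def partition_by_key(rows, key, num_nodes):
    """Assign each row to a node by hashing its join key (hypothetical helper)."""
    partitions = defaultdict(list)
    for row in rows:
        # A stable hash, so the same key always maps to the same node.
        node = crc32(str(row[key]).encode("utf-8")) % num_nodes
        partitions[node].append(row)
    return partitions


def local_join(left_rows, right_rows, key):
    """Plain in-memory hash join of two row lists that sit on the same node."""
    index = defaultdict(list)
    for row in right_rows:
        index[row[key]].append(row)
    return [{**l, **r} for l in left_rows for r in index.get(l[key], [])]


def partitioned_join(left, right, key, num_nodes):
    """Repartition both inputs on the join key, then join node by node."""
    left_parts = partition_by_key(left, key, num_nodes)
    right_parts = partition_by_key(right, key, num_nodes)
    joined = []
    for node in range(num_nodes):
        # Each iteration stands in for work that a separate node would do.
        joined.extend(
            local_join(left_parts.get(node, []), right_parts.get(node, []), key)
        )
    return joined


# Made-up sample data, purely for illustration.
orders = [
    {"cust_id": 1, "amount": 50},
    {"cust_id": 2, "amount": 75},
    {"cust_id": 1, "amount": 20},
]
customers = [{"cust_id": 1, "name": "Acme"}, {"cust_id": 2, "name": "Globex"}]
result = partitioned_join(orders, customers, "cust_id", num_nodes=4)
```

In a real engine the partitions would be streamed between nodes rather than collected into lists, which is where the desire to avoid spilling to disk comes in; the sketch leaves that part out.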

Relevant history and names seem to include:

IBM InfoSphere Information Server, or something like that.
Ascential, a data integration company IBM bought some years ago.
DataStage, Ascential’s main product name.
Torrent, a company Ascential bought in 2001, which provided the architecture for what IBM sells in most new ETL deals.
PX, which I gather is the name of that architecture.

Also, it may or may not be interesting to note that:

When independent, Ascential was juggling several different ETL engines, due to acquisition or whatever.
With various company name changes, Ascential more or less spun into and then back out of Informix.
IBM acquired Informix and then Ascential in two separate and apparently unrelated deals.

Categories: EAI, EII, ETL, ELT, ETLT, MapReduce, Parallelization

Comments

7 Responses to “IBM’s ETL”

Geordee on October 7th, 2012 4:08 pm

Some links to the history

http://it.toolbox.com/blogs/infosphere/lee-scheffler-interview-the-ghost-of-datastage-past-8819

And about Orchestrate
http://www.pr3systems.com/Operator_Combination_and_Control.pdf

Curt Monash on October 8th, 2012 3:09 pm

Geordee,

That first historical link is outstanding. I didn’t know DataStage was based on Pick, but of course that makes sense given the company history.

Thanks!

CAM

Max Splodge on October 9th, 2012 3:49 am

Curt,

I love the statement “Unfortunately, time didn’t permit a detailed discussion of the wonders of IBM’s architecture and technology.”

I assume you chose the words very carefully.

Max

Mike Beckerle on November 1st, 2012 12:53 pm

PX stood for “Parallel Extender” originally, which was the way the product family complexity was managed, i.e., you could buy original DataStage, or DataStage with PX for much more money.

This PX meant the Torrent Orchestrate engine (Orchestrate(TM) was the actual name of the technology), added as an alternative and mostly-compatible back-end to DataStage.

The original DataStage (not PX) technology did evolve out of Pick stuff, but I think that’s really too thin a way to think about it. It’s a pretty rich product itself.

Later I do think PX became the informal IBM term for the scalable backend, or the product with the scalable backend, somewhat ambiguously.

Dave on May 17th, 2014 7:56 am

Is there any information on the architecture anywhere, particularly on scaling of job processes?

A Modern Data Warehouse Architecture: Part 1 – Add a Data Lake | Database Fog Blog on September 22nd, 2014 9:40 am

[…] Monash on IBM ETL here. […]

A Modern Data Warehouse Architecture Part 1 – Add a Data Lake on October 27th, 2014 3:43 pm

[…] Curt Monash on IBM ETL here. […]
