October 7, 2012

IBM’s ETL

Bearing in mind the difficulties in covering big companies and their products, I had a call with IBM about its core ETL (Extract/Transform/Load) technology, and have some notes accordingly. It’s pretty reasonable to say that there are and were a Big Three of high-end ETL vendors: IBM, Informatica, and Ab Initio.

However, IBM fondly thinks there is a Big Two, on the theory that Informatica PowerCenter can’t scale as well as IBM and Ab Initio can, and hence gets knocked out of deals when particularly strong scalability and throughput are required.

IBM wanted to follow up about its stance on Hadoop-based ETL, which may roughly be summarized as:

I.e., IBM is in effect saying “Those other guys have to rush to Hadoop so they can do parallel ETL at all, but we don’t have that problem.” Indeed, IBM says that its users today run ETL jobs with 100s of sub-jobs, which might be equivalent to 10s of MapReduce steps.

Unfortunately, time didn’t permit a detailed discussion of the wonders of IBM’s architecture and technology. But some basics include:

All that makes sense by analogy to how scale-out analytic RDBMS are designed.

Relevant history and names seem to include:

Also, it may or may not be interesting to note that:

Comments

7 Responses to “IBM’s ETL”

  1. Geordee on October 7th, 2012 4:08 pm
  2. Curt Monash on October 8th, 2012 3:09 pm

    Geordee,

    That first historical link is outstanding. I didn’t know DataStage was based on Pick, but of course that makes sense given the company history.

    Thanks!

    CAM

  3. Max Splodge on October 9th, 2012 3:49 am

    Curt,

    I love the statement “Unfortunately, time didn’t permit a detailed discussion of the wonders of IBM’s architecture and technology.”

    I assume you chose the words very carefully.

    Max

  4. Mike Beckerle on November 1st, 2012 12:53 pm

    PX stood for “Parallel Extender” originally, which was the way the product family complexity was managed, i.e., you could buy original DataStage, or DataStage with PX for much more money.

    This PX meant the Torrent Orchestrate engine (Orchestrate(TM) was the actual name of the technology), added as an alternative and mostly compatible back-end to DataStage.

    The original DataStage (not PX) technology did evolve out of Pick stuff, but I think that’s really a little too thin a way to think about it. It’s a pretty rich product itself.

    Later I do think PX became the informal IBM term for the scalable backend, or the product with the scalable backend, somewhat ambiguously.

  5. Dave on May 17th, 2014 7:56 am

    Is there any information on the architecture anywhere, particularly on scaling of job processes?

  6. A Modern Data Warehouse Architecture: Part 1 – Add a Data Lake | Database Fog Blog on September 22nd, 2014 9:40 am

    […] Monash on IBM ETL here. […]

  7. A Modern Data Warehouse Architecture Part 1 – Add a Data Lake on October 27th, 2014 3:43 pm

    […] Curt Monash on IBM ETL here. […]
