August 26, 2008

Three approaches to parallelizing data transformation

Many MPP data warehousing vendors have told me their products are used for ELT (Extract/Load/Transform) instead of ETL (Extract/Transform/Load). I.e., needed data transformations are done on the MPP system, rather than on the — probably SMP — system the data comes from.* If the data transformation is being applied on a record-by-record basis, then it’s automatically fully parallelized. Even if the transforms are more complex, considerable parallel processing may still be going on.

*Or it’s some of each, at which point it’s called ETLT — I bet you can work out what that stands for.
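To make the record-by-record point concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than any vendor's API: the `transform` rule, the field names, and the use of a thread pool to stand in for the nodes of an MPP system. The point is only that when each output row depends on exactly one input row, the rows can be split across nodes any way you like and the result is the same.

```python
from concurrent.futures import ThreadPoolExecutor


def transform(record):
    """Hypothetical per-record transform: tidy a country code and
    convert an amount in cents to a decimal amount."""
    return {
        "id": record["id"],
        "country": record["country"].strip().upper(),
        "amount": record["amount_cents"] / 100,
    }


def partition(records, n_nodes):
    """Hash-distribute records across n_nodes, roughly as an MPP load might."""
    parts = [[] for _ in range(n_nodes)]
    for rec in records:
        parts[hash(rec["id"]) % n_nodes].append(rec)
    return parts


def transform_partition(part):
    # Each "node" touches only its own records; nothing moves between nodes.
    return [transform(rec) for rec in part]


def parallel_transform(records, n_nodes=4):
    """Run the per-record transform on every partition concurrently."""
    parts = partition(records, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        results = pool.map(transform_partition, parts)
    return [rec for part in results for rec in part]


if __name__ == "__main__":
    loaded = [
        {"id": 1, "country": " us ", "amount_cents": 1250},
        {"id": 2, "country": "de", "amount_cents": 990},
        {"id": 3, "country": "Fr ", "amount_cents": 40},
    ]
    for row in sorted(parallel_transform(loaded, n_nodes=2), key=lambda r: r["id"]):
        print(row)
```

Transforms that need to see several rows at once (lookups, deduplication, aggregation) are the "more complex" case: they require data to be redistributed between nodes first, which is why they parallelize less automatically.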

But depending on your needs, at least two other approaches to data transformation parallelization could also be considered. The second comes from Pervasive Software, which has a big data integration software business of its own and built a new ETL tool. The foundation was a middle-tier, multi-core-friendly Java dataflow engine, which has now been split out as Pervasive DataRush. The product is in the early stages of being released, which may be a good excuse for the website confusingly suggesting both of:

The third approach is my Subject Of The Week: MapReduce. When I posted a list of canonical MapReduce applications, my friends at Aster Data offered one pushback — I left out the area of data transformation. As CEO Mayank Bawa puts it:

Large-scale transformations can be parameterized as SQL/MR functions for data cleansing and standardization, unleashing the true potential for Extract-Load-Transform pipelines and making large-scale data model normalization feasible. Push down also enables rapid discovery and data pre-processing to create analytical data sets used for advanced analytics such as SAS and SPSS.
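As a rough illustration of what a cleansing-and-standardization job looks like in MapReduce terms, here is a small in-memory Python sketch. It is not Aster's SQL/MR syntax, and the record layout, `map_clean`, `reduce_dedupe`, and the tiny `map_reduce` driver are all hypothetical names made up for the example. The map step standardizes each record independently; the reduce step resolves duplicates among records that share a key.

```python
from collections import defaultdict


def map_clean(record):
    """Map step: standardize one raw record and emit (key, cleaned_record)."""
    email = record["email"].strip().lower()
    name = " ".join(record["name"].split()).title()
    yield email, {"email": email, "name": name, "updated": record["updated"]}


def reduce_dedupe(key, records):
    """Reduce step: keep the most recently updated record for each key."""
    return max(records, key=lambda r: r["updated"])


def map_reduce(records, mapper, reducer):
    """Toy single-process driver; a real engine runs both phases in parallel."""
    groups = defaultdict(list)
    for rec in records:                      # map phase: one record at a time
        for key, value in mapper(rec):
            groups[key].append(value)        # shuffle: group values by key
    # reduce phase: one key at a time, independent of all other keys
    return [reducer(key, values) for key, values in sorted(groups.items())]


if __name__ == "__main__":
    raw = [
        {"email": " Ann@Example.com ", "name": "ann  lee", "updated": "2008-08-01"},
        {"email": "ann@example.com", "name": "Ann Lee-Smith", "updated": "2008-08-20"},
        {"email": "bob@example.com", "name": "BOB  ray", "updated": "2008-07-15"},
    ]
    for row in map_reduce(raw, map_clean, reduce_dedupe):
        print(row)
```

The map phase is the record-by-record case again, and the reduce phase parallelizes by key, so both halves can be spread across many nodes.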

Some of our recent links about MapReduce

Comments

8 Responses to “Three approaches to parallelizing data transformation”

  1. MapReduce links | DBMS2 -- DataBase Management System Services on August 27th, 2008 5:20 am

    [...] Another application of MapReduce [...]

  2. Yves de Montcheuil on August 28th, 2008 4:24 am

    Curt, on this topic, I would like to point you to Talend, the first open source data integration software. Talend Open Studio is also the first solution to support both the ETL and ELT approaches natively – and of course the ETLT approach as well.
    Unlike tools like Sunopsis (now Oracle Data Integrator), arguably the pioneer of ELT, and engine-based tools such as Informatica or DataStage that support only ETL (ELT is only an afterthought), Talend supports both approaches natively, always providing the best performance.
    More info on Talend Open Studio: http://www.talend.com/products-data-integration/talend-open-studio.php

    Yves @ Talend

  3. Luke Lonergan on August 28th, 2008 1:15 pm

    Talend is good stuff – they have a very active development team on the forefront of ELT pushdowns. Two thumbs up!

  4. Why MapReduce matters to SQL data warehousing | DBMS2 -- DataBase Management System Services on August 28th, 2008 2:45 pm

    [...] Another application of MapReduce [...]

  5. Infology.Ru » Blog Archive » Three approaches to parallelizing the data transformation process on September 29th, 2008 5:22 pm

    [...] Author: Curt Monash Original publication date: 2008-08-26 Translation: Олег Кузьменко Source: Curt Monash's blog [...]

  6. Infology.Ru » Blog Archive » Why is MapReduce so important for data warehouses? on October 5th, 2008 2:59 am

    [...] Another application of MapReduce [...]

  7. Pervasive DataRush | DBMS2 -- DataBase Management System Services on January 7th, 2009 9:21 pm

    [...] made a few references to Pervasive DataRush in the past — like this one — but I’ve never gotten around to seriously writing it up.   I’ll now try to [...]

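  8. ETL and parallel execution | Alex's personal Blog on January 13th, 2009 1:03 am

    [...] can be scaled easily. And in another article I read that Pervasive Software also offers a commercial programmatic ETL API that makes it easy to run ETL tasks in parallel, [...]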
