Many MPP data warehousing vendors have told me their products are used for ELT (Extract/Load/Transform) instead of ETL (Extract/Transform/Load). I.e., needed data transformations are done on the MPP system, rather than on the (probably SMP) system the data comes from.* If the data transformation is applied on a record-by-record basis, then it’s automatically fully parallelized, because each record can be transformed without reference to any other, so every node can simply work through its own share of the rows. Even if the transforms are more complex, considerable parallel processing may still be going on.
*Or it’s some of each, at which point it’s called ETLT — I bet you can work out what that stands for.
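To make the record-by-record point concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the record layout, the `transform` function, and the round-robin `partition` helper are hypothetical, and a thread pool merely stands in for the nodes of an MPP system (CPython threads will not deliver real CPU speedup). The only thing it is meant to show is that when a transform looks at one record at a time, splitting the rows across workers gives the same answer as doing the job serially.

```python
from concurrent.futures import ThreadPoolExecutor


def transform(record):
    """Hypothetical per-record cleanup: tidy the name, convert cents to dollars."""
    return {
        "id": record["id"],
        "name": record["name"].strip().upper(),
        "amount": record["amount_cents"] / 100,
    }


def partition(records, n_workers):
    """Round-robin split, standing in for how rows are spread across nodes."""
    return [records[i::n_workers] for i in range(n_workers)]


def transform_partition(rows):
    """Each worker transforms only its own rows; no coordination is needed."""
    return [transform(r) for r in rows]


def parallel_transform(records, n_workers=4):
    parts = partition(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(transform_partition, parts)
    merged = [row for part in results for row in part]
    # Restore the original order so the result can be compared with a serial run.
    return sorted(merged, key=lambda r: r["id"])


records = [
    {"id": i, "name": f"  customer {i} ", "amount_cents": 100 * i}
    for i in range(10)
]
serial = [transform(r) for r in records]
parallel = parallel_transform(records)
print(parallel == serial)
```

Transforms that need to see more than one record at a time (deduplication, lookups against other tables) do not split this cleanly, which is where the "more complex" case comes in.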
But depending on your needs, at least two other approaches to parallelizing data transformation could also be considered. One of them comes from Pervasive Software, which has a big data integration software business of its own and built a new ETL tool. The foundation was a middle-tier multi-core-friendly Java dataflow engine, which has now been split out as Pervasive Datarush. The product is in the early stages of being released, which may be a good excuse for the website confusingly suggesting both of:
- You can have Datarush for free.
- If Datarush doesn’t produce a 30X speedup for you, you can get your money back.
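For readers who haven't met the term, a dataflow engine chains processing stages together so that each stage can run on its own core while records stream between them. The sketch below is a generic illustration of that idea and is not Datarush's API, which I am not reproducing here. It is in Python rather than Java, and the `stage` and `run_pipeline` helpers are hypothetical names of my own.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream


def stage(func, inbox, outbox):
    """Run one pipeline stage: apply func to each item until the stream ends."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # pass the end-of-stream marker downstream
            return
        outbox.put(func(item))


def run_pipeline(items, funcs):
    """Connect one thread per stage with FIFO queues and collect the output."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(SENTINEL)

    results = []
    while True:
        item = queues[-1].get()
        if item is SENTINEL:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


# Three hypothetical stages: trim, split into fields, standardize types.
stages = [
    str.strip,
    lambda line: line.split(","),
    lambda parts: (parts[0].upper(), int(parts[1])),
]
output = run_pipeline(["  a,1 ", "b,2", " c,3"], stages)
print(output)
```

Because each stage is a single thread reading from a FIFO queue, records come out in the order they went in. A real engine would presumably also run several copies of a slow stage side by side, which this sketch does not attempt.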
The third approach is my Subject Of The Week: MapReduce. When I posted a list of canonical MapReduce applications, my friends at Aster Data offered one pushback — I left out the area of data transformation. As CEO Mayank Bawa puts it:
“Large-scale transformations can be parameterized as SQL/MR functions for data cleansing and standardization, unleashing the true potential for Extract-Load-Transform pipelines and making large-scale data model normalization feasible. Push down also enables rapid discovery and data pre-processing to create analytical data sets used for advanced analytics such as SAS and SPSS.”
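As a rough illustration of what cleansing and standardization look like when cast as MapReduce, here is a small in-memory sketch in Python. It is not Aster's SQL/MR syntax, and the field names, the sample records, and the rule of keeping the most recently updated version are all assumptions made up for the example. The map step standardizes each record on its own and emits it under a normalized key. The reduce step then sees every record that shares a key and collapses the duplicates.

```python
from collections import defaultdict


def map_record(raw):
    """Map: standardize one raw record and emit it under a normalized key."""
    email = raw["email"].strip().lower()
    name = " ".join(raw["name"].split()).title()
    yield email, {"name": name, "updated": raw["updated"]}


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_records(key, values):
    """Reduce: keep the most recently updated version of each customer."""
    latest = max(values, key=lambda v: v["updated"])  # ISO dates sort as text
    return {"email": key, "name": latest["name"], "updated": latest["updated"]}


def map_reduce(raw_records):
    pairs = (pair for raw in raw_records for pair in map_record(raw))
    groups = shuffle(pairs)
    return [reduce_records(key, groups[key]) for key in sorted(groups)]


raw_records = [
    {"email": " Ann@Example.com", "name": "ann  lee", "updated": "2009-01-05"},
    {"email": "ann@example.com ", "name": "Ann Lee-Smith", "updated": "2009-03-10"},
    {"email": "BOB@example.com", "name": " bob  ray ", "updated": "2009-02-01"},
]
clean = map_reduce(raw_records)
for row in clean:
    print(row)
```

The map step is the record-by-record case again and parallelizes trivially. The reduce step parallelizes by key, since each group of duplicates can be resolved without looking at any other group.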
Some of our recent links about MapReduce
- The integration of MapReduce with SQL data warehousing
- Three major applications of MapReduce
- Sound bites about MapReduce
- Other links about MapReduce