There have been many recent announcements about how data integration/ETL (Extract/Transform/Load) vendors are going to work with MapReduce. Most of what they say boils down to one or more of a few things:
- Hadoop generally stores data in HDFS (Hadoop Distributed File System). ETL vendors want to be able to extract data from or load it into HDFS.
- ETL vendors have development environments that let you specify/script/whatever ETL jobs. ETL vendors want those same tools to be usable for developing ETL processes that are executed via MapReduce/Hadoop.
- In particular, this allows ETL vendors to exploit the parallel-processing capabilities of MapReduce.
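To make the second and third points concrete, here is a minimal sketch of how a simple ETL step (parse, clean, aggregate) breaks down into map and reduce phases. It is plain Python that only mimics the MapReduce flow in memory; it is not any vendor's generated code and uses no Hadoop API. The record layout (`customer_id,amount`), the sample data, and all function names are assumptions made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical input: raw "customer_id,amount" lines, standing in for a file in HDFS.
RAW_LINES = [
    "c1, 10.50",
    "c2, 3.00",
    "bad line",   # malformed record that the transform step should drop
    "c1, 4.50",
]


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, float]]:
    """Extract + transform: parse each line, skip bad records, emit (key, value)."""
    for line in lines:
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2:
            continue  # not exactly two fields
        key, amount = parts
        try:
            yield key, float(amount)
        except ValueError:
            continue  # amount is not numeric


def shuffle(pairs: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Group values by key, as the framework would do between map and reduce."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups: Dict[str, List[float]]) -> Dict[str, float]:
    """Aggregate per key; the result is what would be loaded into the target."""
    return {key: sum(values) for key, values in groups.items()}


def run_etl(lines: Iterable[str]) -> Dict[str, float]:
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    print(run_etl(RAW_LINES))
```

Because each input line is mapped independently and each key is reduced independently, the same logic can in principle be spread across many nodes, which is the parallelism the vendors are after.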
Some additional twists include:
- Pentaho announced business intelligence and ETL for Hadoop last year.
- Syncsort thinks different sort algorithms should be usable with Hadoop. Consequently, it plans to contribute technology to the community to make sort pluggable into Hadoop. (However, Syncsort is keeping its own sort technology proprietary.)
- Syncsort is considering replicating some Hive functionality, starting with joins, which it hopes to run much faster. (However, Syncsort’s basic Hadoop support is a quarter or three away, so any more advanced functionality would probably come out in 2012 or beyond.)
- SnapLogic fondly thinks that its generation of MapReduce jobs is particularly intelligent.
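As for what "pluggable sort" in the Syncsort item could mean in principle, here is a rough sketch: the sort step takes the sorting routine as a parameter, so an alternative implementation can be swapped in without touching the rest of the job. This is plain Python, not Hadoop's actual interface and not Syncsort's technology; every name in it is hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, int]
# A "sorter" is any function that takes key/value pairs and returns them ordered.
Sorter = Callable[[List[Pair]], List[Pair]]


def default_sorter(pairs: List[Pair]) -> List[Pair]:
    """Stand-in for a framework's built-in sort: ascending by key."""
    return sorted(pairs, key=lambda kv: kv[0])


def descending_sorter(pairs: List[Pair]) -> List[Pair]:
    """Stand-in for a replacement sort supplied from outside the framework."""
    return sorted(pairs, key=lambda kv: kv[0], reverse=True)


def sort_phase(pairs: Iterable[Pair], sorter: Sorter = default_sorter) -> List[Pair]:
    """The sort step between map and reduce, with the sort routine pluggable."""
    return sorter(list(pairs))


if __name__ == "__main__":
    data = [("b", 1), ("a", 2), ("c", 3)]
    print(sort_phase(data))                     # built-in ordering
    print(sort_phase(data, descending_sorter))  # plugged-in ordering
```

The point of such a design is that the plug-in boundary can be open while any particular sorter behind it stays proprietary.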
Finally, my former clients at Pervasive, who haven’t briefed me for a while, seem to have told Doug Henschen that they have pointed DataRush at MapReduce.* However, I couldn’t find evidence of that on the Pervasive DataRush website, beyond some help in using all the cores on any one Hadoop node.
*Also see that article because it names a bunch of ETL vendors doing Hadoop-related things.