My client Syncsort:
- Is an ETL (Extract/Transform/Load) vendor, whose flagship product DMExpress was evidently renamed to DMX.
- Has a strong history in and fondness for sort.
- Has announced a new ETL product, DMX-h ETL Edition, which uses Hadoop MapReduce to parallelize DMX by controlling a copy of DMX that resides on every data node of the Hadoop cluster.*
- Has also announced the closely related DMX-h Sort Edition, offering acceleration for the sorts inherent in Map and Reduce steps.
- Contributed a patch to Apache Hadoop to open up Hadoop MapReduce to make all this possible.
*Perhaps we should question Syncsort’s previous claims of having strong multi-node parallelism already.
The essence of the Syncsort DMX-h ETL Edition story is:
- DMX-h inherits the various ETL-suite trappings of DMX.
- Syncsort claims DMX-h has major performance advantages vs., for example, Hive- or Pig-based alternatives.
- With a copy of DMX on every node, DMX-h can do parallel load/export.
More details can be found in a slide deck Syncsort graciously allowed me to post.
And just to be clear:
- Syncsort DMX-h ETL is not focused on getting data in and out of HDFS (Hadoop Distributed File System). Rather, it uses Hadoop to support generic ETL.
- Syncsort DMX-h ETL does not invoke Hive or Pig. Rather, it’s based on Java call-outs to DMX.
Let’s turn now to the Syncsort Hadoop patch, which:
- Was primarily designed for Hadoop 2, and was adopted into Hadoop 2.0.3 in January.
- Has been backported to work with Hadoop 1 and now ships with the Cloudera Hadoop distribution.
- Is what Syncsort somewhat confusingly calls “pluggable sort”.
Both versions of DMX-h depend upon this patch.
The point of the Syncsort Hadoop patch is to let you interrupt Map and Reduce steps at the points where they expect to perform a sort. You may then invoke a different algorithm or program altogether. This offers two kinds of potential benefits:
- Performance, for example via:
  - An alternative sort algorithm, e.g. Syncsort’s. (This is the idea behind DMX-h Sort Edition.)
  - Not doing a (full) sort at all, but rather returning for example:
    - Top N results
    - A count
- Functionality, for example via the various ETL capabilities of DMX, which is of course the idea behind DMX-h ETL Edition.
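To make the "interrupt the sort" idea concrete, here is a minimal Python sketch. It is not Hadoop's or Syncsort's actual API; every name in it (`run_map_step`, `full_sort`, `top_n`, `count_only`) is hypothetical and exists only for illustration. It models a map step whose sort phase is a swappable function, so the default full sort can be replaced by something cheaper such as a top-N selection or a plain count.

```python
import heapq

# Hypothetical stand-ins for illustration only; none of these names
# come from Hadoop or from Syncsort DMX-h.

def full_sort(pairs):
    """Default step: sort every (key, value) pair by key."""
    return sorted(pairs, key=lambda kv: kv[0])

def top_n(n):
    """Build a plug-in that keeps only the n largest keys (no full sort)."""
    def plugin(pairs):
        # heapq.nlargest returns results in descending key order.
        return heapq.nlargest(n, pairs, key=lambda kv: kv[0])
    return plugin

def count_only(pairs):
    """Plug-in that skips sorting entirely and returns a record count."""
    return [("count", sum(1 for _ in pairs))]

def run_map_step(records, map_fn, sort_plugin=full_sort):
    """Apply map_fn to each record, then hand the output to the plugged-in step."""
    mapped = [kv for record in records for kv in map_fn(record)]
    return sort_plugin(mapped)

def parse(line):
    """Toy map function: turn 'name score' into a (score, name) pair."""
    name, score = line.split()
    yield int(score), name

records = ["b 3", "a 7", "d 1", "c 9"]

print(run_map_step(records, parse))              # full sort
print(run_map_step(records, parse, top_n(2)))    # top 2 only
print(run_map_step(records, parse, count_only))  # just a count
```

In this toy version the caller chooses the plug-in per job, which mirrors the two performance cases above: substitute a different sort, or avoid the full sort when only a top-N list or a count is needed.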
I am curious as to whether other functionality use cases will emerge.