May 29, 2013

Syncsort extends Hadoop MapReduce

My client Syncsort:

*Perhaps we should question Syncsort’s previous claims of having strong multi-node parallelism already. 🙂

The essence of the Syncsort DMX-h ETL Edition story is:

More details can be found in a slide deck Syncsort graciously allowed me to post.

And just to be clear:

Let’s turn now to the Syncsort Hadoop patch, which:

Both versions of DMX-h depend upon this patch.

The point of the Syncsort Hadoop patch is to let you interrupt Map and Reduce steps at the points where they expect to perform a sort. You may then invoke a different algorithm or program altogether. This offers two kinds of potential benefits:

I am curious as to whether other functionality use cases will emerge.
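The mechanics can be sketched in miniature. Hadoop does expose a `map.sort.class` setting for swapping the map-side sort implementation; the interface below is a simplified stand-in for that kind of hook, not Syncsort's actual patch and not Hadoop's real API.

```java
// A miniature sketch of a pluggable sort hook. Names are illustrative;
// this is not Syncsort's patch or Hadoop's actual IndexedSorter API.
import java.util.Arrays;

interface SortStep {
    void sort(int[] buffer);   // stand-in for a map-side key buffer
}

// The stock behavior: an ordinary comparison sort.
class DefaultSort implements SortStep {
    public void sort(int[] buffer) { Arrays.sort(buffer); }
}

// A replacement engine could spill differently, use another algorithm,
// or skip sorting entirely when the downstream step doesn't need order.
class NoOpSort implements SortStep {
    public void sort(int[] buffer) { /* intentionally leave unsorted */ }
}

public class PluggableSortDemo {
    public static void main(String[] args) {
        // In real Hadoop the implementation would be chosen via
        // configuration; here we just pick one directly.
        SortStep step = new DefaultSort();
        int[] buffer = {5, 3, 9, 1};
        step.sort(buffer);
        System.out.println(Arrays.toString(buffer)); // [1, 3, 5, 9]
    }
}
```

The interesting part is less the sort swap itself than the interruption point: once the framework yields control where it expected to sort, any external engine can be substituted.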


8 Responses to “Syncsort extends Hadoop MapReduce”

  1. aaron on May 31st, 2013 11:33 am

    Syncsort gen1 made an outstanding mainframe sort/merge package. It is also a great preprocessing tool for large extract data sets, such as those feeding a DW, to speed ETL. The world has changed. The goal of packages of this ilk was to optimize RAM use and sequential reads/writes, which are issues that no longer exist.

    ETL tools (sometimes with the exception of Ab Initio) are a confusing product in the current market. They are substantially slower than other programming approaches (again due to fast CPUs with lots of RAM). The pitch gets reduced to:
    – develop using *limited* developers. There is occasional value in non-programmer analysts touching code, but ETL programmers are generally not productive or valuable – this is mostly a management/sales pitch strategy.
    – metadata with lineage. This is generally only useful in projects with under 200 developers with strong management.

    Net-net – Syncsort seems to be jumping from a dead market to a hopeless position in a frozen one.

  2. aaron on May 31st, 2013 11:35 am

    oops – should have said OVER 200 developers

  3. Curt Monash on May 31st, 2013 7:10 pm


    If you’re suggesting that ETL tools can in some cases be like second-rate DBMS, you have a point.

    Still, a large portion of Hadoop use is for what could be called ETL. So the story “Do what you were doing before for the non-relational parts, and we’ll help you with the relational ones” isn’t entirely crazy.

  4. aaron on June 1st, 2013 1:26 am

    My issue is this: now that CPU is so much faster than IO, and plentiful in Hadoop environments, how can there be a product that optimizes sort/merge specifically?

    This seems to be a cure without a disease.

    If this is a specific RDBMS-Hadoop adaptor, I still don’t see the value. If it is pointed RDBMS-to-Hadoop, the RDBMS is likely the bottleneck, so why have a separate tool for this connection? If Hadoop-to-RDBMS, what is the value vs. Sqoop?

    I guess there may be some optimizations, but it seems such a point product that it may be more useful licensed to RDBMS vendors.

    If much of the work is already being done in Java/XML/shell commands, this probably shouldn’t be packaged as an ETL tool, but rather as a superadaptor or an ESB.

  5. Curt Monash on June 1st, 2013 1:36 am


    If you’re saying that nothing is ever CPU-bound any more, I don’t know what you’re talking about.

  6. aaron on June 3rd, 2013 8:54 am

    I think you missed the point. The CPU needs of a sorting algorithm are proportional to the amount of data. Better sorting in large-memory systems is unlikely to reduce IO, but may reduce CPU.

    Historically, sorting was constrained by both IO and CPU. CPU speed/density kept increasing faster than IO, and by ~1990, most sorting became IO bound.

    In shared systems running workloads like Hadoop today, I can’t recall any case where sorting was commonly CPU bound.

    My point is that this is a treatment in need of a disease. If the vendor has a counter argument – I would be interested in hearing it.
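The I/O-vs-CPU arithmetic behind this claim can be put in round numbers. Every rate below is an assumed figure for illustration, not a measurement:

```java
// Back-of-envelope: estimated CPU vs. I/O time to sort 10 GB externally.
// All rates here are assumed round numbers, not measurements.
public class SortBottleneck {
    public static void main(String[] args) {
        double bytes = 10e9;                 // 10 GB of input
        double recordSize = 100;             // assumed bytes per record
        double n = bytes / recordSize;       // 1e8 records
        double comparisons = n * (Math.log(n) / Math.log(2)); // ~n log2 n
        double cmpPerSec = 1e8;              // assumed comparison throughput
        double diskBytesPerSec = 100e6;      // assumed 100 MB/s sequential disk
        double cpuSec = comparisons / cmpPerSec;
        double ioSec = 2 * bytes / diskBytesPerSec; // read once, write once
        System.out.printf("CPU ~%.0f s, I/O ~%.0f s%n", cpuSec, ioSec);
        // With these assumptions I/O dominates (~200 s vs. ~27 s of CPU).
    }
}
```

Change the assumed rates (SSD-class bandwidth, or slower per-comparison cost on fat deserialized records) and the balance can flip, which is roughly the disagreement in this thread.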

  7. Arun C Murthy on June 3rd, 2013 7:05 pm

    I wouldn’t say there is no return on trying to better utilize the CPU on sorts etc.

    There is a tremendous return on better CPU utilization from a datacentre cost perspective (power, heat). Furthermore, modern CPU architecture requires a particular bent of mind for your algorithms, and the returns can be quite stunning:
    MonetDB Paper:

    Apache Tez (a new runtime) incorporates many, many sort optimizations for MR-like applications, interactive SQL queries etc. – again, we see lots of return there.
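As one concrete, generic instance of the kind of CPU-conscious sort optimization being alluded to: a least-significant-digit radix sort replaces branchy per-key comparisons with sequential counting passes that use the memory hierarchy well. This is a textbook sketch, not code from Tez or MonetDB:

```java
// Textbook LSD radix sort on non-negative ints: sequential counting
// passes instead of data-dependent branches. Illustrative only.
import java.util.Arrays;

public class RadixSortDemo {
    static void radixSort(int[] a) {
        int[] src = a, dst = new int[a.length];
        for (int shift = 0; shift < 32; shift += 8) {  // 4 byte-wide passes
            int[] count = new int[257];
            for (int v : src) count[((v >>> shift) & 0xFF) + 1]++;
            for (int i = 0; i < 256; i++) count[i + 1] += count[i];
            for (int v : src) dst[count[(v >>> shift) & 0xFF]++] = v;
            int[] t = src; src = dst; dst = t;         // ping-pong buffers
        }
        // After an even number of passes the sorted data is back in `a`.
    }

    public static void main(String[] args) {
        int[] x = {170, 45, 75, 90, 802, 24, 2, 66};
        radixSort(x);
        System.out.println(Arrays.toString(x)); // [2, 24, 45, 66, 75, 90, 170, 802]
    }
}
```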

    Hope this helps.

  8. aaron on June 4th, 2013 2:10 pm

    Thanks Arun – that is a seminal paper, and my point is also made by the authors there:
    “we point out that the “artificially” high bandwidths generated by MonetDB/MIL make it harder to scale the system to disk-based problems efficiently, simply because memory bandwidth tends to be much larger (and cheaper) than I/O bandwidth.” (And, BTW, sad how little Volcano has been integrated with Hadoop.)

    It is instructive to look at Hadoop system sorting. I am amazed at how little large-scale sorting (even 10GB or more) actually happens in the ones I look at. This is likely relevant to the huge effort going on in relational/Hadoop integration (which may be more targeted at preserving existing license streams than at user value), but good sorting can obviate the need for many of those constructs.

    I’ve been following Tez with interest. DAGs, both for workflow and for data-relationship modeling, and algorithms based on them, are clearly a key path for Hadoop to do unique work.
