Let’s start with some Syncsort basics.
- Syncsort was founded in 1968.
- As you might guess from its name and age, Syncsort started out selling software for IBM mainframes, used for sorting data. However, for the past 30 or so years, Syncsort’s products have gone beyond sort to also do join, aggregation, and merge. This was the basis for Syncsort’s expansion into the more general ETL (Extract/Transform/Load) business.
- As you might further guess, along the way there was a port to UNIX, development of a GUI (Graphical User Interface), and a change of ownership as Syncsort’s founder more or less cashed out.
- At this point, Syncsort sees itself primarily as a data integration/ETL company, whose main claim to fame is performance, with further claims of linear scaling and no manual tuning.*
- One of Syncsort’s favorite value propositions is to contrast the cost of doing ETL in Syncsort, on commodity hardware, to the cost of doing ELT (Extract/Load/Transform) on high-end Teradata gear.
*I forget whether Syncsort actually bothered to say “almost” when making those claims, but one should of course assume the word is in there.
Syncsort general highlights include:
- Syncsort now has about 350 employees and $100 million in revenue.
- Syncsort’s company reboot occurred in April, 2008. Syncsort’s founder was largely bought out by investors, and new management started coming in.
- Syncsort says it has three main businesses:
  - Data protection: this is the smallest Syncsort business, and I didn’t ask further about it.
  - Mainframe sort, apparently now under the product name MFX rather than Syncsort. (However, Syncsort says that for the past 30 years its sort products have done more than sort, specifically also join, aggregation, and merge. That’s the basis for the whole move to ETL.)
  - Data integration, which I think really means “open systems rather than mainframe.” This is the biggest part of the whole.
- Syncsort’s main data integration product is called DMExpress. There also are a bunch of installations of a legacy UNIX product called (you guessed it!) Syncsort.
- There are about 900 DMExpress customers. Syncsort guesses that around 60% of them are using DMExpress for more than just sorting.
- Syncsort cites its main data integration competitors as being, no surprise, Ab Initio (their favorite choice to compare themselves to), Informatica (with PowerCenter), and IBM (with DataStage). Syncsort evidently also sees Talend reasonably often, Pervasive rarely, and expressor never.
The high-level technology picture for Syncsort DMExpress is:
- DMExpress is focused on big-batch loading, not low-latency streaming. In theory one could fire up DMExpress every 10-15 seconds, but Syncsort didn’t make that sound like a common use case.
- Core competencies of DMExpress seem to include sorting, aggregation, joins, merging, compression, and DBMS-style optimization.
- Syncsort asserts that 80% of the cycles in ETL are taken up with sorting and aggregation, and hence that any advantages in performance or scalability DMExpress has in those areas translate to general performance and scalability advantages in ETL.
- Syncsort DMExpress runs on dedicated boxes, with fast direct-attached storage; 15,000 RPM disks are common, and Syncsort wishes it could persuade more of its customers to use solid-state drives instead. Syncsort believes DMExpress is at its most differentiated when buffers overflow and swapping is needed.
- DMExpress compression is just gzip. Syncsort asserts that sorting makes a big difference in gzip’s compression ratio.
- DMExpress is faster on Linux than on UNIX.
- In general, Syncsort makes a big deal out of using as few CPU cycles as possible. (Given the product’s history, that makes sense.) Syncsort’s core performance claim is that DMExpress handles data at close to raw I/O rates.
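The claim that sorting helps gzip is easy to sanity-check. Here is a minimal Python sketch, assuming made-up records (a hypothetical customer ID plus a status field) rather than anything resembling real DMExpress data; it compresses the same records with the standard `gzip` module in arrival order and in sorted order. It illustrates the general effect only, not how DMExpress itself invokes gzip.

```python
import gzip
import random


def gzip_size(lines):
    """Return the gzip-compressed size, in bytes, of the given text lines."""
    return len(gzip.compress("".join(lines).encode("utf-8")))


def make_records(n=50_000, distinct_keys=5_000, seed=0):
    """Build synthetic records; the field layout is an assumption for illustration."""
    rng = random.Random(seed)
    statuses = ["NEW", "SHIPPED", "RETURNED"]
    return [
        f"CUST{rng.randrange(distinct_keys):08d},{rng.choice(statuses)}\n"
        for _ in range(n)
    ]


records = make_records()
unsorted_size = gzip_size(records)        # records in random arrival order
sorted_size = gzip_size(sorted(records))  # same records, sorted first

print(f"unsorted: {unsorted_size} bytes, sorted: {sorted_size} bytes")
```

Sorting puts near-identical records next to each other, so gzip's matcher finds long repeats inside its limited window. How big the difference is will depend heavily on the data.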
Syncsort DMExpress competitive claims include:
- DMExpress supposedly uses only 25% as much CPU as competitors, even Ab Initio.
- DMExpress does direct I/O from disk, with large buffers so that reads can be nicely sequential. Apparently this is called “partition parallelism,” and other ETL vendors do it too. But Syncsort claims differentiation in that this happens automagically.
- Syncsort asserts that managing general parallelism is painful in, say, Informatica PowerCenter. But again, DMExpress does that automagically.
- DMExpress starts one thread per file read in. Syncsort asserts that Ab Initio, by contrast, starts many more processes than that.
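For a rough picture of what "one thread per file read in" could mean, here is a minimal Python sketch. It is emphatically not Syncsort's implementation; the function names (`count_records`, `read_one_thread_per_file`) and the temporary input files are hypothetical, and counting records stands in for whatever work a real ETL tool would do on each file.

```python
import os
import tempfile
import threading


def count_records(path, results, index):
    """Count newline-terminated records in one file; runs in its own thread."""
    with open(path, "r", encoding="utf-8") as handle:
        results[index] = sum(1 for _ in handle)


def read_one_thread_per_file(paths):
    """Start exactly one thread per input file and wait for all of them."""
    results = [0] * len(paths)
    threads = [
        threading.Thread(target=count_records, args=(path, results, i))
        for i, path in enumerate(paths)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Hypothetical demo data: three small temporary files of different lengths.
with tempfile.TemporaryDirectory() as workdir:
    paths = []
    for i, n_records in enumerate([10, 20, 30]):
        path = os.path.join(workdir, f"input_{i}.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.writelines(f"record {j}\n" for j in range(n_records))
        paths.append(path)
    counts = read_one_thread_per_file(paths)

print(counts)
```

The point of the design is that the number of workers is tied to the number of inputs, rather than being a tuning knob somebody has to set by hand.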
Syncsort estimates that one DMExpress customer is loading 1000 records/second/machine on 500 machines, around the clock. That works out to 500,000 records/second, or about 1.8 billion records/hour, which is not implausible given who the customer is. Syncsort also told a story of an unnamed customer for whom Oracle utterly choked on joining 5 tables of 1 terabyte each. (27 days to run with clever workarounds.) DMExpress did the join in 6 hours and the whole load in 15.
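The throughput arithmetic above can be checked in a few lines of Python. The inputs are just Syncsort's estimates as quoted; the variable names are mine.

```python
# Syncsort's estimates, as quoted above.
records_per_second_per_machine = 1_000
machines = 500

# Aggregate rate across the whole cluster.
records_per_second = records_per_second_per_machine * machines

# Scale up to an hour and to a full day of around-the-clock loading.
records_per_hour = records_per_second * 60 * 60
records_per_day = records_per_hour * 24

print(f"{records_per_second:,} records/second")
print(f"{records_per_hour:,} records/hour")
print(f"{records_per_day:,} records/day")
```

So the hourly figure is a bit under 2 billion, and a full day comes to roughly 43 billion records.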
By the way, I gather that Syncsort DMExpress is sometimes nicknamed “DMX”.
Syncsort has become a client since the last time I posted a vendor client list.