Syncsort and Vertica teamed up to devise and run a benchmark in which a data warehouse was loaded at 5.5 terabytes per hour, several times faster than the figures in any other vendor's similar press releases to date. Takeaways include:
- Syncsort isn’t just a mainframe sort utility company, but also does data integration. Who knew?
- Vertica’s design for overcoming the traditionally slow load speeds of columnar DBMS works.
The latter is unsurprising. Back in February, I wrote at length about how Vertica makes rapid columnar updates. I don’t have a lot of subsequent new detail, but it made sense then and now.
As is always the case in private benchmarks – or indeed audited public ones – the specific numbers shouldn’t be taken too seriously. Back in February, Vertica talked of sustainable load speeds of 3-5 megabytes/second. That’s around 10-18 gigabytes/hour. Multiply by 16 for the number of nodes in this latest benchmark, tack on the factor of your choice for better hardware, and you still come out more than an order of magnitude away from the 5+TB/hour figure. (Edit: Omer Trajman clarifies in the comments that this comes from the difference between 1 stream of trickle feed and 8 streams of bulk load. Bulk rates are about 3X trickle rates, per stream, and the benchmark was done at 8 streams/node.)
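The reconciliation in that parenthetical edit is easy to check with back-of-the-envelope arithmetic. The sketch below just multiplies out the figures quoted above (per-stream trickle rate, node count, streams per node, 3X bulk multiplier); it is a sanity check, not Vertica's published math:

```python
# Sanity-check: do 8 bulk streams/node at ~3X the trickle rate,
# across 16 nodes, get you into 5+ TB/hour territory?
trickle_mb_per_sec = (3, 5)   # Vertica's quoted sustainable trickle rate, per stream
nodes = 16                    # nodes in this benchmark
streams_per_node = 8          # bulk-load streams per node (per Omer Trajman)
bulk_multiplier = 3           # bulk rate is roughly 3X trickle rate, per stream

for mb_s in trickle_mb_per_sec:
    gb_per_hour_per_stream = mb_s * 3600 / 1000   # ~10.8 to 18 GB/hour, as stated above
    tb_per_hour = (gb_per_hour_per_stream * bulk_multiplier
                   * streams_per_node * nodes) / 1000
    print(f"{mb_s} MB/s trickle -> {tb_per_hour:.1f} TB/hour aggregate bulk")
# prints 4.1 and 6.9 TB/hour, which brackets the 5.5 TB/hour benchmark result
```

So the benchmark number falls right inside the range the per-stream figures imply, before even crediting better hardware.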
But all trailing zeros aside, I’m hearing more and more about data warehouse load speeds these days. E.g., data integration startup Expressor Software hopes fast loading will be one of its claims to fame. Aster Data’s architecture includes dedicated bulk load nodes. Kickfire bragged back in April of 100 GB/hour load speeds. Many other database vendors emphasize bulk and/or incremental load speed as well.
There are at least two main reasons for this emphasis on data warehouse load speed. First, increasingly many data warehouse use cases are 24/7, making batch windows problematic. Examples include call centers, website personalization, and manufacturing, plus any kind of global enterprise or global SaaS operation. Second, there are a lot of use cases where, 24/7-ness even aside, data comes in fast. Clickstream/network event data is the most obvious example. Telecom is another biggie. And all meltdowns aside, there is quite a lot of database use — including columnar databases — in the financial trading “tick store” market.
All that said — I think most commercial data warehouse DBMS will provide most users with much more load speed than they actually need. More important in most cases will be the performance overhead created by the loading, as well as the load speed and hardware utilization of the data integration middleware itself. But at some point in the technology stack, data warehouse load speed is an increasingly non-trivial subject.