October 5, 2014

Spark vs. Tez, revisited

I’m on record as noting and agreeing with an industry near-consensus that Spark, rather than Tez, will be the replacement for Hadoop MapReduce. I presumed that Hortonworks, which is pushing Tez, disagreed. But Shaun Connolly of Hortonworks suggested a more nuanced view. Specifically, Shaun tweeted thoughts including:

Tez vs Spark = Apples vs Oranges.

Spark is general-purpose engine with elegant APIs for app devs creating modern data-driven apps, analytics, and ML algos.

Tez is a framework for expressing purpose-built YARN-based DAGs; its APIs are for ISVs & engine/tool builders who embed it

[For example], Hive embeds Tez to convert its SQL needs into purpose-built DAGs expressed optimally and leveraging YARN

That said, I haven’t yet had a chance to understand what advantages Tez might have over Spark in the use cases that Shaun relegates it to.
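
For readers unfamiliar with the term, a "DAG" here is just a set of processing stages plus the dependencies between them. The toy sketch below shows the idea in plain Python; it is not the Tez or Spark API, and the stage names are hypothetical, chosen only to resemble what a SQL engine might generate for a join followed by an aggregation.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical stages for a join-then-aggregate query.
# Each key maps to the set of stages it depends on.
dag = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
    "write_output": {"aggregate"},
}


def execution_order(graph):
    """Return one valid order in which the stages could run."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(dag)
print(order)
```

Any order that runs both scans before the join, and the join before the aggregation, is valid. A real engine would also decide where each stage runs and how data moves between stages, which this sketch ignores entirely.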

6 Responses to “Spark vs. Tez, revisited”

  1. Sandy Ryza on October 5th, 2014 9:31 pm

    I think a fairer comparison between Spark and Tez is probably Apples vs. Apple cores. Spark offers a superset of Tez’s functionality. Tez is an engine for distributing data-parallel computation over lots of computers. Spark is also this, but includes an elegant API on top, as well as a distributed memory abstraction that allows caching data across the cluster. They both relate to YARN in exactly the same way: they use it to deploy their bits and schedule work on the cluster.

  2. Sandy Ryza on October 5th, 2014 9:33 pm

    More details at http://qr.ae/JcBMm

  3. David Gruzman on October 8th, 2014 4:02 am

    I agree with Sandy that Spark and Tez both implement the same abstraction. I see two different usage cases, each with different success factors.
    The first is interactive use by knowledge workers. Here I think Spark will win, since its usability is fantastic. The in-memory abstraction is also nice here, since data volumes are frequently modest.

    The second is as an execution engine for SQL processing. Usability is not a factor here. The factors are the robustness and performance of the shuffle… I cannot tell yet which is better from this perspective…
    Regarding the distributed memory abstraction and the ability to cache data in memory between steps: I believe its applicability to big data is limited, because we cannot give big heaps to the JVM.

  4. tariq on October 9th, 2014 1:39 pm

    You have a typo in line#4. It should be suggested and not “suggesed”

  5. Curt Monash on October 10th, 2014 4:20 am

    Fixed — thanks!!


  6. Notes on the Hortonworks IPO S-1 filing | DBMS 2 : DataBase Management System Services on December 7th, 2014 8:57 am

    […] Spark vs. Tez (October, 2014) […]
