December 13, 2012

Spark, Shark, and RDDs — technology notes

Spark and Shark are interesting alternatives to MapReduce and Hive. At a high level:

The key concept here seems to be the RDD. Any one RDD:

Otherwise, there’s a lot of flexibility; an RDD can be a set of tuples, a collection of XML documents, or whatever other reasonable kind of dataset you want. And I gather that:

Just like MapReduce, Spark wants to be fault-tolerant enough to work on clusters of dubiously reliable hardware. Unlike MapReduce, Spark doesn’t persist intermediate result sets (unless they’re too large to fit into RAM). Rather, Spark’s main fault-tolerance strategy is:

Further, Reynold Xin emailed:

Spark [supports] speculative execution for dealing with stragglers. Speculation is particularly important for low-latency jobs, which are common in Spark.
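To make the idea of speculative execution concrete, here is a minimal sketch in plain Python. It is not Spark code and says nothing about how Spark's scheduler actually works. It only illustrates the general technique: launch several clones of the same step and keep whichever result arrives first. The names `run_speculatively` and `_run_clone` are hypothetical, and the `delays` parameter is an assumption that stands in for workers of differing speed.

```python
import concurrent.futures
import time


def _run_clone(task, delay):
    """Run one clone of the task; `delay` simulates a slow or healthy worker."""
    time.sleep(delay)
    return task()


def run_speculatively(task, delays):
    """Launch one clone of `task` per entry in `delays`.

    Returns (clone_index, result) for the first clone to finish.
    Clones that finish later are simply ignored.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(delays))
    futures = {pool.submit(_run_clone, task, d): i for i, d in enumerate(delays)}
    done, _pending = concurrent.futures.wait(
        list(futures), return_when=concurrent.futures.FIRST_COMPLETED
    )
    # Do not block on the stragglers; their results are discarded.
    pool.shutdown(wait=False)
    winner = next(iter(done))
    return futures[winner], winner.result()


# Clone 0 is a simulated straggler; clone 1 runs on a healthy worker.
clone_index, result = run_speculatively(lambda: sum(range(5)), delays=[0.5, 0.01])
print(clone_index, result)
```

The caller waits only as long as the fastest clone takes, which is why this kind of redundancy matters most when the whole job is meant to finish quickly.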

Shark borrows a lot of Hive code to do what Hive does, only over Spark. Notes on Shark’s query planning include:

Further Shark smarts are to be added down the road.

And finally, Shark gives a columnar storage format to its RDDs, which has already been discussed on this blog.
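As a rough illustration of what "columnar" means here, the following Python sketch pivots a handful of row tuples into one packed sequence per column. This is an assumption-laden toy, not Shark's actual storage format: the sample rows, the column names, and the helpers `to_columns` and `total_revenue` are all hypothetical.

```python
from array import array

# Hypothetical rows: (user_id, country, revenue)
rows = [
    (1, "US", 10.0),
    (2, "DE", 7.5),
    (3, "US", 2.5),
]


def to_columns(rows):
    """Pivot row tuples into one sequence per column."""
    user_ids, countries, revenues = zip(*rows)
    return {
        "user_id": array("q", user_ids),  # packed 64-bit signed integers
        "country": list(countries),       # strings stay in a plain list
        "revenue": array("d", revenues),  # packed double-precision floats
    }


def total_revenue(columns):
    """Aggregate a single column without touching the others."""
    return sum(columns["revenue"])


columns = to_columns(rows)
print(total_revenue(columns))
```

The point of the layout is that a query over one field reads one compact, homogeneous sequence instead of walking every field of every row.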


11 Responses to “Spark, Shark, and RDDs — technology notes”

  1. Matei Zaharia on December 14th, 2012 2:30 pm

    For anyone interested in learning more about Spark and Shark, here are their homepages:

  2. Spark and Shark in the news | Spark on December 21st, 2012 1:39 pm

    […] Curt Monash, editor of the popular DBMS2 blog, wrote a great introduction to Spark and Shark, as well as a more detailed technical overview. […]

  3. Spark, Shark, and BDAS In the News | Andy Konwinski on February 19th, 2013 7:11 pm

    […] Spark, Shark, and RDDs — technology notes, December 13, 2012 […]

  4. ClearStory, Spark, and Storm | DBMS 2 : DataBase Management System Services on September 29th, 2013 10:56 pm

    […] Is a flagship user of Spark. […]

  5. Notes on memory-centric data management | DBMS 2 : DataBase Management System Services on January 3rd, 2014 4:36 am

    […] is emphatically backing Shark. And a key aspect of Shark is that, unlike most of Hadoop, it’s […]

  6. Spark and Databricks | DBMS 2 : DataBase Management System Services on February 2nd, 2014 1:51 pm

    […] the ever-popular “What is Hadoop?” That said — and referring back to my original technical post about Spark and also to a discussion of prominent Spark user ClearStory — my try at “What is […]

  7. Bob Schilmann on February 4th, 2014 2:55 pm

    I’m confused, is this related to Sparqlcity?

  8. Curt Monash on February 4th, 2014 5:23 pm

    Spark is no relation to Sparql.

  9. Optimism, pessimism, and fatalism — fault-tolerance, Part 2 | DBMS 2 : DataBase Management System Services on June 8th, 2014 12:58 pm

    […] both Hadoop and Spark support speculative execution, in which several clones of a processing step are executed at once […]

  10. Basho and Riak | DBMS 2 : DataBase Management System Services on October 15th, 2015 11:18 am

    […] something better than ZooKeeper for cluster management, and I presume some help in persisting Spark RDDs as […]

  11. Spark on fire | DBMS 2 : DataBase Management System Services on July 31st, 2016 7:20 am

    […] A clever approach to fault-tolerance. […]
