December 13, 2012

Spark, Shark, and RDDs — technology notes

Spark and Shark are interesting alternatives to MapReduce and Hive. At a high level:

The key concept here seems to be the RDD. Any one RDD:

Otherwise, there’s a lot of flexibility; an RDD can be a set of tuples, a collection of XML documents, or whatever other reasonable kind of dataset you want. And I gather that:

Just like MapReduce, Spark wants to be fault-tolerant enough to work on clusters of dubiously-reliable hardware. Unlike MapReduce, Spark doesn’t persist intermediate result sets (unless they’re too large to fit into RAM). Rather, Spark’s main fault-tolerance strategy is:

Further, Reynold Xin emailed:

Spark [supports] speculative execution for dealing with stragglers. Speculation is particularly important for low-latency jobs, which are common in Spark.
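To make the straggler point concrete, here is a minimal sketch of speculative execution in plain Python. It is not Spark's scheduler code. The names `run_with_speculation`, `sum_partition`, `simulated_delays`, and `speculation_delay` are hypothetical, and a straggler is simulated with a sleep. The idea is only that if the first attempt at a task is still running after some delay, a backup copy is launched and whichever attempt finishes first supplies the result. This assumes the task is deterministic and safe to run twice.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def sum_partition(partition, delay_seconds):
    """Hypothetical task: sum one partition; the sleep simulates a slow node."""
    time.sleep(delay_seconds)
    return sum(partition)


def run_with_speculation(task, partition, simulated_delays, speculation_delay, pool):
    """Run task on partition, launching one backup attempt if the first is slow.

    simulated_delays is (primary_delay, backup_delay) and exists only to
    simulate a straggler in this sketch. Returns (result, attempts_launched).
    """
    attempts = [pool.submit(task, partition, simulated_delays[0])]
    # Give the primary attempt a head start before speculating.
    done, _ = wait(attempts, timeout=speculation_delay)
    if not done:
        # Primary looks like a straggler: launch a backup copy of the same task.
        attempts.append(pool.submit(task, partition, simulated_delays[1]))
        done, _ = wait(attempts, return_when=FIRST_COMPLETED)
    # The task is deterministic, so any finished attempt has the same result.
    winner = next(iter(done))
    return winner.result(), len(attempts)


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        # The primary attempt sleeps for 1 second; the backup returns at once.
        print(run_with_speculation(sum_partition, [1, 2, 3], (1.0, 0.0), 0.2, pool))
```

The losing attempt is simply left to finish here; a real scheduler would normally kill it. For short, low-latency jobs, one slow attempt can dominate total run time, which is presumably why speculation matters so much for them.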

Shark borrows a lot of Hive code to do what Hive does, only over Spark. Notes on Shark’s query planning include:

Further Shark smarts are to be added down the road.

And finally, Shark gives a columnar storage format to its RDDs, which has already been discussed on this blog.
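As a rough illustration of what "columnar" means for an in-memory dataset, here is a toy sketch in Python. It is not Shark's actual storage format, and the helper names `to_columns` and `column_sum` are made up for the example. Rows of tuples are pivoted into one list per column, so a query that reads a single column touches only that column's values.

```python
def to_columns(rows, column_names):
    """Pivot row tuples into a dict that maps each column name to a list of values."""
    columns = {name: [] for name in column_names}
    for row in rows:
        for name, value in zip(column_names, row):
            columns[name].append(value)
    return columns


def column_sum(columns, name):
    """Aggregate one column without touching the others."""
    return sum(columns[name])


if __name__ == "__main__":
    # Hypothetical sample data, invented for this sketch.
    rows = [("a", 1, 2.5), ("b", 2, 3.5), ("c", 3, 4.0)]
    columns = to_columns(rows, ("key", "count", "price"))
    print(column_sum(columns, "count"))
```

Grouping values of the same type together is also what usually makes columnar layouts compress well, although this sketch does no compression.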

Comments

9 Responses to “Spark, Shark, and RDDs — technology notes”

  1. Matei Zaharia on December 14th, 2012 2:30 pm

    For anyone interested in learning more about Spark and Shark, here are their homepages: http://spark-project.org, http://shark.cs.berkeley.edu.

  2. Spark and Shark in the news | Spark on December 21st, 2012 1:39 pm

    [...] Curt Monash, editor of the popular DBMS2 blog, wrote a great introduction to Spark and Shark, as well as a more detailed technical overview. [...]

  3. Spark, Shark, and BDAS In the News | Andy Konwinski on February 19th, 2013 7:11 pm

    [...] Spark, Shark, and RDDs — technology notes, December 13, 2012 [...]

  4. ClearStory, Spark, and Storm | DBMS 2 : DataBase Management System Services on September 29th, 2013 10:56 pm

    [...] Is a flagship user of Spark. [...]

  5. Spark, Shark, and RDDs | Sigmoid Analytics on December 6th, 2013 5:29 am

    [...] Read more here [...]

  6. Notes on memory-centric data management | DBMS 2 : DataBase Management System Services on January 3rd, 2014 4:36 am

    [...] is emphatically backing Shark. And a key aspect of Shark is that, unlike most of Hadoop, it’s [...]

  7. Spark and Databricks | DBMS 2 : DataBase Management System Services on February 2nd, 2014 1:51 pm

    [...] the ever-popular “What is Hadoop?” That said — and referring back to my original technical post about Spark and also to a discussion of prominent Spark user ClearStory — my try at “What is [...]

  8. Bob Schilmann on February 4th, 2014 2:55 pm

    I’m confused, is this related to Sparqlcity?

  9. Curt Monash on February 4th, 2014 5:23 pm

    Spark is no relation to Sparql.
