December 13, 2012

Introduction to Spark, Shark, BDAS and AMPLab

UC Berkeley’s AMPLab is working on a software stack that:

The whole thing has $30 million in projected funding (half government, half industry) and a 6-year plan (which they’re 2 years into).

Specific projects of note in all that include:

Mike Franklin* and his colleagues, who recently introduced me to all this, are focused on the database parts, including Spark and Shark. A recent slide deck gives details; Slide 11 in particular shows some of the project elements (I gather that everything on that slide is expected some time in 2013). A fuller accounting of project components may be found on the AMPLab website.

*Mike is the guy on whose work Truviso was based.

The most obvious improvements in Spark over MapReduce are:

The most obvious improvements in Shark over Hive are:

Not spilling intermediate results to disk is an important point. We normally think of this as a big deal in complex query execution, for example as an aspect of the design of Impala or Hadapt. But it’s perhaps even more important in iterative machine learning algorithms, which seem to be top-of-mind as a design point for the AMPLab guys.
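To make the iterative case concrete, here is a minimal sketch in plain Python. It does not use Spark or any AMPLab code; the class names `DiskDataset` and `CachedDataset` and the function `fit_slope` are hypothetical, invented for this illustration. It only shows why an algorithm that makes many passes over the same data benefits when that data stays in memory between passes rather than being re-read from disk each time.

```python
import json
import os
import tempfile


def write_points(path, points):
    """Write (x, y) pairs to disk, one JSON record per line."""
    with open(path, "w") as f:
        for x, y in points:
            f.write(json.dumps([x, y]) + "\n")


class DiskDataset:
    """Hypothetical dataset that re-reads its file on every pass."""

    def __init__(self, path):
        self.path = path
        self.disk_reads = 0  # number of full scans that hit the file

    def scan(self):
        self.disk_reads += 1
        with open(self.path) as f:
            return [tuple(json.loads(line)) for line in f]


class CachedDataset(DiskDataset):
    """Same data, but kept in memory after the first pass."""

    def __init__(self, path):
        super().__init__(path)
        self._cache = None

    def scan(self):
        if self._cache is None:
            self._cache = super().scan()  # only the first pass touches disk
        return self._cache


def fit_slope(dataset, iterations=20, learning_rate=0.01):
    """Gradient descent for y ~ w * x; each iteration is one full pass."""
    w = 0.0
    for _ in range(iterations):
        points = dataset.scan()
        gradient = sum(2 * (w * x - y) * x for x, y in points) / len(points)
        w -= learning_rate * gradient
    return w


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "points.jsonl")
        write_points(path, [(x, 3.0 * x) for x in range(1, 6)])
        for dataset in (DiskDataset(path), CachedDataset(path)):
            w = fit_slope(dataset)
            print(type(dataset).__name__, "slope:", w, "disk reads:", dataset.disk_reads)
```

With 20 iterations, the disk-backed version scans the file 20 times while the cached version scans it once, and both arrive at the same slope. On a toy file the difference is invisible, but the same ratio applied to a large distributed dataset is the cost this paragraph is describing.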

There seems to be quite a bit of interest in and even adoption of these projects. The AMPLab guys seemed more comfortable talking about that for the record via email, and so with permission I quote (lightly edited):

We’ve seen Spark used for a variety of analytics and statistical learning applications, mostly on Hadoop and Hive data. These range from replacing Hive or Pig for simple SQL queries, to anomaly detection, to interactive dashboards where users can drill into data. Two examples of companies that have talked publicly about their Spark use cases are:

  • Conviva (Ion Stoica’s video analytics company), one of our earlier users, which has used it to replace a large fraction of their queries.
  • Quantifind, a company that performs predictive analytics and text mining on social data to help marketers at large entertainment companies.

See http://data-informed.com/blog/2012/10/17/spark-an-open-source-engine-for-iterative-data-mining/ for a short writeup on both of these use cases. Other users we know about are performing web analytics and BI-like workloads.

Several companies have also contributed to the open source projects. For example, Yahoo! has contributed a JDBC server to Shark, and is working on a bytecode optimizer.

We have a growing user community. Our meetup group is approaching 500 members. To date, meetups have been hosted by Airbnb, Groupon, Yelp, Palantir, Conviva, and Klout. More details at http://www.meetup.com/spark-users/.

Finally, we held a Big Data bootcamp for industrial practitioners back in August that offered two days of training using Spark and Shark. The bootcamp was sold out for on-site attendance and 5000 people attended via online live streaming. Details at http://ampcamp.berkeley.edu.

You can find the list of public contributors to Spark and Shark at the following two GitHub pages:

and

I went through the list and identified the companies the contributors are associated with based on public information. Below is a partial list, roughly in order of lines of code contributed.

  • UC Berkeley AMPLab
  • Yahoo!
  • Conviva
  • Quantifind
  • Clearstory Data
  • Time Out (UK)
  • GoodData (Czech)
  • AdMobius
  • Nuxeo (France)
  • Princeton University

More on Spark and Shark technology in a separate post.

Comments

10 Responses to “Introduction to Spark, Shark, BDAS and AMPLab”

  1. Spark, Shark, and RDDs — technology notes | DBMS 2 : DataBase Management System Services on December 13th, 2012 5:54 pm

    [...] Spark and Shark are interesting alternatives to MapReduce and Hive. At a high level: [...]

  2. Matei Zaharia on December 14th, 2012 2:29 pm

    Thanks for writing this up, Curt. For more details on Spark and Shark, here are the homepages: http://spark-project.org and http://shark.cs.berkeley.edu.

  3. Robert Hodges on December 16th, 2012 2:53 am

    Thanks for covering AMPLab. The work on Shark/Spark is very innovative and looks as if it has excellent potential. The other interesting team in Soda Hall is Joe Hellerstein’s group, who have been doing work around CALM. That work is more relevant to OLTP processing, though it also has interesting applications to analytics. I hope you will have a chance to interview them the next time you visit UC Berkeley, assuming you have not done so already.

  4. Mike Franklin on December 16th, 2012 7:19 pm

    Robert,

    As you mention, our discussions with Curt have focused on a few components of the BDAS stack that have had recent releases.

    While AMPLab’s emphasis is on analytics, we do work on OLTP as well. Projects include the MDCC (Multi-Data Center Consistency) protocol, and the Probabilistically Bounded Staleness (PBS) framework – the latter of which is being done in collaboration with Joe and his group.

    AMPLab also has efforts on scalable Machine Learning and using Crowdsourcing and Human Computation for analytics. I wrote a recent blog post including some of these other projects at https://amplab.cs.berkeley.edu/2012/12/02/a-snapshot-of-database-research-in-the-amplab/

    Of course, this is just one effort in the Big Data area on campus. Suffice it to say that Curt could indeed find lots to write about in Soda Hall and elsewhere at Berkeley.

  5. Robert Hodges on December 17th, 2012 11:38 am

    Thanks for the link, Mike. The output from your lab, let alone Cal CS as a whole, is quite impressive. I didn’t know about Mesos but am looking at it now. Cheers, Robert

  6. Spark and Shark in the news | Spark on December 21st, 2012 1:37 pm

    [...] Monash, editor of the popular DBMS2 blog, wrote a great introduction to Spark and Shark, as well as a more detailed technical [...]

  7. Spark, Shark, and BDAS In the News | Andy Konwinski on February 19th, 2013 7:10 pm

    [...] Introduction to Spark, Shark, BDAS and AMPLab, December 13, 2012 [...]

  8. 大数据分析与列数据库 | Alex的个人Blog on February 28th, 2013 1:29 am

    [...] [13]http://www.dbms2.com/2012/12/13/introduction-to-spark-shark-bdas-and-amplab/ [...]

  9. Cloudera Hadoop strategy and usage notes | DBMS 2 : DataBase Management System Services on August 27th, 2013 5:15 pm

    [...] Charles is also seeing at least POC interest in Spark. [...]

  10. Spark and Databricks | DBMS 2 : DataBase Management System Services on February 2nd, 2014 1:51 pm

    [...] heard a lot of buzz recently around Spark. So I caught up with Ion Stoica and Mike Franklin for a call. Let me start by acknowledging some [...]
