December 13, 2012

Introduction to Spark, Shark, BDAS and AMPLab

UC Berkeley’s AMPLab is working on a software stack that:

Is meant (among other goals) to improve upon Hadoop …
… but also to interoperate with it, and which in fact …
… uses significant parts of Hadoop.
Seems to have the overall name BDAS (Berkeley Data Analytics System).

The whole thing has $30 million in projected funding (half government, half industry) and a 6-year plan (which they’re 2 years into).

Specific projects of note in all that include:

Mesos, a cluster manager. I don’t know much about Mesos, but it seems to be in production use, most notably at Twitter supporting Storm.
Spark, a replacement for MapReduce and the associated execution stack.
Shark, a replacement for Hive.

Mike Franklin* and his colleagues, who recently introduced me to all this, are focused on the database parts, including Spark and Shark. A recent slide deck gives details; Slide 11 in particular shows some of the project elements (I gather that everything on that slide is expected some time in 2013). A fuller accounting of project components may be found on the AMPLab website.

*Mike is the guy on whose work Truviso was based.

The most obvious improvements in Spark over MapReduce are:

Richer and more flexible syntax, in that:
- You can do stuff beyond Map and Reduce.
- You can mix steps at will.
An alternate approach to fault tolerance, in which data doesn’t have to be written to disk between steps.
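
To make those points concrete, here is a toy, pure-Python sketch. It is not Spark's API; `ToyDataset` and its methods are names I made up for illustration. It assumes the fault-tolerance idea is that a dataset remembers the steps that produced it, so a lost result can be recomputed from the source rather than re-read from an intermediate file on disk.

```python
from itertools import chain


class ToyDataset:
    """Toy stand-in for a distributed dataset (hypothetical, NOT Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source        # zero-argument function returning the base records
        self._steps = tuple(steps)   # recorded transformations, i.e. the "lineage"

    def _with_step(self, kind, fn):
        # Each transformation returns a new dataset with one more recorded step.
        return ToyDataset(self._source, self._steps + ((kind, fn),))

    def map(self, fn):
        return self._with_step("map", fn)

    def filter(self, pred):
        return self._with_step("filter", pred)

    def flat_map(self, fn):
        return self._with_step("flat_map", fn)

    def compute(self):
        """Replay the recorded steps in memory; nothing is written to disk between them."""
        records = iter(self._source())
        for kind, fn in self._steps:
            if kind == "map":
                records = map(fn, records)
            elif kind == "filter":
                records = filter(fn, records)
            else:  # flat_map: each input record may yield several output records
                records = chain.from_iterable(map(fn, records))
        return list(records)


lines = ["spark shark", "mesos", "spark hive"]

# Steps beyond Map and Reduce, mixed in whatever order the job needs.
words = (
    ToyDataset(lambda: lines)
    .flat_map(str.split)
    .filter(lambda w: w != "hive")
    .map(str.upper)
)

result = words.compute()

# If `result` were lost, the same recorded steps rebuild it from the source.
rebuilt = words.compute()
```

The design choice worth noticing is that the transformations are only recorded until `compute()` is called, which is what makes replaying them after a failure possible.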

The most obvious improvements in Shark over Hive are:

It uses Spark, which performs better than MapReduce.
It has columnar, in-memory data structures.
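
As a rough illustration of why a columnar, in-memory layout helps analytic queries, here is a minimal Python sketch. The table, column names, and helper functions are all invented for the example, and this says nothing about how Shark actually lays out its data. The idea shown is only that an aggregate over one column can scan a single compact, typed array instead of walking every field of every row.

```python
from array import array

# Row-oriented layout: each record carries all of its fields together.
rows = [
    {"user": "a", "clicks": 3, "secs": 1.5},
    {"user": "b", "clicks": 7, "secs": 0.5},
    {"user": "c", "clicks": 2, "secs": 4.0},
]

# Column-oriented layout: one array per column; numeric columns are typed and compact.
columns = {
    "user": [r["user"] for r in rows],
    "clicks": array("q", (r["clicks"] for r in rows)),  # 64-bit signed integers
    "secs": array("d", (r["secs"] for r in rows)),      # 64-bit floats
}


def total_clicks_from_rows(table):
    # Touches every row object just to read one field from each.
    return sum(r["clicks"] for r in table)


def total_clicks_from_columns(table):
    # Scans one contiguous array and never looks at the other columns.
    return sum(table["clicks"])


row_total = total_clicks_from_rows(rows)
column_total = total_clicks_from_columns(columns)
```

Both functions return the same answer; the difference is how much data each one has to touch to get there.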

Not spilling intermediate results to disk is an important point. We normally think of this as a big deal in complex query execution, for example as an aspect of the design of Impala or Hadapt. But it’s perhaps even more important in iterative machine learning algorithms, which seem to be top-of-mind as a design point for the AMPLab guys.
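
The iterative case can be sketched in a few lines of Python. Everything here is hypothetical: `load_points` stands in for an expensive read from disk or HDFS, and the "algorithm" is a one-parameter gradient descent chosen only because it loops over the same data many times. The sketch counts how often the data is loaded when each iteration re-reads it versus when the working set is kept in memory.

```python
def load_points():
    """Hypothetical stand-in for an expensive read from disk or HDFS."""
    load_points.calls += 1
    return [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # points on the line y = 2x


load_points.calls = 0


def fit_slope(get_points, iterations=50, rate=0.05):
    """Fit y = w * x by gradient descent, asking for the data on every iteration."""
    w = 0.0
    for _ in range(iterations):
        points = get_points()
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in points) / len(points)
        w -= rate * grad
    return w


# Style 1: every iteration goes back to storage.
w_reload = fit_slope(load_points)
reload_calls = load_points.calls

# Style 2: load once, then iterate over the in-memory copy.
load_points.calls = 0
cached = load_points()
w_cached = fit_slope(lambda: cached)
cached_calls = load_points.calls
```

The fitted slope is the same either way; only the number of trips to storage changes, and that number grows with the iteration count in the first style.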

There seems to be quite a bit of interest in and even adoption of these projects. The AMPLab guys seemed more comfortable talking about that for the record via email, and so with permission I quote (lightly edited):

We’ve seen Spark used for a variety of analytics and statistical learning applications, mostly on Hadoop and Hive data. These range from replacing Hive or Pig for simple SQL queries, to anomaly detection, to interactive dashboards where users can drill into data. Two examples of companies that have talked publicly about their Spark use cases are:

Conviva (Ion Stoica’s video analytics company), one of our earlier users, which has used it to replace a large fraction of their queries.

Quantifind, a company that performs predictive analytics and text mining on social data to help marketers at large entertainment companies.

See http://data-informed.com/blog/2012/10/17/spark-an-open-source-engine-for-iterative-data-mining/ for a short writeup on both of these use cases. Other users we know about are performing web analytics and BI-like workloads.

Several companies have also contributed to the open source projects. For example, Yahoo! has contributed a JDBC server to Shark, and is working on a bytecode optimizer.

We have a growing user community. Our meetup group is approaching 500 members. To date, meetups have been hosted by Airbnb, Groupon, Yelp, Palantir, Conviva, and Klout. More details at http://www.meetup.com/spark-users/.

Finally, we held a Big Data bootcamp for industrial practitioners back in August that offered two days of training using Spark and Shark. The bootcamp was sold out for on-site attendance and 5000 people attended via online live streaming. Details at http://ampcamp.berkeley.edu.

You can find the list of public contributors to Spark and Shark at the following two GitHub pages:

https://github.com/mesos/spark/graphs/contributors

and

https://github.com/amplab/shark/graphs/contributors

I went through the list and identified the companies the contributors are associated with based on public information. Below is the partial list, roughly in the order of lines of code contributed.

UC Berkeley AMPLab

Yahoo!

Conviva

Quantifind

ClearStory Data

Time Out (UK)

GoodData (Czech)

AdMobius

Nuxeo (France)

Princeton University

More on Spark and Shark technology in a separate post.



Comments

11 Responses to “Introduction to Spark, Shark, BDAS and AMPLab”

Spark, Shark, and RDDs — technology notes | DBMS 2 : DataBase Management System Services on December 13th, 2012 5:54 pm

[…] Spark and Shark are interesting alternatives to MapReduce and Hive. At a high level: […]
Matei Zaharia on December 14th, 2012 2:29 pm

Thanks for writing this up, Curt. For more details on Spark and Shark, here are the homepages: http://spark-project.org and http://shark.cs.berkeley.edu.
Robert Hodges on December 16th, 2012 2:53 am

Thanks for covering AMPLab. The work on Shark/Spark is very innovative and looks as if it has excellent potential. The other interesting team in Soda Hall is Joe Hellerstein’s group, who have been doing work around CALM. That work is more relevant to OLTP processing, though it also has interesting applications to analytics. I hope you will have a chance to interview them the next time you visit UC Berkeley, assuming you have not done so already.
Mike Franklin on December 16th, 2012 7:19 pm

Robert,

As you mention, our discussions with Curt have focused on a few components of the BDAS stack that have had recent releases.

While AMPLab’s emphasis is on analytics, we do work on OLTP as well. Projects include the MDCC (Multi-Data Center Consistency) protocol, and the Probabilistically Bounded Staleness (PBS) framework – the latter of which is being done in collaboration with Joe and his group.

AMPLab also has efforts on scalable Machine Learning and using Crowdsourcing and Human Computation for analytics. I wrote a recent blog post including some of these other projects at https://amplab.cs.berkeley.edu/2012/12/02/a-snapshot-of-database-research-in-the-amplab/

Of course, this is just one effort in the Big Data area on campus. Suffice it to say that Curt could indeed find lots to write about in Soda Hall and elsewhere at Berkeley.
Robert Hodges on December 17th, 2012 11:38 am

Thanks for the link, Mike. The output from your lab, let alone Cal CS as a whole, is quite impressive. I didn’t know about Mesos but am looking at it now. Cheers, Robert
Spark and Shark in the news | Spark on December 21st, 2012 1:37 pm

[…] Monash, editor of the popular DBMS2 blog, wrote a great introduction to Spark and Shark, as well as a more detailed technical […]
Spark, Shark, and BDAS In the News | Andy Konwinski on February 19th, 2013 7:10 pm

[…] Introduction to Spark, Shark, BDAS and AMPLab, December 13, 2012 […]
大数据分析与列数据库 | Alex的个人Blog on February 28th, 2013 1:29 am

[…] [13]http://www.dbms2.com/2012/12/13/introduction-to-spark-shark-bdas-and-amplab/ […]
Cloudera Hadoop strategy and usage notes | DBMS 2 : DataBase Management System Services on August 27th, 2013 5:15 pm

[…] Charles is also seeing at least POC interest in Spark. […]
Spark and Databricks | DBMS 2 : DataBase Management System Services on February 2nd, 2014 1:51 pm

[…] heard a lot of buzz recently around Spark. So I caught up with Ion Stoica and Mike Franklin for a call. Let me start by acknowledging some […]