UC Berkeley’s AMPLab is working on a software stack that:
- Is meant (among other goals) to improve upon Hadoop …
- … but also to interoperate with it, and which in fact …
- … uses significant parts of Hadoop.
- Seems to have the overall name BDAS (Berkeley Data Analytics System).
The whole thing has $30 million in projected funding (half government, half industry) and a 6-year plan (which they’re 2 years into).
Specific projects of note in all that include:
- Mesos, a cluster manager. I don’t know much about Mesos, but it seems to be in production use, most notably at Twitter supporting Storm.
- Spark, a replacement for MapReduce and the associated execution stack.
- Shark, a replacement for Hive.
Mike Franklin* and his colleagues, who recently introduced me to all this, are focused on the database parts, including Spark and Shark. A recent slide deck gives details; Slide 11 in particular shows some of the project elements (I gather that everything on that slide is expected some time in 2013). A fuller accounting of project components may be found on the AMPLab website.
*Mike is the guy on whose work Truviso was based.
The most obvious improvements in Spark over MapReduce are:
- Richer and more flexible syntax, in that:
- You can do stuff beyond Map and Reduce.
- You can mix steps at will.
- An alternate approach to fault tolerance, in which data doesn’t have to be written to disk between steps.
The most obvious improvements in Shark over Hive are:
- It uses Spark, which performs better than MapReduce.
- It has columnar, in-memory data structures.
Not spilling intermediate results to disk is an important point. We normally think of this as a big deal in complex query execution, for example as an aspect of the design of Impala or Hadapt. But it’s perhaps even more important in iterative machine learning algorithms, which seem to be top-of-mind as a design point for the AMPLab guys.
There seems to be quite a bit of interest in and even adoption of these projects. The AMPLab guys seemed more comfortable talking about that for the record via email, and so with permission I quote (lightly edited):
We’ve seen Spark used for a variety of analytics and statistical learning applications, mostly on Hadoop and Hive data. These range from replacing Hive or Pig for simple SQL queries, to anomaly detection, to interactive dashboards where users can drill into data. Two examples of companies that have talked publicly about their Spark use cases are:
- Conviva (Ion Stoica’s video analytics company), one of our earlier users, which has used it to replace a large fraction of their queries.
- Quantifind, a company that performs predictive analytics and text mining on social data to help marketers at large entertainment companies.
See http://data-informed.com/blog/2012/10/17/spark-an-open-source-engine-for-iterative-data-mining/ for a short writeup on both of these use cases. Other users we know about are performing web analytics and BI-like workloads.
Several companies have also contributed to the open source projects. For example, Yahoo! has contributed a JDBC server to Shark, and is working on a bytecode optimizer.
We have a growing user community. Our meetup group is approaching 500 members. To date, meetups have been hosted by AirBnb, Groupon, Yelp, Palantir, Conviva, and Klout. More details at http://www.meetup.com/spark-users/.
Finally, we held a Big Data bootcamp for industrial practitioners back in August that offered two days of training using Spark and Shark. The bootcamp was sold out for on-site attendance and 5000 people attended via online live streaming. Details at http://ampcamp.berkeley.edu.
You can find the list of public contributors to Spark and Shark at the following two GitHub pages:
I went through the list and identified the companies the contributors are associated with based on public information. Below is the partial list, roughly in the order of lines of code contributed.
- UC Berkeley AMPLab
- Clearstory Data
- Time Out (UK)
- GoodData (Czech)
- Nuxeo (France)
- Princeton University
More on Spark and Shark technology in a separate post.