I’ve heard a lot of buzz recently around Spark. So I caught up with Ion Stoica and Mike Franklin for a call. Let me start by acknowledging some sources of confusion.
- Spark is very new. All Spark adoption is recent.
- Databricks was founded to commercialize Spark. It is very much in stealth mode …
- … except insofar as Databricks folks are going out and trying to drum up Spark adoption.
- Ion Stoica is running Databricks, but you couldn’t tell that from his UC Berkeley bio page. Edit: After I posted this, Ion’s bio was quickly updated.
- Spark creator and Databricks CTO Matei Zaharia is an MIT professor, but actually went on leave from MIT before he ever showed up there.
- Cloudera is perhaps Spark’s most visible supporter. But Cloudera’s view of Spark’s role in the world is different from the Spark team’s.
The “What is Spark?” question may soon be just as difficult as the ever-popular “What is Hadoop?” That said — and referring back to my original technical post about Spark and also to a discussion of prominent Spark user ClearStory — my try at “What is Spark?” goes something like this:
- Spark is a distributed execution engine for analytic processes …
- … which works well with Hadoop.
- Spark is distinguished by a flexible in-memory data model …
- … and farms out persistence to HDFS (Hadoop Distributed File System) or other existing data stores.
- Intended analytic use cases for Spark include:
  - SQL data manipulation.
  - ETL-like data manipulation.
  - Streaming-like data manipulation.
  - Machine learning.
  - Graph analytics.
Except for certain low-latency operations,* anything you can do in Spark can also be done in straight Hadoop; Spark can simply have advantages in performance and programming ease. Spark RDDs (Resilient Distributed Datasets) are immutable at this time, so Spark is not suited for short-request update workloads.
*A new Spark task requires a thread, not a whole Java Virtual Machine.
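To make the immutability point a bit more concrete, here is a toy sketch in plain Python. It is not Spark's API; `ToyDataset` and `replace_value` are hypothetical names I made up for illustration. The only idea it tries to show is that transformations on an immutable dataset produce new datasets rather than editing records in place, which is why changing a single record is awkward.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class ToyDataset:
    """Hypothetical stand-in for an immutable, partitioned dataset (NOT Spark's API)."""

    partitions: Tuple[Tuple[int, ...], ...]

    def map(self, fn: Callable[[int], int]) -> "ToyDataset":
        # A transformation never edits data in place; it builds a new dataset.
        return ToyDataset(tuple(tuple(fn(x) for x in part) for part in self.partitions))

    def filter(self, pred: Callable[[int], bool]) -> "ToyDataset":
        # Same pattern: the original partitions are left untouched.
        return ToyDataset(tuple(tuple(x for x in part if pred(x)) for part in self.partitions))

    def collect(self) -> List[int]:
        # Flatten all partitions into one ordinary list.
        return [x for part in self.partitions for x in part]


def replace_value(ds: ToyDataset, old: int, new: int) -> ToyDataset:
    # "Updating" one record still means deriving a whole new dataset.
    return ds.map(lambda x: new if x == old else x)


base = ToyDataset(((1, 2, 3), (4, 5, 6)))
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
patched = replace_value(base, 3, 30)

print(evens_squared.collect())  # [4, 16, 36]
print(patched.collect())        # [1, 2, 30, 4, 5, 6]
print(base.collect())           # [1, 2, 3, 4, 5, 6] (the original is unchanged)
```

In this toy, a one-record change costs a pass over every partition. That is fine for analytic scans, and a poor fit for workloads made up of many small updates.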
Everybody agrees that machine learning is a top Spark use case. In particular:
- Cloudera sees machine learning as the major area of Spark adoption to date.
- Ion gave me the impression machine learning is one of the major areas of Spark adoption to date.
- Mike gave me the impression that machine learning was a core intended use case for Spark the first time we talked about it.
- There’s a machine learning library for Spark, and also a way to use Spark to do distributed R.
I believe data transformation is a major Spark use case as well.
- Ion gave me that impression, although Cloudera surprisingly did not. Edit: Actually, see Matt Brandwine’s comment below.
- I have one client (ClearStory) using Spark that way, and a second that’s likely to.
- It makes sense that the #1 Hadoop use case (to date), which is something Spark also is well-suited for, would be an important early Spark use case as well.
Spark Streaming is fairly new, but is already getting some adoption. Some initial notes on that:
- The actual technology is a form of micro-batching. I plan to learn more about that in the future.
- Cloudera sees streaming as one of the two big Spark use cases, and praises Spark Streaming for its fault tolerance and its great ease of coding.
- Mike Franklin knows a lot about streaming.
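Since I haven't yet dug into the details, the following is only a generic sketch of what micro-batching means, written in plain Python rather than against Spark Streaming's actual API. The names `micro_batches` and `count_words`, the one-second interval, and the sample events are all assumptions for illustration. The idea is that a stream is chopped into small time windows, and each window is then processed as an ordinary small batch.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, word); a made-up event shape


def micro_batches(events: Iterable[Event], interval: float) -> List[List[str]]:
    """Group events into consecutive fixed-length time windows.

    Windows that receive no events are simply skipped in this sketch.
    """
    batches: Dict[int, List[str]] = defaultdict(list)
    for timestamp, word in events:
        # Events whose timestamps fall in the same interval share a batch.
        batches[int(timestamp // interval)].append(word)
    return [batches[key] for key in sorted(batches)]


def count_words(batch: List[str]) -> Dict[str, int]:
    """An ordinary batch computation, applied to one small batch at a time."""
    counts: Dict[str, int] = {}
    for word in batch:
        counts[word] = counts.get(word, 0) + 1
    return counts


events = [(0.1, "a"), (0.4, "b"), (0.9, "a"), (1.2, "a"), (2.7, "c")]
results = [count_words(batch) for batch in micro_batches(events, interval=1.0)]

print(results)  # [{'a': 2, 'b': 1}, {'a': 1}, {'c': 1}]
```

One consequence visible even in this toy is that latency can't be lower than the batch interval, since nothing is emitted until a window is complete.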
Part of that story is a sudden decline in the reputation of Storm, whose troubles seem to include:
- Project founder and Twitter employee Nathan Marz seems no longer to be either associated with Storm or employed at Twitter.
- I am told that in general the Storm community is not all that vibrant.
- Various aspects of Storm’s technology are disappointing people.
Other notes on Spark use cases include:
- Impala-loving Cloudera doesn’t plan to support Shark. Duh.
- Cloudera also won’t at first support any Spark predictive modeling add-on.
- Ion’s other company, Conviva, is doing some real-time decisioning in Spark.
Spark data management has been enhanced by a project called Tachyon.* The main point of Tachyon is that Spark RDDs (Resilient Distributed Datasets) now persist in memory beyond the life of a job; besides offering the RDDs to other Spark jobs, Tachyon also opens them to Hadoop via an HDFS emulator.
*If there’s ever a Spark/Tachyon management suite, I hope some aspect is named Cherenkov — i.e., the radiation that is measured to detect the passage of tachyons.:)
And finally, some metrics and so on:
- Databricks has between 10 and 20 employees.
- Spark has >100 individual contributors from >25 different companies.
- There was a Spark Summit with >450 attendees (from >180 organizations), and an earlier Spark-mainly conference with >200 attendees.
- The Spark meet-up group in San Francisco has >1500 members signed up.
- Various Spark users and subprojects are identified on the Apache Spark pages.
- Most of the current substance on Databricks’ website is in its blog.