Spark and Databricks
I’ve heard a lot of buzz recently around Spark. So I caught up with Ion Stoica and Mike Franklin for a call. Let me start by acknowledging some sources of confusion.
- Spark is very new. All Spark adoption is recent.
- Databricks was founded to commercialize Spark. It is very much in stealth mode …
- … except insofar as Databricks folks are going out and trying to drum up Spark adoption. 🙂
- Ion Stoica is running Databricks, but you couldn’t tell that from his UC Berkeley bio page. Edit: After I posted this, Ion’s bio was quickly updated. 🙂
- Spark creator and Databricks CTO Matei Zaharia is an MIT professor, but actually went on leave there before he ever showed up.
- Cloudera is perhaps Spark’s most visible supporter. But Cloudera’s view of Spark’s role in the world is different from the Spark team’s.
The “What is Spark?” question may soon be just as difficult as the ever-popular “What is Hadoop?” That said — and referring back to my original technical post about Spark and also to a discussion of prominent Spark user ClearStory — my try at “What is Spark?” goes something like this:
- Spark is a distributed execution engine for analytic processes …
- … which works well with Hadoop.
- Spark is distinguished by a flexible in-memory data model …
- … and farms out persistence to HDFS (Hadoop Distributed File System) or other existing data stores.
- Intended analytic use cases for Spark include:
- SQL data manipulation.
- ETL-like data manipulation.
- Streaming-like data manipulation.
- Machine learning.
- Graph analytics.
Except for certain low-latency operations,* anything you can do in Spark can also be done in straight Hadoop; Spark can simply offer advantages in performance and programming ease. Spark RDDs (Resilient Distributed Datasets) are immutable at this time, so Spark is not suited for short-request update workloads.
*A new Spark task requires a thread, not a whole Java Virtual Machine.
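To make the immutability point concrete, here is a minimal plain-Python sketch. It is not Spark's API; `ToyDataset` and its methods are hypothetical names of my own, chosen only to show the idea that a transformation returns a new dataset instead of updating the old one in place.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class ToyDataset:
    """Hypothetical stand-in for an immutable dataset; not Spark's RDD class."""
    rows: Tuple[int, ...]

    def map(self, fn: Callable[[int], int]) -> "ToyDataset":
        # Build a brand-new dataset; the original rows are left untouched.
        return ToyDataset(tuple(fn(r) for r in self.rows))

    def filter(self, pred: Callable[[int], bool]) -> "ToyDataset":
        # Same pattern: derive a new dataset rather than deleting rows in place.
        return ToyDataset(tuple(r for r in self.rows if pred(r)))


base = ToyDataset((1, 2, 3, 4))
doubled = base.map(lambda r: r * 2)
big = doubled.filter(lambda r: r > 4)
```

Changing a single row means deriving a whole new dataset, which is fine for analytic passes over lots of data but a poor fit for many small updates.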
Everybody agrees that machine learning is a top Spark use case. In particular:
- Cloudera sees machine learning as the major area of Spark adoption to date.
- Ion gave me the impression machine learning is one of the major areas of Spark adoption to date.
- Mike gave me the impression that machine learning was a core intended use case for Spark the first time we talked about it.
- There’s a machine learning library for Spark, and also a way to use Spark to do distributed R.
I believe data transformation is a major Spark use case as well.
- Ion gave me that impression, although Cloudera surprisingly did not. Edit: Actually, see Matt Brandwine’s comment below.
- I have one client (ClearStory) using Spark that way, and a second that’s likely to.
- It makes sense that the #1 Hadoop use case (to date), which is something Spark also is well-suited for, would be an important early Spark use case as well.
Spark Streaming is fairly new, but is already getting some adoption. Notes on that start:
- The actual technology is a form of micro-batching. I plan to learn more about that in the future.
- Cloudera sees streaming as one of the two big Spark use cases, and praises Spark Streaming for its fault tolerance and its great ease of coding.
- Mike Franklin knows a lot about streaming.
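As a rough illustration of what micro-batching means in general, here is a plain-Python sketch. It is not Spark Streaming code, and the `micro_batches` function, the sample events, and the one-second interval are all assumptions of mine. The idea is that incoming events are grouped into small fixed-width time windows, and each window is then processed as an ordinary small batch.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def micro_batches(events: Iterable[Tuple[float, int]], interval: float) -> List[List[int]]:
    """Group (timestamp, value) events into consecutive fixed-width time windows.

    Windows that received no events are skipped in this toy version.
    """
    buckets = defaultdict(list)
    for ts, value in events:
        # Events whose timestamps fall in the same interval share a bucket.
        buckets[int(ts // interval)].append(value)
    return [buckets[key] for key in sorted(buckets)]


# Hypothetical event stream: (seconds since start, value).
events = [(0.1, 5), (0.7, 3), (1.2, 4), (2.9, 1), (2.95, 2)]

# Each micro-batch is handled with ordinary batch logic, here a simple sum.
batch_sums = [sum(batch) for batch in micro_batches(events, interval=1.0)]
```

The trade-off is latency: results arrive once per interval rather than once per event, in exchange for being able to reuse batch-style code.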
Part of that story is a sudden decline in the reputation of Storm, whose troubles seem to include:
- Project founder Nathan Marz no longer seems to be associated with Storm or employed at Twitter.
- I am told that in general the Storm community is not all that vibrant.
- Various aspects of Storm’s technology are disappointing people.
Other notes on Spark use cases include:
- Impala-loving Cloudera doesn’t plan to support Shark. Duh.
- Cloudera also won’t at first support any Spark predictive modeling add-on.
- Ion’s other company, Conviva, is doing some real-time decisioning in Spark.
Spark data management has been enhanced by a project called Tachyon.* The main point of Tachyon is that Spark RDDs (Resilient Distributed Datasets) now persist in memory beyond the life of a job; besides offering the RDDs to other Spark jobs, Tachyon also opens them to Hadoop via an HDFS emulator.
*If there’s ever a Spark/Tachyon management suite, I hope some aspect is named Cherenkov — i.e., the radiation that is measured to detect the passage of tachyons.:)
And finally, some metrics and so on:
- Databricks has between 10 and 20 employees.
- Spark has >100 individual contributors from >25 different companies.
- There was a Spark Summit with >450 attendees (from >180 organizations), and an earlier Spark-mainly conference with >200 attendees.
- The Spark meet-up group in San Francisco has >1500 members signed up.
- Various Spark users and subprojects are identified on the Apache Spark pages.
Related link
- Most of the current substance on Databricks’ website is in its blog.
Comments
16 Responses to “Spark and Databricks”
Curt, great post. Re: “I believe data transformation is a major Spark use case as well. Ion gave me that impression, although Cloudera surprisingly did not.” we do, in fact, also expect Spark to be used for core data processing workloads. As the tools/apps community builds atop Spark (e.g. ClearStory) this will become more true. Today the longest pole for the Hadoop ecosystem is interactive analysis, thus the initial interest and focus there.
While he doesn’t explicitly call out data processing, Mike Olson comments on the future of Spark and MapReduce in this blog post: http://vision.cloudera.com/mapreduce-spark/
>>If there’s ever a Spark/Tachyon management suite, >>I hope some aspect is named Cherenkov — i.e., the >>radiation that is measured to detect the passage of tachyons.:)
I am impressed with your knowledge level in theoretical physics 🙂
I started college as a physics major, and ultimately came within a couple courses of finishing it (as a double major with math). The main thing I avoided by dropping the major was electronics lab. 🙂
Some comments:
— Spark is clearly superior to MR for iterative machine learning, since it can hold data in memory while MR persists after each pass through the data. Some machine learning algorithms are iterative, others are not. The performance delta for logistic regression (which is iterative) is 100X versus MR
— Mahout is so done that speakers at the Spark Summit didn’t feel a need to stick a fork in it
— I don’t hear many people asking for high-latency analytics
— Part of the “sell” for Spark is a single platform for advanced analytics encompassing Fast Queries, ML, Streaming, Graph Engines etc instead of multiple point solutions (Giraph, Storm, Mahout etc). It’s a nice theory, but as you note Cloudera isn’t too keen on promoting Shark, and Spark’s Graph Engine has work to do to displace GraphLab
— R pushdown is still nascent. AMPLab just released SparkR, (http://portfortune.wordpress.com/2014/01/17/r-interface-to-apache-spark/) which has generated a lot of interest but is too new to say much about. A startup called Adatao has developed a proprietary R interface that can push down embarrassingly parallel functions.
— The Spark Summit was excellent, well worth attending.
Spark really is super exciting. I recommend both the intro RDD and the streaming papers; they’re very accessible and make a strong case for why vanilla MR should be tossed (http://spark.incubator.apache.org/research.html).
The unified abstraction for big data batch and stream processing is particularly compelling. Fast/slow paths are often dealt with separately in big data systems and it’s nice to see the two flows start to unify.
The biggest hindrance to Spark right now is its newness. It’s also Scala-first, but exposes well to Java and Python so that’s not a blocker. Lack of detail on how the transformations are implemented is annoying but something that documentation (or reading the source code) can solve.
Give Spark another year or two to mature, and there won’t be any reason to pick vanilla MapReduce for new projects.
Hi Curt,
I am going through the articles/documents explaining how to get started with Spark. I have a big data problem that I want to solve using Spark and Spark Streaming.
Being a big data noob (my question will say it for me), I don’t understand everything I need to install to make Spark work in a distributed computing environment.
i.e. a) Do I install Spark on each cluster machine?
b) Do I need to install some other HDFS-providing piece on each cluster first before I can use Spark on them?
c) What should one install if they don’t have any existing dependency on Hadoop, HBase, Mesos, Tachyon or the other layers that are often talked about?
Thanks
Manas
Hey Manas,
You can either follow the documentation on the Spark site:
http://spark.incubator.apache.org/docs/latest/
Or, for a more automated experience, you can easily install Spark in Hadoop with Cloudera Manager, e.g.:
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/4.8.1/Cloudera-Manager-Installation-Guide/cmig_spark_installation_standalone.html
Some community resources for you, too:
* Spark Community: http://spark.incubator.apache.org/community.html
* Cloudera Spark Community: http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/bd-p/Spark
If you need Hadoop in the first place, the Cloudera QuickStart VM is a great place to start:
http://go.cloudera.com/vm-download
Hope this helps,
Matt
Great help. Thanks Matt.
How does Apache Drill compare to Spark? Will Drill fade away? Or does it serve a different purpose?