February 2, 2014

Spark and Databricks

I’ve heard a lot of buzz recently around Spark. So I caught up with Ion Stoica and Mike Franklin for a call. Let me start by acknowledging some sources of confusion.

Spark is very new. All Spark adoption is recent.
Databricks was founded to commercialize Spark. It is very much in stealth mode …
… except insofar as Databricks folks are going out and trying to drum up Spark adoption. 🙂
Ion Stoica is running Databricks, but you couldn’t tell that from his UC Berkeley bio page. Edit: After I posted this, Ion’s bio was quickly updated. 🙂
Spark creator and Databricks CTO Matei Zaharia is an MIT professor, but actually went on leave there before he ever showed up.
Cloudera is perhaps Spark’s most visible supporter. But Cloudera’s view of Spark’s role in the world is different from the Spark team’s.

The “What is Spark?” question may soon be just as difficult as the ever-popular “What is Hadoop?” That said — and referring back to my original technical post about Spark and also to a discussion of prominent Spark user ClearStory — my try at “What is Spark?” goes something like this:

Spark is a distributed execution engine for analytic processes …
… which works well with Hadoop.
Spark is distinguished by a flexible in-memory data model …
… and farms out persistence to HDFS (Hadoop Distributed File System) or other existing data stores.
Intended analytic use cases for Spark include:
- SQL data manipulation.
- ETL-like data manipulation.
- Streaming-like data manipulation.
- Machine learning.
- Graph analytics.

Except for certain low-latency operations,* anything you can do in Spark can also be done in straight Hadoop; Spark can simply have advantages in performance and programming ease. Spark RDDs (Resilient Distributed Datasets) are immutable at this time, so Spark is not suited for short-request update workloads.

*A new Spark task requires a thread, not a whole Java Virtual Machine.
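To illustrate the immutability point above, here is a minimal sketch in plain Python. It is not Spark's API; the `ImmutableDataset` class and its `map` and `filter` methods are hypothetical stand-ins, assumed only for illustration. The idea it shows is that each transformation yields a new dataset and leaves the old one untouched, which suits analytic pipelines but not in-place updates of individual records.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class ImmutableDataset:
    """Hypothetical stand-in for an immutable dataset; not Spark's RDD API."""
    records: Tuple[int, ...]

    def map(self, fn: Callable[[int], int]) -> "ImmutableDataset":
        # Build a new dataset; the original records are left untouched.
        return ImmutableDataset(tuple(fn(r) for r in self.records))

    def filter(self, pred: Callable[[int], bool]) -> "ImmutableDataset":
        # Likewise returns a new dataset rather than deleting in place.
        return ImmutableDataset(tuple(r for r in self.records if pred(r)))


base = ImmutableDataset((1, 2, 3, 4))
doubled = base.map(lambda r: r * 2)
large = doubled.filter(lambda r: r > 4)

print(base.records)     # the original dataset is unchanged
print(doubled.records)
print(large.records)
```

Because `base` is never modified, changing a single record would mean deriving a whole new dataset, which is why short-request update workloads are a poor fit for this model.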

Everybody agrees that machine learning is a top Spark use case. In particular:

Cloudera sees machine learning as the major area of Spark adoption to date.
Ion gave me the impression machine learning is one of the major areas of Spark adoption to date.
Mike gave me the impression that machine learning was a core intended use case for Spark the first time we talked about it.
There’s a machine learning library for Spark, and also a way to use Spark to do distributed R.

I believe data transformation is a major Spark use case as well.

Ion gave me that impression, although Cloudera surprisingly did not. Edit: Actually, see Matt Brandwine’s comment below.
I have one client (ClearStory) using Spark that way, and a second that’s likely to.
It makes sense that the #1 Hadoop use case (to date), which is something Spark also is well-suited for, would be an important early Spark use case as well.

Spark Streaming is fairly new, but is already getting some adoption. Notes on that start:

The actual technology is a form of micro-batching. I plan to learn more about that in the future.
Cloudera sees streaming as one of the two big Spark use cases, and praises Spark Streaming for its fault tolerance and its great ease of coding.
Mike Franklin knows a lot about streaming.
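As a rough illustration of what micro-batching means in general, the sketch below groups timestamped events into fixed-length windows and then runs an ordinary batch computation on each window. It is plain Python under stated assumptions, not Spark Streaming's API: the `micro_batches` and `word_counts` helpers, the event format, and the one-second interval are all hypothetical. A real system would process each batch as it arrives rather than binning a finished list.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, payload); assumed format


def micro_batches(events: Iterable[Event], interval: float) -> List[List[str]]:
    """Group events into consecutive fixed-length time windows.

    Windows that received no events are skipped in this sketch.
    """
    buckets: Dict[int, List[str]] = defaultdict(list)
    for timestamp, payload in events:
        # Window index = how many whole intervals fit before the timestamp.
        buckets[int(timestamp // interval)].append(payload)
    return [buckets[k] for k in sorted(buckets)]


def word_counts(batch: List[str]) -> Dict[str, int]:
    """An ordinary batch job, applied to one small batch at a time."""
    counts: Dict[str, int] = defaultdict(int)
    for word in batch:
        counts[word] += 1
    return dict(counts)


events = [(0.2, "a"), (0.7, "b"), (1.1, "a"), (1.9, "a"), (2.5, "c")]
results = [word_counts(batch) for batch in micro_batches(events, interval=1.0)]
print(results)
```

The design trade-off is latency versus reuse: results appear only once per interval, but each batch can be handled by the same code used for regular batch jobs.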

Part of that story is a sudden decline in the reputation of Storm, whose troubles seem to include:

Project founder Nathan Marz seems to be no longer associated with Storm or employed at Twitter.
I am told that in general the Storm community is not all that vibrant.
Various aspects of Storm’s technology are disappointing people.

Other notes on Spark use cases include:

Impala-loving Cloudera doesn’t plan to support Shark. Duh.
Cloudera also won’t at first support any Spark predictive modeling add-on.
Ion’s other company, Conviva, is doing some real-time decisioning in Spark.

Spark data management has been enhanced by a project called Tachyon.* The main point of Tachyon is that Spark RDDs (Resilient Distributed Datasets) now persist in memory beyond the life of a job; besides offering the RDDs to other Spark jobs, Tachyon also opens them to Hadoop via an HDFS emulator.

*If there’s ever a Spark/Tachyon management suite, I hope some aspect is named Cherenkov — i.e., the radiation that is measured to detect the passage of tachyons.:)

And finally, some metrics and so on:

Databricks has between 10 and 20 employees.
Spark has >100 individual contributors from >25 different companies.
There was a Spark Summit with >450 attendees (from >180 organizations), and an earlier Spark-mainly conference with >200 attendees.
The Spark meet-up group in San Francisco has >1500 members signed up.
Various Spark users and subprojects are identified on the Apache Spark pages.

Related link

Most of the current substance on Databricks’ website is in its blog.

Categories: Cloudera, Databricks, Spark and BDAS, EAI, EII, ETL, ELT, ETLT, Hadoop, Market share and customer counts, Memory-centric data management, Predictive modeling and advanced analytics, RDF and graphs, Streaming and complex event processing (CEP)


Comments

16 Responses to “Spark and Databricks”

Some stuff I’m thinking about (early 2014) | DBMS 2 : DataBase Management System Services on February 2nd, 2014 1:51 pm

[…] Spark and other memory-centric technology, including streaming. […]
Matt Brandwein on February 2nd, 2014 2:18 pm

Curt, great post. Re: “I believe data transformation is a major Spark use case as well. Ion gave me that impression, although Cloudera surprisingly did not.” we do, in fact, also expect Spark to be used for core data processing workloads. As the tools/apps community builds atop Spark (e.g. ClearStory) this will become more true. Today the longest pole for the Hadoop ecosystem is interactive analysis, thus the initial interest and focus there.

While he doesn’t explicitly call out data processing, Mike Olson comments on the future of Spark and MapReduce in this blog post: http://vision.cloudera.com/mapreduce-spark/
Vlad Rodionov on February 3rd, 2014 4:36 pm

>>If there’s ever a Spark/Tachyon management suite, >>I hope some aspect is named Cherenkov — i.e., the >>radiation that is measured to detect the passage of tachyons.:)
I am impressed with your knowledge level in theoretical physics 🙂
Curt Monash on February 3rd, 2014 4:48 pm

I started college as a physics major, and ultimately came within a couple courses of finishing it (as a double major with math). The main thing I avoided by dropping the major was electronics lab. 🙂
Thomas W. Dinsmore on February 6th, 2014 8:48 pm

Some comments:

— Spark is clearly superior to MR for iterative machine learning, since it can hold data in memory while MR persists after each pass through the data. Some machine learning algorithms are iterative, others are not. The performance delta for logistic regression (which is iterative) is 100X versus MR

— Mahout is so done that speakers at the Spark Summit didn’t feel a need to stick a fork in it

— I don’t hear many people asking for high-latency analytics

— Part of the “sell” for Spark is a single platform for advanced analytics encompassing Fast Queries, ML, Streaming, Graph Engines etc instead of multiple point solutions (Giraph, Storm, Mahout etc). It’s a nice theory, but as you note Cloudera isn’t too keen on promoting Shark, and Spark’s Graph Engine has work to do to displace GraphLab

— R pushdown is still nascent. AMPLab just released SparkR, (http://portfortune.wordpress.com/2014/01/17/r-interface-to-apache-spark/) which has generated a lot of interest but is too new to say much about. A startup called Adatao has developed a proprietary R interface that can push down embarrassingly parallel functions.

— The Spark Summit was excellent, well worth attending.
Machine Learning in Hadoop: Part Two | Building The Analytic Enterprise on February 12th, 2014 12:51 pm

[…] commercial support. Recent coverage by this blog here and here; additional coverage here, here, here and […]
Notes and comments, March 17, 2014 | DBMS 2 : DataBase Management System Services on March 17th, 2014 3:09 am

[…] Spark buzz keeps increasing; almost everybody I talk with expects Spark to win big, probably across […]
Niek Sanders on March 17th, 2014 11:59 pm

Spark really is super exciting. I recommend both the intro RDD and the streaming papers; they’re very accessible and make a strong case for why vanilla MR should be tossed (http://spark.incubator.apache.org/research.html).

The unified abstraction for big data batch and stream processing is particularly compelling. Fast/slow paths are often dealt with separately in big data systems and it’s nice to see the two flows start to unify.

The biggest hindrance to Spark right now is its newness. It’s also Scala-first, but exposes well to Java and Python so that’s not a blocker. Lack of detail on how the transformations are implemented is annoying but something that documentation (or reading the source code) can solve.

Give Spark another year or two to mature, and there won’t be any reason to pick vanilla MapReduce for new projects.
Manas on March 20th, 2014 10:06 pm

Hi Curt,
I am going through the articles/documents explaining how to get started with Spark. I have a big data problem that I want to solve using Spark and Spark Streaming.
Being a big data noob (my question will say it for me), I don’t understand what I need to install to make Spark work in a distributed computing environment.
i.e. a) Do I install Spark on each cluster machine?
b) Do I need to install some other HDFS-providing piece on each cluster first before I can use Spark on them?
c) What should one install if they don’t have any existing dependency on Hadoop, HBase, Mesos, Tachyon or other layers that are often talked about?
Thanks
Manas
Matt Brandwein on March 20th, 2014 10:34 pm

Hey Manas,

You can either follow the documentation on the Spark site:
http://spark.incubator.apache.org/docs/latest/

Or, for a more automated experience, you can easily install Spark in Hadoop with Cloudera Manager, e.g.:
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/4.8.1/Cloudera-Manager-Installation-Guide/cmig_spark_installation_standalone.html

Some community resources for you, too:
* Spark Community: http://spark.incubator.apache.org/community.html
* Cloudera Spark Community: http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/bd-p/Spark

If you need Hadoop in the first place, the Cloudera QuickStart VM is a great place to start:
http://go.cloudera.com/vm-download

Hope this helps,

Matt
Manas on March 20th, 2014 10:53 pm

Great help. Thanks Matt.
Juan Ortiz on September 11th, 2014 1:34 pm

How does Apache Drill compare to Spark? Will Drill fade away? Or does it serve a different purpose?
Streaming for Hadoop | DBMS 2 : DataBase Management System Services on October 5th, 2014 4:58 am

[…] is trying to revitalize the Apache Storm project, after Storm lost momentum; indeed, Hortonworks is referring to Storm as a component of […]
