February 2, 2014

Spark and Databricks

I’ve heard a lot of buzz recently around Spark. So I caught up with Ion Stoica and Mike Franklin for a call. Let me start by acknowledging some sources of confusion.

The “What is Spark?” question may soon be just as difficult as the ever-popular “What is Hadoop?” That said — and referring back to my original technical post about Spark and also to a discussion of prominent Spark user ClearStory — my try at “What is Spark?” goes something like this:

Except for certain low-latency operations,* anything you can do in Spark can also be done in straight Hadoop; Spark can simply have advantages in performance and programming ease. Spark RDDs (Resilient Distributed Datasets) are immutable at this time, so Spark is not suited for short-request update workloads.

*A new Spark task requires a thread, not a whole Java Virtual Machine.
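To make the immutability point concrete, here is a minimal sketch in plain Python. It is not Spark's API: `ToyRDD` and its methods are hypothetical names invented for illustration, and the real RDD machinery (partitioning, lineage-based recovery, distribution) is left out entirely. It only shows the pattern, in which a transformation returns a new dataset rather than changing an existing one, so "updating" a single record means deriving a whole new dataset.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Callable, Iterator, Tuple


@dataclass(frozen=True)
class ToyRDD:
    """Hypothetical stand-in for an RDD: immutable and lazily evaluated."""
    source: Tuple[int, ...]
    steps: Tuple[Callable[[Iterator], Iterator], ...] = ()

    def map(self, fn):
        # Returns a NEW dataset; the receiver is left untouched.
        return ToyRDD(self.source, self.steps + (lambda it: (fn(x) for x in it),))

    def filter(self, pred):
        # Also returns a new dataset with one more recorded step.
        return ToyRDD(self.source, self.steps + (lambda it: (x for x in it if pred(x)),))

    def collect(self):
        # Nothing runs until a result is requested.
        it = iter(self.source)
        for step in self.steps:
            it = step(it)
        return list(it)


base = ToyRDD((1, 2, 3, 4, 5))
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

# "Updating" the record 3 means deriving a whole new dataset.
updated = base.map(lambda x: 30 if x == 3 else x)

# In-place modification is rejected outright.
try:
    base.source = (9, 9, 9)
    mutated = True
except FrozenInstanceError:
    mutated = False
```

In this toy version, changing one value still walks the whole dataset, which is the intuition behind "not suited for short-request update workloads." A batch transformation over everything is a natural fit for this style; a stream of small point updates is not.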

Everybody agrees that machine learning is a top Spark use case. In particular:

I believe data transformation is a major Spark use case as well.

Spark Streaming is fairly new, but is already getting some adoption. Notes on that start:

Part of that story is a sudden decline in the reputation of Storm, whose troubles seem to include:

Other notes on Spark use cases include:

Spark data management has been enhanced by a project called Tachyon.* The main point of Tachyon is that Spark RDDs can now persist in memory beyond the life of a job; besides offering the RDDs to other Spark jobs, Tachyon also exposes them to Hadoop via an HDFS emulator.

*If there’s ever a Spark/Tachyon management suite, I hope some aspect is named Cherenkov — i.e., the radiation that is measured to detect the passage of tachyons.:)

And finally, some metrics and so on:

Related link

Comments

16 Responses to “Spark and Databricks”

  1. Some stuff I’m thinking about (early 2014) | DBMS 2 : DataBase Management System Services on February 2nd, 2014 1:51 pm

    […] Spark and other memory-centric technology, including streaming. […]

  2. Matt Brandwein on February 2nd, 2014 2:18 pm

    Curt, great post. Re: “I believe data transformation is a major Spark use case as well. Ion gave me that impression, although Cloudera surprisingly did not.” we do, in fact, also expect Spark to be used for core data processing workloads. As the tools/apps community builds atop Spark (e.g. ClearStory) this will become more true. Today the longest pole for the Hadoop ecosystem is interactive analysis, thus the initial interest and focus there.

    While he doesn’t explicitly call out data processing, Mike Olson comments on the future of Spark and MapReduce in this blog post: http://vision.cloudera.com/mapreduce-spark/

  3. Vlad Rodionov on February 3rd, 2014 4:36 pm

    >> If there’s ever a Spark/Tachyon management suite, I hope some aspect is named Cherenkov — i.e., the radiation that is measured to detect the passage of tachyons. :)
    I am impressed with your knowledge level in theoretical physics 🙂

  4. Curt Monash on February 3rd, 2014 4:48 pm

    I started college as a physics major, and ultimately came within a couple courses of finishing it (as a double major with math). The main thing I avoided by dropping the major was electronics lab. 🙂

  5. Thomas W. Dinsmore on February 6th, 2014 8:48 pm

    Some comments:

    — Spark is clearly superior to MR for iterative machine learning, since it can hold data in memory while MR persists after each pass through the data. Some machine learning algorithms are iterative, others are not. The performance delta for logistic regression (which is iterative) is 100X versus MR

    — Mahout is so done that speakers at the Spark Summit didn’t feel a need to stick a fork in it

    — I don’t hear many people asking for high-latency analytics

    — Part of the “sell” for Spark is a single platform for advanced analytics encompassing Fast Queries, ML, Streaming, Graph Engines etc instead of multiple point solutions (Giraph, Storm, Mahout etc). It’s a nice theory, but as you note Cloudera isn’t too keen on promoting Shark, and Spark’s Graph Engine has work to do to displace GraphLab

    — R pushdown is still nascent. AMPLab just released SparkR, (http://portfortune.wordpress.com/2014/01/17/r-interface-to-apache-spark/) which has generated a lot of interest but is too new to say much about. A startup called Adatao has developed a proprietary R interface that can push down embarrassingly parallel functions.

    — The Spark Summit was excellent, well worth attending.

  6. Machine Learning in Hadoop: Part Two | Building The Analytic Enterprise on February 12th, 2014 12:51 pm

    […] commercial support.  Recent coverage by this blog here and here; additional coverage here, here, here and […]

  7. Notes and comments, March 17, 2014 | DBMS 2 : DataBase Management System Services on March 17th, 2014 3:09 am

    […] Spark buzz keeps increasing; almost everybody I talk with expects Spark to win big, probably across […]

  8. Niek Sanders on March 17th, 2014 11:59 pm

    Spark really is super exciting. I recommend both the intro RDD and the streaming papers; they’re very accessible and make a strong case for why vanilla MR should be tossed (http://spark.incubator.apache.org/research.html).

    The unified abstraction for big data batch and stream processing is particularly compelling. Fast/slow paths are often dealt with separately in big data systems and it’s nice to see the two flows start to unify.

    The biggest hindrance to Spark right now is its newness. It’s also Scala-first, but exposes well to Java and Python so that’s not a blocker. Lack of detail on how the transformations are implemented is annoying but something that documentation (or reading the source code) can solve.

    Give Spark another year or two to mature, and there won’t be any reason to pick vanilla MapReduce for new projects.

  9. Manas on March 20th, 2014 10:06 pm

    Hi Curt,
    I am going through the articles/documents explaining how to get started with Spark. I have a big data problem that I want to solve using Spark and Spark Streaming.
    Being a big data noob (my question will say it for me), I don’t understand what I need to install to make Spark work in a distributed computing environment.
    i.e. a) Do I install Spark on each cluster machine?
    b) Do I need to install some other HDFS-providing piece on each cluster machine before I can use Spark on it?
    c) What should one install if they don’t have any existing dependency on Hadoop, HBase, Mesos, Tachyon or the other layers that are often talked about?
    Thanks
    Manas

  10. Matt Brandwein on March 20th, 2014 10:34 pm

    Hey Manas,

    You can either follow the documentation on the Spark site:
    http://spark.incubator.apache.org/docs/latest/

    Or, for a more automated experience, you can easily install Spark in Hadoop with Cloudera Manager, e.g.:
    http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/4.8.1/Cloudera-Manager-Installation-Guide/cmig_spark_installation_standalone.html

    Some community resources for you, too:
    * Spark Community: http://spark.incubator.apache.org/community.html
    * Cloudera Spark Community: http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/bd-p/Spark

    If you need Hadoop in the first place, the Cloudera QuickStart VM is a great place to start:
    http://go.cloudera.com/vm-download

    Hope this helps,

    Matt

  11. Manas on March 20th, 2014 10:53 pm

    Great help. Thanks Matt.

  12. Juan Ortiz on September 11th, 2014 1:34 pm

    How does Apache Drill compare to Spark? Will Drill fade away? Or does it serve a different purpose?

  13. Streaming for Hadoop | DBMS 2 : DataBase Management System Services on October 5th, 2014 4:58 am

    […] is trying to revitalize the Apache Storm project, after Storm lost momentum; indeed, Hortonworks is referring to Storm as a component of […]

  14. nutrition plan on June 7th, 2016 9:23 am

    I do not know if it’s just me or if everybody else is experiencing problems with your blog.

    It appears as if some of the written text within your content is running off the screen. Can somebody else please provide feedback and let me know if this is happening to them as well? This might be a problem with my web browser because I’ve had this happen previously.
    Cheers

  15. 실시간 티비 on March 30th, 2022 8:42 pm

    실시간 티비

    RDF and graphs | DBMS 2 : DataBase Management System Services

  16. learn on March 27th, 2023 1:18 pm

    learn

    RDF and graphs | DBMS 2 : DataBase Management System Services
