July 31, 2016

Notes on Spark and Databricks — generalities

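I visited Databricks in early July to chat with Ion Stoica and Reynold Xin. Spark also comes up in a large fraction of the conversations I have. So let’s do some catch-up on Databricks and Spark. In a nutshell: Spark is the replacement for Hadoop MapReduce; Spark is becoming the default platform for machine learning; SparkSQL (née Shark) is puttering along; Databricks reports good success in its core business of cloud-based machine learning support; Spark Streaming’s long-term success is not assured; and Databricks is not keeping a tight grip on Spark leadership.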

I shall explain below. I am also posting separately about Spark evolution, especially Spark 2.0. That post will also say a bit about Databricks’ proprietary/closed-source technology.

Spark is the replacement for Hadoop MapReduce.

This point is so obvious that I don’t know what to say in its support. The trend is happening, as originally decreed by Cloudera (and me), among others. People are rightly fed up with the limitations of MapReduce, and — niches perhaps aside — there are no serious alternatives other than Spark.

The greatest use for Spark seems to be the same as the canonical first use for MapReduce: data transformation. Also in line with the Spark/MapReduce analogy: 

And so it seems likely that, at least for as long as Spark is growing rapidly, data transformation will appear to be the biggest Spark use case.

Spark is becoming the default platform for machine learning.

Largely, this is a corollary of:

To do machine learning you need two things in your software:

And thus I have conversations like:

SparkSQL (née Shark) is puttering along.

SparkSQL is pretty much following the Hive trajectory.

Databricks reports good success in its core business of cloud-based machine learning support.

Databricks, to an even greater extent than I previously realized, is focused on its cloud business, for which there are well over 200 paying customers. Notes on that include:

Databricks’ core marketing concept seems to be “just-in-time data platform”. I don’t know why they picked that, as opposed to something that emphasizes Spark’s flexibility and functionality.

Spark Streaming’s long-term success is not assured.

To a first approximation, things look good for Spark Streaming.

But I’m also hearing rumbles and grumbles about Spark Streaming. What’s more, we know that Spark Streaming wasn’t a core part of Spark’s design; the use case just happened to emerge. Demanding streaming use cases typically involve a lot of short-request inserts (or updates/upserts/whatever). And if you were designing a system to handle those … would it really be based on Spark?

Databricks is not keeping a tight grip on Spark leadership.

For starters:

At the moment, Databricks is pretty clearly the general leader of Spark. Indeed:

But overall, Databricks doesn’t seem to care much about keeping Spark leadership. Its marketing efforts in that respect are minimal. Word-of-mouth buzz paints a similar picture. My direct relationship with the company gives the same impression. Oh, I’m sure Databricks would like to remain the Spark leader. But it doesn’t seem to devote much energy toward keeping the role.

Related links

Starting with my introduction to Spark, previous overview posts include those in:

Comments

6 Responses to “Notes on Spark and Databricks — generalities”

  1. Notes on Spark and Databricks — technology | DBMS 2 : DataBase Management System Services on July 31st, 2016 10:30 am

    […] my recent visit to Databricks, I of course talked a lot about technology — largely with Reynold Xin, but a bit with Ion […]

  2. Ranko Mosic on August 1st, 2016 6:20 am

    Re: default platform for ML/DL: Google TensorFlow is making strong inroads in this space. TF was recently open-sourced in its distributed incarnation (the single-node version was open-sourced last December). TF can run on heterogeneous hardware (CPU, GPU, mobile). This might be an event of great importance, similar to the release of the MapReduce paper by Google, with the difference that Google is actually releasing the code this time.

  3. Naveen on August 4th, 2016 1:27 pm

    Ranko – an issue with TensorFlow at scale is wrangling the data into the right structure, a task for which, as Curt mentions above, Spark is the only game in town. I expect to see some Spark tools evolve towards feeding data into TensorFlow.

  4. Notes from a long trip, July 19, 2016 | DBMS 2 : DataBase Management System Services on August 7th, 2016 1:27 pm

    […] Spark and Databricks are both prospering. […]

  5. More about Databricks and Spark | DBMS 2 : DataBase Management System Services on August 21st, 2016 4:36 pm

    […] CEO Ali Ghodsi checked in because he disagreed with part of my recent post about Databricks. Ali’s take on Databricks’ position in the Spark world […]

  6. Chris on August 29th, 2016 1:24 pm

    Spark is not becoming the default machine-learning platform. That’s wishful thinking.

    The reason is that Databricks has not invested enough in MLlib over the years, so it doesn’t have a defensible technological edge in ML. Instead, the space is highly fragmented. Apache Mahout is just one of many competitors that come to mind for traditional ML on the JVM.

    One of the reasons Spark probably won’t win the race here is that it’s not computationally efficient for large linear algebra operations. What it does best is fast ETL, which makes it an important part of an ML pipeline but not a full solution.

    Beyond traditional ML like random forests, if you look at deep learning, the default platform for the JVM is Deeplearning4j. DL4J has a sophisticated Spark integration, and it’s certified on CDH.

    http://deeplearning4j.org/spark

    DL4J integrates with Spark as a data access layer, using it to orchestrate multiple host threads. And Spark does OK on that task. DL4J also makes Spark run fast on multi-GPUs. Performance is equal to Caffe on non-trivial image processing jobs.

    http://deeplearning4j.org/gpu

    For the computations, you need different tools. We use ND4J, a Java/C++ scientific computing library that uses JavaCPP to avoid the overhead of the JNI.

    ND4J is n-dimensional arrays for Java. We basically ported Numpy to the JVM for the purpose of scientific computing, large matrix manipulations, etc. It also comes with a Scala API, ND4S.

    http://nd4j.org/
    https://github.com/deeplearning4j/libnd4j
    https://github.com/bytedeco/javacpp
    https://github.com/deeplearning4j/nd4s
