February 28, 2015

Databricks and Spark update

I chatted last night with Ion Stoica, CEO of my client Databricks, for an update on both his company and Spark. Databricks’ actual business is Databricks Cloud, about which I can say:

I do not expect all of the above to remain true as Databricks Cloud matures.

Ion also said that Databricks is over 50 people, and has moved its office from Berkeley to San Francisco. He also offered some Spark numbers, such as:

Please note that certification of a Spark distribution is a free service from Databricks, and amounts to checking that the API works against a test harness. Speaking of certification, Ion basically agrees with my views on ODP, although like many — most? — people he expresses himself more politely than I do.

We talked briefly about several aspects of Spark or related projects. One was DataFrames. Per Databricks:

In Spark, a DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.

I gather this is modeled on Python pandas, and extends an earlier Spark capability for RDDs (Resilient Distributed Datasets) to carry around metadata that was tantamount to a schema.
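
To make the "collection of rows plus something tantamount to a schema" idea concrete, here is a toy sketch in plain Python. It is not Spark's API, and it says nothing about how Spark actually implements DataFrames or their optimizations. `ToyFrame`, `Field`, `select` and `where` are hypothetical names made up for illustration, and an ordinary in-memory list of tuples stands in for an RDD.

```python
from collections import namedtuple

# Hypothetical names throughout: this is a toy, not Spark's API.
Field = namedtuple("Field", ["name", "dtype"])


class ToyFrame:
    """Rows plus a schema of named, typed columns (single machine, in memory)."""

    def __init__(self, rows, schema):
        self.schema = list(schema)
        self.rows = [tuple(r) for r in rows]
        # Validate every row against the schema up front.
        for row in self.rows:
            if len(row) != len(self.schema):
                raise ValueError("row width does not match schema: %r" % (row,))
            for value, field in zip(row, self.schema):
                if not isinstance(value, field.dtype):
                    raise TypeError(
                        "%s expects %s, got %r"
                        % (field.name, field.dtype.__name__, value)
                    )

    def _index(self, name):
        # Resolve a column name to its position in each row tuple.
        for i, field in enumerate(self.schema):
            if field.name == name:
                return i
        raise KeyError(name)

    def select(self, *names):
        # Keep only the named columns, carrying their schema entries along.
        idx = [self._index(n) for n in names]
        return ToyFrame(
            [tuple(row[i] for i in idx) for row in self.rows],
            [self.schema[i] for i in idx],
        )

    def where(self, name, predicate):
        # Keep only rows whose value in the named column passes the predicate.
        i = self._index(name)
        return ToyFrame([row for row in self.rows if predicate(row[i])], self.schema)


# A plain list of tuples stands in for an RDD of records.
records = [("alice", 34), ("bob", 19), ("carol", 52)]
schema = [Field("name", str), Field("age", int)]

people = ToyFrame(records, schema)
adults = people.where("age", lambda age: age >= 21).select("name")
print(adults.rows)  # [('alice',), ('carol',)]
```

The point of the sketch is only that, once column names and types travel with the data, operations can be written against names rather than tuple positions, which is what makes the result feel like a relational table or an R/pandas data frame.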

SparkR is also on the rise, although it has the usual parallel R story to the effect:

So of course is Spark Streaming. And then there are Spark Packages, which are — and I’m speaking loosely here — a kind of user-defined function.

I’ll stop here. However, I have a couple of other Spark-related posts in the research pipeline.


7 Responses to “Databricks and Spark update”

  1. Mark Callaghan on February 28th, 2015 9:56 am

    “The low end of the amount of data Databricks Cloud customers are working with is 100s of gigabytes. This isn’t surprising.”

    Why do you need Spark or Hadoop for what can fit on one host with a lot of RAM and/or a lot of SSD?

  2. David Gruzman on February 28th, 2015 2:27 pm

    You might need hundreds of CPU cores to process these 100s of gigabytes at interactive speed. That is something we cannot pack into a single server at a reasonable price.

  3. Ion Stoica on February 28th, 2015 9:29 pm

    Like David mentioned, using more nodes can help with improving interactivity, as many workloads are CPU and/or I/O bound. This is important since one of the main values of Databricks Cloud is allowing users to interactively query and process the data.

  4. Databricks and Spark update | Analytics Team on February 28th, 2015 11:54 pm

    […] out what DBMS2 has to say about Databricks and […]

  5. Big Analytics Roundup (March 2, 2015) | The Big Analytics Blog on March 2nd, 2015 10:30 am

    […] Monash writes about Databricks and Spark on his DBMS2 […]

  6. Multi-model database managers | DBMS 2 : DataBase Management System Services on August 24th, 2015 4:07 am

    […] since then have been in line with my thoughts. For example, Spark added DataFrames, which promise substantial data model flexibility for Spark use cases, but more mature products […]

  7. Notes on Spark and Databricks — generalities | DBMS 2 : DataBase Management System Services on July 31st, 2016 10:30 am

    […] February, 2015 […]
