Spark is on the rise, to an even greater degree than I thought last month.
- Numerous clients and other companies I talk with have adopted Spark, plan to adopt Spark, or at least think it’s likely they will. In particular:
  - A number of analytic-stack companies are joining ClearStory in using Spark. Most of the specifics are confidential, but I hope some will be announced soon.
  - MapR has joined Cloudera in supporting Spark, and indeed, unlike Cloudera, is supporting the full Spark stack.
- Mike Olson of Cloudera is on record as predicting that Spark will be the replacement for Hadoop MapReduce. Just about everybody seems to agree, except perhaps for Hortonworks folks betting on the more limited and less mature Tez. Spark’s biggest technical advantages as a general data processing engine are probably:
  - The Directed Acyclic Graph processing model. (Any serious MapReduce-replacement contender will probably echo that aspect.)
  - A rich set of programming primitives in connection with that model.
  - Support for highly iterative processing, of the kind found in machine learning.
  - Flexible in-memory data structures, namely the RDDs (Resilient Distributed Datasets).
  - A clever approach to fault-tolerance.
- Spark is a major contender in streaming.
- There’s some cool machine-learning innovation using Spark.
- Spark 1.0 will drop by mid-May, Apache voters willin’ an’ the creek don’ rise. Publicity will likely ensue, with strong evidence of industry support.*
*Yes, my fingerprints are showing again.
The most official description of what Spark now contains is probably the “Spark ecosystem” diagram from Databricks. However, at the time of this writing it is slightly out of date, according to an email from Databricks CEO Ion Stoica (quoted with permission):
… but if I were to redraw it, SparkSQL will replace Shark, and Shark will eventually become a thin layer above SparkSQL and below BlinkDB.
With this change, all the modules on top of Spark (i.e., SparkStreaming, SparkSQL, GraphX, and MLlib) are part of the Spark distribution. You can think of these modules as libraries that come with Spark.
In an unfortunate non-development, Tachyon is not (yet?) part of Spark, and so it is hard for a Spark job’s data to be shared with other jobs (Spark or otherwise) or processes. That said:
- The tight integration of data structures and processes gives similar performance benefits to those of in-process vs. out-of-process in-database analytic functions. (It also of course raises similar stability concerns, but those seem less important in the case of Spark than of a true DBMS.)
- From a Hadoop vendor’s standpoint, Tachyon’s benefit of not requiring HDFS (Hadoop Distributed File System) isn’t important, and Tachyon somewhat conflicts with a newish effort called HDFS Caching.
A couple of Spark machine learning stories are very cool, in that they involve intra-day retraining of models. The better-known one is Yahoo’s: a prototype, built in 120 lines of code, that trains a new recommendation model for each candidate top-page news story. When I challenged that anecdote, Ion told me about his own former company Conviva, which retrains models every minute to decide which particular source of streaming video each client system will be connected to.
I am generally skeptical of immature SQL efforts, and SparkSQL is no exception. That said, it seems to be going in sensible directions, which should be welcome to those folks who used or were planning to use Shark anyway.
- SparkSQL actually has its own optimizer, rather than using the inappropriate Hive one. As with many new optimizers, it’s starting out rule-based, but is planned to become cost-based down the road.
- SparkSQL can run queries against data that’s either inside Spark or outside-but-accessible.
- SparkSQL can be accessed via Python and other APIs.
- Spark works with the Hive metastore, née HCatalog.
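For readers unfamiliar with the rule-based vs. cost-based distinction mentioned above, here is a minimal sketch of what "rule-based" means. It is a toy in plain Python and makes no claim about how SparkSQL's optimizer is actually built; the plan node classes (`Scan`, `Filter`, `Project`), the two rewrite rules, and the `users` table are all hypothetical. The optimizer applies fixed rewrite rules to a logical plan until nothing changes, without consulting any statistics about the data.

```python
from dataclasses import dataclass, replace
from typing import Tuple, Union


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    conditions: Tuple[str, ...]   # treated as a conjunction (AND)
    child: "Plan"


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]      # plain column selection, no computed expressions
    child: "Plan"


Plan = Union[Scan, Filter, Project]


def merge_filters(plan):
    """Rule: Filter(a, Filter(b, x)) -> Filter(b AND a, x)."""
    if isinstance(plan, Filter) and isinstance(plan.child, Filter):
        inner = plan.child
        return Filter(inner.conditions + plan.conditions, inner.child)
    return plan


def push_filter_below_project(plan):
    """Rule: Filter(c, Project(cols, x)) -> Project(cols, Filter(c, x)).

    Safe in this toy because Project only selects existing columns.
    """
    if isinstance(plan, Filter) and isinstance(plan.child, Project):
        proj = plan.child
        return Project(proj.columns, Filter(plan.conditions, proj.child))
    return plan


RULES = (merge_filters, push_filter_below_project)


def rewrite_once(plan):
    # Rewrite the subtree first, then try every rule at this node.
    if not isinstance(plan, Scan):
        plan = replace(plan, child=rewrite_once(plan.child))
    for rule in RULES:
        plan = rule(plan)
    return plan


def optimize(plan, max_passes=10):
    # Apply the rule set repeatedly until the plan stops changing.
    for _ in range(max_passes):
        new_plan = rewrite_once(plan)
        if new_plan == plan:
            break
        plan = new_plan
    return plan


query = Filter(
    ("age > 30",),
    Project(("name", "age"), Filter(("country = 'US'",), Scan("users"))),
)
optimized = optimize(query)
```

Here the two filters end up merged directly above the scan, beneath the projection. A cost-based optimizer would go further by estimating the expense of alternative plans (for example, from table sizes or selectivity) and choosing among them, which this sketch does not attempt.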
And finally, there’s no public news as to what Databricks’ own business is. I think that’s a bit silly, but in fairness:
- The Spark 1.0 launch will consume every bit of marketing bandwidth they have.
- They don’t yet want to commit to a delivery date of their first offering.