Databricks, Spark and BDAS
Discussion of BDAS (Berkeley Data Analytics Systems), especially Spark and related projects, and also of Databricks, the company commercializing Spark.
I’ve heard a lot of buzz recently around Spark. So I caught up with Ion Stoica and Mike Franklin for a call. Let me start by acknowledging some sources of confusion.
- Spark is very new. All Spark adoption is recent.
- Databricks was founded to commercialize Spark. It is very much in stealth mode …
- … except insofar as Databricks folks are going out and trying to drum up Spark adoption.
- Ion Stoica is running Databricks, but you couldn’t tell that from his UC Berkeley bio page. Edit: After I posted this, Ion’s bio was quickly updated.
- Spark creator and Databricks CTO Matei Zaharia is an MIT professor, but actually went on leave there before he ever showed up.
- Cloudera is perhaps Spark’s most visible supporter. But Cloudera’s views of Spark’s role in the world is different from the Spark team’s.
The “What is Spark?” question may soon be just as difficult as the ever-popular “What is Hadoop?” That said — and referring back to my original technical post about Spark and also to a discussion of prominent Spark user ClearStory — my try at “What is Spark?” goes something like this:
- Spark is a distributed execution engine for analytic processes …
- … which works well with Hadoop.
- Spark is distinguished by a flexible in-memory data model …
- … and farms out persistence to HDFS (Hadoop Distributed File System) or other existing data stores.
- Intended analytic use cases for Spark include:
- SQL data manipulation.
- ETL-like data manipulation.
- Streaming-like data manipulation.
- Machine learning.
- Graph analytics.
I first wrote about in-memory data management a decade ago. But I long declined to use that term — because there’s almost always a persistence story outside of RAM — and coined “memory-centric” as an alternative. Then I relented 1 1/2 years ago, and defined in-memory DBMS as
DBMS designed under the assumption that substantially all database operations will be performed in RAM (Random Access Memory)
By way of contrast:
Hybrid memory-centric DBMS is our term for a DBMS that has two modes:
- Querying and updating (or loading into) persistent storage.
These definitions, while a bit rough, seem to fit most cases. One awkward exception is Aerospike, which assumes semiconductor memory, but is happy to persist onto flash (just not spinning disk). Another is Kognitio, which is definitely lying when it claims its product was in-memory all along, but may or may not have redesigned its technology over the decades to have become more purely in-memory. (But if they have, what happened to all the previous disk-based users??)
Two other sources of confusion are:
- The broad variety of memory-centric data management approaches.
- The over-enthusiastic marketing of SAP HANA.
With all that said, here’s a little update on in-memory data management and related subjects.
- I maintain my opinion that traditional databases will eventually wind up in RAM.
- At conventional large enterprises — as opposed to for example pure internet companies — production deployments of HANA are probably comparable in number and investment to production deployments of Hadoop. (I’m sorry, but much of my supporting information for that is confidential.)
- Cloudera is emphatically backing Spark. And a key aspect of Spark is that, unlike most of Hadoop, it’s memory-centric.
- It has become common for disk-based DBMS to persist data through a “log-structured” architecture. That’s a whole lot like what you do for persistence in a fundamentally in-memory system.
- I’m also sensing increasing comfort with the strategy of committing writes as soon as they’ve been acknowledged by two or more nodes in RAM.
- I’ve never heard a story about an in-memory DBMS actually losing data. It’s surely happened, but evidently not often.
|Categories: Aerospike, Cloudera, Clustering, Databricks, Spark and BDAS, Hadoop, In-memory DBMS, Kognitio, Market share and customer counts, Memory-centric data management, SAP AG, Theory and architecture||13 Comments|
Glassbeam checked in recently, and they turn out to exemplify quite a few of the themes I’ve been writing about. For starters:
- Glassbeam has an analytic technology stack focused on poly-structured machine-generated data.
- Glassbeam partially organizes that data into event series …
- … in a schema that is modified as needed.
Glassbeam basics include:
- Founded in 2009.
- Based in Santa Clara. Back-end engineering in Bangalore.
- $6 million in angel money; no other VC.
- High single-digit customer count, …
- … plus another high single-digit number of end customers for an OEM offering a limited version of their product.
All Glassbeam customers except one are SaaS/cloud (Software as a Service), and even that one was only offered a subscription (as oppose to perpetual license) price.
So what does Glassbeam’s technology do? Glassbeam says it is focused on “machine data analytics,” specifically for the “Internet of Things”, which it distinguishes from IT logs.* Specifically, Glassbeam sells to manufacturers of complex devices — IT (most of its sales so far ), medical, automotive (aspirational to date), etc. — and helps them analyze “phone home” data, for both support/customer service and marketing kinds of use cases. As of a recent release, the Glassbeam stack can: Read more
ClearStory Data is:
- One of the two start-ups I’m most closely engaged with.
- Run by a CEO for whom I have great regard, but who does get rather annoying about secrecy.
- On the verge, finally, of fully destealthing.
I think I can do an interesting post about ClearStory while tap-dancing around the still-secret stuff, so let’s dive in.
- Has developed a full-stack business intelligence technology — which will however be given a snazzier name than “BI” — that is focused on incorporating a broad variety of third-party information, usually along with some of the customer’s own data. Thus, ClearStory …
- … pushes Variety and Variability to extremes, more so than it stresses Volume and Velocity. But it does want to be used at interactive/memory-centric speeds.
- Has put a lot of effort into user interface, but in ways that fit my theory that UI is more about navigation than actual display.
- Has much of its technical differentiation in the area of data mustering …
- … and much of the rest in DBMS-like engineering.
- Is a flagship user of Spark.
- Also relies on Storm, HDFS (Hadoop Distributed File System) and various lesser open source projects (e.g. the ubiquitous Zookeeper).
- Is to a large extent written in Scala.
- Is at this time strictly a multi-tenant SaaS (Software as a Service) offering, except insofar as there’s an on-premises agent to help feed customers’ own data into the core ClearStory cloud service.
To a first approximation, ClearStory ingests data in a system built on Storm (code name: Stormy), dumps it into HDFS, and then operates on it in a system built on Spark (code name: Sparky). Along the way there’s a lot of interaction with another big part of the system, a metadata catalog with no code name I know of. Or as I keep it straight:
- ClearStory’s end-user UI talks mainly to Sparky, and also to the metadata store.
- ClearStory’s administrative UI talks mainly to Stormy, and also to the metadata store.
Two subjects in one post, because they were too hard to separate from each other
Any sufficiently complex software is developed in modules and subsystems. DBMS are no exception; the core trinity of parser, optimizer/planner, and execution engine merely starts the discussion. But increasingly, database technology is layered in a more fundamental way as well, to the extent that different parts of what would seem to be an integrated DBMS can sometimes be developed by separate vendors.
Major examples of this trend — where by “major” I mean “spanning a lot of different vendors or projects” — include:
- The object/relational, aka universal, extensibility features developed in the 1990s for Oracle, DB2, Informix, Illustra, and Postgres. The most successful extensions probably have been:
- Geospatial indexing via ESRI.
- Full-text indexing, notwithstanding questionable features and performance.
- MySQL storage engines.
- MPP (Massively Parallel Processing) analytic RDBMS relying on single-node PostgreSQL, Ingres, and/or Microsoft SQL Server — e.g. Greenplum (especially early on), Aster (ditto), DATAllegro, DATAllegro’s offspring Microsoft PDW (Parallel Data Warehouse), or Hadapt.
- Splits in which a DBMS has serious processing both in a “database” layer and in a predicate-pushdown “storage” layer — most famously Oracle Exadata, but also MarkLogic, InfiniDB, and others.
- SQL-on-HDFS — Hive, Impala, Stinger, Shark and so on (including Hadapt).
Other examples on my mind include:
- Data manipulation APIs being added to key-value stores such as Couchbase and Aerospike.
- TokuMX, the Tokutek/MongoDB hybrid I just blogged about.
- NuoDB’s willing reliance on third-party key-value stores (or HDFS in the role of one).
- FoundationDB’s strategy, and specifically its acquisition of Akiban.
And there are several others I hope to blog about soon, e.g. current-day PostgreSQL.
In an overlapping trend, DBMS increasingly have multiple data manipulation APIs. Examples include: Read more
When we scheduled a call to talk about Sentry, Cloudera’s Charles Zedlewski and I found time to discuss other stuff as well. One interesting part of our discussion was around the processing “frameworks” Cloudera sees as most important.
- The four biggies are:
- MapReduce. Duh.
- SQL, specifically Impala. This is as opposed to the uneasy Hive/MapReduce layering.
- “Math” , which seems to mainly be through partnerships with SAS and Revolution Analytics. I don’t know a lot about how these work, but I presume they bypass MapReduce, in which case I could imagine them greatly outperforming Mahout.
- Stream processing (Storm) is next in line.
- Graph — e.g. Giraph — rises to at least the proof-of-concept level. Again, the hope would be that this well outperforms graph-on-MapReduce.
- Charles is also seeing at least POC interest in Spark.
- But MPI (Message Passing Interface) on Hadoop isn’t going anywhere fast, except to the extent it’s baked into SAS or other “math” frameworks. Generic MPI use cases evidently turn out to be a bad fit for Hadoop, due to factors such as:
- Low data volumes.
- Latencies in various parts of the system
HBase was artificially omitted from this “frameworks” discussion because Cloudera sees it as a little bit more of a “storage” system than a processing one.
Another good subject was offloading work to Hadoop, in a couple different senses of “offload”: Read more
|Categories: Cloudera, Complex event processing (CEP), Databricks, Spark and BDAS, Endeca, Hadoop, HP and Neoview, MapReduce, Predictive modeling and advanced analytics, RDF and graphs, Revolution Analytics, SAS Institute, Teradata||22 Comments|
Hadoop 2.0/YARN is the first big step in evolving Hadoop beyond a strict Map/Reduce paradigm, in that it at least allows for the possibility of non- or beyond-MapReduce processing engines. While YARN didn’t meet its target of general availability around year-end 2012, Arun Murthy of Hortonworks told me recently that:
- Yahoo is a big YARN user.
- There are other — paying — YARN users.
- YARN general availability is now targeted for well before the end of 2013.
Arun further told me about Tez, the next-generation Hadoop processing engine he’s working on, which he also discussed in a recent blog post:
With the emergence of Apache Hadoop YARN as the basis of next generation data-processing architectures, there is a strong need for an application which can execute a complex DAG [Directed Acyclic Graph] of tasks which can then be shared by Apache Pig, Apache Hive, Cascading and others. The constrained DAG expressible in MapReduce (one set of maps followed by one set of reduces) often results in multiple MapReduce jobs which harm latency for short queries (overhead of launching multiple jobs) and throughput for large-scale queries (too much overhead for materializing intermediate job outputs to the filesystem). With Tez, we introduce a more expressive DAG of tasks, within a single application or job, that is better aligned with the required processing task – thus, for e.g., any given SQL query can be expressed as a single job using Tez.
This is similar to the approach of BDAS Spark:
Rather than being restricted to Maps and Reduces, Spark has more numerous primitive operations, including map, reduce, sample, join, and group-by. You can do these more or less in any order.
although Tez won’t match Spark’s richer list of primitive operations.
More specifically, there will be six primitive Tez operations:
- HDFS (Hadoop Distributed File System) input and output.
- Sorting on input and output (I’m not sure why that’s two operations rather than one).
- Shuffling of input and output (ditto).
A Map step would compound HDFS input, output sorting, and output shuffling; a Reduce step compounds — you guessed it! — input sorting, input shuffling, and HDFS output.
I can’t think of much in the way of algorithms that would be logically impossible in MapReduce yet possible in Tez. Rather, the main point of Tez seems to be performance, performance consistency, response-time consistency, and all that good stuff. Specific advantages that Arun and I talked about included:
- The requirement for materializing (onto disk) intermediate results that you don’t want to is gone. (Yay!)
- Hadoop jobs will step on each other’s toes less. Instead of Maps and Reduces from unrelated jobs getting interleaved, all the operations from a single job will by default be executed in one chunk. (Even so, I see no reason to expect early releases of Tez to do a great job on highly concurrent mixed workload management.)
- Added granularity brings opportunities for additional performance enhancements, for example in the area of sorting. (Arun loves sorts.)
|Categories: Databricks, Spark and BDAS, Hadoop, Hortonworks, MapReduce, Workload management, Yahoo||14 Comments|
Spark and Shark are interesting alternatives to MapReduce and Hive. At a high level:
- Rather than persisting data to disk after every step, as MapReduce does, Spark instead writes to something called RDDs (Resilient Distributed Datasets), which can live in memory.
- Rather than being restricted to maps and reduces, Spark has more numerous primitive operations, including map, reduce, sample, join, and group-by. You can do these more or less in any order. All the primitives are parallel with respect to the RDDs.
- Shark is a lot like Hive, only rewritten (in significant parts) and running over Spark.
- There’s an approach to launching tasks quickly — ~5 milliseconds or so — that I unfortunately didn’t grasp.
The key concept here seems to be the RDD. Any one RDD:
- Is a collection of Java objects, which should have the same or similar structure.
- Can be partitioned/distributed and shuffled/redistributed across the cluster.
- Doesn’t have to be entirely in memory at once.
Otherwise, there’s a lot of flexibility; an RDD can be a set of tuples, a collection of XML documents, or whatever other reasonable kind of dataset you want. And I gather that:
- At the moment, RDDs expire at the end of a job.
- This restriction will be lifted in a future release.
|Categories: Data models and architecture, Databricks, Spark and BDAS, Hadoop, MapReduce, Memory-centric data management, Open source, Parallelization, SQL/Hadoop integration||10 Comments|
UC Berkeley’s AMPLab is working on a software stack that:
- Is meant (among other goals) to improve upon Hadoop …
- … but also to interoperate with it, and which in fact …
- … uses significant parts of Hadoop.
- Seems to have the overall name BDAS (Berkeley Data Analytics System).
The whole thing has $30 million in projected funding (half government, half industry) and a 6-year plan (which they’re 2 years into).
Specific projects of note in all that include:
- Mesos, a cluster manager. I don’t know much about Mesos, but it seems to be in production use, most notably at Twitter supporting Storm.
- Spark, a replacement for MapReduce and the associated execution stack.
- Shark, a replacement for Hive.
|Categories: ClearStory Data, Databricks, Spark and BDAS, Hadoop, MapReduce, Parallelization, Specific users, SQL/Hadoop integration||10 Comments|
A consensus has evolved that:
- Columnar compression (i.e., value-based compression) compresses better than block-level compression (i.e., compression of bit strings).
- Columnar compression can be done pretty well in row stores.
Still somewhat controversial is the claim that:
- Columnar compression can be done even better in column stores than in row-based systems.
A strong plausibility argument for the latter point is that new in-memory analytic data stores tend to be columnar — think HANA or Platfora; compression is commonly cited as a big reason for the choice. (Another reason is that I/O bandwidth matters even when the I/O is from RAM, and there are further reasons yet.)
One group that made the in-memory columnar choice is the Spark/Shark guys at UC Berkeley’s AMP Lab. So when I talked with them Thursday (more on that another time, but it sounds like cool stuff), I took some time to ask why columnar stores are better at compression. In essence, they gave two reasons — simplicity, and speed of decompression.
In each case, the main supporting argument seemed to be that finding the values in a column is easier when they’re all together in a column store. Read more
|Categories: Columnar database management, Database compression, Databricks, Spark and BDAS, In-memory DBMS, Netezza||10 Comments|