Databricks, Spark and BDAS
Discussion of BDAS (Berkeley Data Analytics Systems), especially Spark and related projects, and also of Databricks, the company commercializing Spark.
This is part of a three-post series on Kudu, a new data storage system from Cloudera.
- Part 1 is an overview of Kudu technology.
- Part 2 is a lengthy dive into how Kudu writes and reads data.
- Part 3 (this post) is a brief speculation as to Kudu’s eventual market significance.
Combined with Impala, Kudu is (among other things) an attempt to build a no-apologies analytic DBMS (DataBase Management System) into Hadoop. My reactions to that start:
- It’s plausible; just not soon. What I mean by that is:
- Success will, at best, be years away. Please keep that in mind as you read this otherwise optimistic post.
- Nothing jumps out at me to say “This will never work!”
- Unlike when it introduced Impala — or when I used to argue with Jeff Hammerbacher pre-Impala — this time Cloudera seems to have reasonable expectations as to how hard the project is.
- There’s huge opportunity if it works.
- The analytic RDBMS vendors are beatable. Teradata has a great track record of keeping its product state-of-the-art, but it likes high prices. Most other strong analytic RDBMS products were sold to (or originated by) behemoth companies that seem confused about how to proceed.
- RDBMS-first analytic platforms didn’t do as well as I hoped. That leaves a big gap for Hadoop.
I’ll expand on that last point. Analytics is no longer just about fast queries on raw or simply-aggregated data. Data transformation is getting ever more complex — that’s true in general, and it’s specifically true in the case of transformations that need to happen in human real time. Predictive models now often get rescored on every click. Sometimes, they even get retrained at short intervals. And while data reduction in the sense of “event extraction from high-volume streams” isn’t that a big deal yet in commercial apps featuring machine-generated data — if growth trends continue as much of us expect, it’s only a matter of time before that changes.
Of course, this is all a bullish argument for Spark (or Flink, if I’m wrong to dismiss its chances as a Spark competitor). But it also all requires strong low-latency analytic data underpinnings, and I suspect that several kinds of data subsystem will prosper. I expect Kudu-supported Hadoop/Spark to be a strong contender for that role, along with the best of the old-school analytic RDBMS, Tachyon-supported Spark, one or more contenders from the Hana/MemSQL crowd (i.e., memory-centric RDBMS that purport to be good at analytics and transactions alike), and of course also whatever Cloudera’s strongest competitor(s) choose to back.
This is part of a three-post series on Kudu, a new data storage system from Cloudera.
- Part 1 (this post) is an overview of Kudu technology.
- Part 2 is a lengthy dive into how Kudu writes and reads data.
- Part 3 is a brief speculation as to Kudu’s eventual market significance.
Cloudera is introducing a new open source project, Kudu,* which from Cloudera’s standpoint is meant to eventually become the single best underpinning for analytics on the Hadoop stack. I’ve spent multiple hours discussing Kudu with Cloudera, mainly with Todd Lipcon. Any errors are of course entirely mine.
*Like the impala, the kudu is a kind of antelope. I knew that, because I enjoy word games. What I didn’t know — and which is germane to the naming choice — is that the kudu has stripes.
- Kudu is an alternative to HDFS (Hadoop Distributed File System), or to HBase.
- Kudu is meant to be the underpinning for Impala, Spark and other analytic frameworks or engines.
- Kudu is not meant for OLTP (OnLine Transaction Processing), at least in any foreseeable release. For example:
- Kudu doesn’t support multi-row transactions.
- There are no active efforts to front-end Kudu with an engine that is fast at single-row queries.
- Kudu is rather columnar, except for transitory in-memory stores.
- Kudu’s core design points are that it should:
- Accept data very quickly.
- Immediately make that data available for analytics.
- More specifically, Kudu is meant to accept, along with slower forms of input:
- Lots of fast random writes, e.g. of web interactions.
- Streams, viewed as a succession of inserts.
- Updates and inserts alike.
- The core “real-time” use cases for which Kudu is designed are, unsurprisingly:
- Low-latency business intelligence.
- Predictive model scoring.
- Kudu is designed to work fine with spinning disk, and indeed has been tested to date mainly on disk-only nodes. Even so, Kudu’s architecture is optimized for the assumption that there will be at least some flash on the node.
- Kudu is designed primarily to support relational/SQL processing. However, Kudu also has a nested-data roadmap, which of course starts with supporting the analogous capabilities in Impala.
|Categories: Business intelligence, Cloudera, Columnar database management, Database compression, Databricks, Spark and BDAS, Hadoop, HBase, Predictive modeling and advanced analytics, Solid-state memory, SQL/Hadoop integration||2 Comments|
MongoDB isn’t the only company I reached out to recently for an update. Another is DataStax. I chatted mainly with Patrick McFadin, somebody with whom I’ve had strong consulting relationships at a user and vendor both. But Rachel Pedreschi contributed the marvelous phrase “twinkling dashboard”.
It seems fair to say that in most cases:
- Cassandra is adopted for operational applications, specifically ones with requirements for extreme uptime and/or extreme write speed. (Of course, it should also be the case that NoSQL data structures are a good fit.)
- Spark, including SparkSQL, and Solr are seen primarily as ways to navigate or analyze the resulting data.
Those generalities, in my opinion, make good technical sense. Even so, there are some edge cases or counterexamples, such as:
- DataStax trumpets British Gas‘ plans collecting a lot of sensor data and immediately offering it up for analysis.*
- Safeway uses Cassandra for a mobile part of its loyalty program, scoring customers and pushing coupons at them.
- A large title insurance company uses Cassandra-plus-Solr to manage a whole lot of documents.
*And so a gas company is doing lightweight analysis on boiler temperatures, which it regards as hot data.
While most of the specifics are different, I’d say similar things about MongoDB, Cassandra, or any other NoSQL DBMS that comes to mind: Read more
|Categories: Business intelligence, Cassandra, Databricks, Spark and BDAS, DataStax, NoSQL, Open source, Petabyte-scale data management, Predictive modeling and advanced analytics, Specific users, Text||6 Comments|
- Multi-model database management has been around for decades. Marketers who say otherwise are being ridiculous.
- Thus, “multi-model”-centric marketing is the last refuge of the incompetent. Vendors who say “We have a great DBMS, and by the way it’s multi-model (now/too)” are being smart. Vendors who say “You need a multi-model DBMS, and that’s the reason you should buy from us” are being pathetic.
- Multi-logical-model data management and multi-latency-assumption data management are greatly intertwined.
Before supporting my claims directly, let me note that this is one of those posts that grew out of a Twitter conversation. The first round went:
Merv Adrian: 2 kinds of multimodel from DBMS vendors: multi-model DBMSs and multimodel portfolios. The latter create more complexity, not less.
Me: “Owned by the same vendor” does not imply “well integrated”. Indeed, not a single example is coming to mind.
Merv: We are clearly in violent agreement on that one.
Around the same time I suggested that Intersystems Cache’ was the last significant object-oriented DBMS, only to get the pushback that they were “multi-model” as well. That led to some reasonable-sounding justification — although the buzzwords of course aren’t from me — namely: Read more
|Categories: Complex event processing (CEP), Data models and architecture, Database diversity, Databricks, Spark and BDAS, Intersystems and Cache', MOLAP, Object||3 Comments|
A lot of what I hear and talk about boils down to “data is a mess”. Below is a very partial list of examples.
To a first approximation, one would expect operational data to be rather clean. After all, it drives and/or records business transactions. So if something goes awry, the result can be lost money, disappointed customers, or worse, and those are outcomes to be strenuously avoided. Up to a point, that’s indeed true, at least at businesses large enough to be properly automated. (Unlike, for example — — mine.)
Even so, operational data has some canonical problems. First, it could be inaccurate; somebody can just misspell or otherwise botch an entry. Further, there are multiple ways data can be unreachable, typically because it’s:
- Inconsistent, in which case humans might not know how to look it up and database JOINs might fail.
- Unintegrated, in which case one application might not be able to use data that another happily maintains. (This is the classic data silo problem.)
Inconsistency can take multiple forms, including: Read more
Let’s start with some terminology biases:
- I dislike the term “big data” but like the Vs that define it — Volume, Velocity, Variety and Variability.
- Though I think it’s silly, I understand why BI innovators flee from the term “business intelligence” (they’re afraid of not sounding new).
So when my clients at Zoomdata told me that they’re in the business of providing “the fastest visual analytics for big data”, I understood their choice, but rolled my eyes anyway. And then I immediately started to check how their strategy actually plays against the “big data” Vs.
It turns out that:
- Zoomdata does its processing server-side, which allows for load-balancing and scale-out. Scale-out and claims of great query speed are relevant when data is of high volume.
- Zoomdata depends heavily on Spark.
- Zoomdata’s UI assumes data can be a mix of historical and streaming, and that if looking at streaming data you might want to also check history. This addresses velocity.
- Zoomdata assumes data can be in a variety of data stores, including:
- Relational (operational RDBMS, analytic RDBMS, or SQL-on-Hadoop).
- Files (generic HDFS — Hadoop Distributed File System or S3).*
- NoSQL (MongoDB and HBase were mentioned).
- Search (Elasticsearch was mentioned among others).
- Zoomdata also tries to detect data variability.
- Zoomdata is OEM/embedding-friendly.
*The HDFS/S3 aspect seems to be a major part of Zoomdata’s current story.
Core aspects of Zoomdata’s technical strategy include: Read more
Occasionally I talk with an astute reporter — there are still a few left — and get led toward angles I hadn’t considered before, or at least hadn’t written up. A blog post may then ensue. This is one such post.
There is a group of questions going around that includes:
- Is Hadoop overhyped?
- Has Hadoop adoption stalled?
- Is Hadoop adoption being delayed by skills shortages?
- What is Hadoop really good for anyway?
- Which adoption curves for previous technologies are the best analogies for Hadoop?
To a first approximation, my responses are: Read more
|Categories: Application areas, Data warehousing, Databricks, Spark and BDAS, EAI, EII, ETL, ELT, ETLT, Hadoop, Hortonworks, MapR, MapReduce, Market share and customer counts, Open source, Pricing||6 Comments|
I talked with my clients at MemSQL about the release of MemSQL 4.0. Let’s start with the reminders:
- MemSQL started out as in-memory OTLP (OnLine Transaction Processing) DBMS …
- … but quickly positioned with “We also do ‘real-time’ analytic processing” …
- … and backed that up by adding a flash-based column store option …
- … before Gartner ever got around to popularizing the term HTAP (Hybrid Transaction and Analytic Processing).
- There’s also a JSON option.
The main new aspects of MemSQL 4.0 are:
- Geospatial indexing. This is for me the most interesting part.
- A new optimizer and, I suppose, query planner …
- … which in particular allow for serious distributed joins.
- Some rather parallel-sounding connectors to Spark. Hadoop and Amazon S3.
- Usual-suspect stuff including:
- More SQL coverage (I forgot to ask for details).
- Some added or enhanced administrative/tuning/whatever tools (again, I forgot to ask for details).
- Surely some general Bottleneck Whack-A-Mole.
There’s also a new free MemSQL “Community Edition”. MemSQL hopes you’ll experiment with this but not use it in production. And MemSQL pricing is now wholly based on RAM usage, so the column store is quasi-free from a licensing standpoint is as well.
|Categories: Amazon and its cloud, Columnar database management, Databricks, Spark and BDAS, GIS and geospatial, Hadoop, Investment research and trading, Market share and customer counts, MemSQL, NewSQL, Pricing, Structured documents||8 Comments|
I’m going to be out-of-sorts this week, due to a colonoscopy. (Between the prep, the procedure, and the recovery, that’s a multi-day disablement.) In the interim, here’s a collection of links, quick comments and the like.
1. Are you an engineer considering a start-up? This post is for you. It’s based on my long experience in and around such scenarios, and includes a section on “Deadly yet common mistakes”.
2. There seems to be a lot of confusion regarding the business model at my clients Databricks. Indeed, my own understanding of Databricks’ on-premises business has changed recently. There are no changes in my beliefs that:
- Databricks does not directly license or support on-premises Spark users. Rather …
- … it helps partner companies to do so, where:
- Examples of partner companies include usual-suspect Hadoop distribution vendors, and DataStax.
- “Help” commonly includes higher-level support.
However, I now get the impression that revenue from such relationships is a bigger deal to Databricks than I previously thought.
Databricks, by the way, has grown to >50 people.
3. DJ Patil and Ruslan Belkin apparently had a great session on lessons learned, covering a lot of ground. Many of the points are worth reading, but one in particular echoed something I’m hearing lots of places — “Data is super messy, and data cleanup will always be literally 80% of the work.” Actually, I’d replace the “always” by something like “very often”, and even that mainly for newish warehouses, data marts or datasets. But directionally the comment makes a whole lot of sense.
|Categories: Data integration and middleware, Databricks, Spark and BDAS, DataStax, Hadoop, Health care, Investment research and trading, Text||Leave a Comment|
I hear much discussion of shortfalls in analytic technology, especially from companies that want to fill in the gaps. But how much do these gaps actually matter? In many cases, that depends on what the analytic technology is being used for. So let’s think about some different kinds of analytic task, and where they each might most stress today’s available technology.
In separating out the task areas, I’ll focus first on the spectrum “To what extent is this supposed to produce novel insights?” and second on the dimension “To what extent is this supposed to be integrated into a production/operational system?” Issues of latency, algorithmic novelty, etc. can follow after those. In particular, let’s consider the tasks: Read more
|Categories: Business intelligence, Data warehousing, Databricks, Spark and BDAS, Hadoop, Netezza, NoSQL, Predictive modeling and advanced analytics, Tableau Software||1 Comment|