Databricks, Spark and BDAS

Discussion of BDAS (Berkeley Data Analytics Systems), especially Spark and related projects, and also of Databricks, the company commercializing Spark.

October 13, 2014

Context for Cloudera

Hadoop World/Strata is this week, so of course my clients at Cloudera will have a bunch of announcements. Without front-running those, I think it might be interesting to review the current state of the Cloudera product line. Details may be found on the Cloudera product comparison page. Examining those details helps, I think, with understanding where Cloudera does and doesn’t place sales and marketing focus, which given Cloudera’s Hadoop market stature is in my opinion an interesting thing to analyze.

So far as I can tell (and there may be some errors in this, as Cloudera is not always accurate in explaining the fine details):

In analyzing all this, I’m focused on two particular aspects:

Read more

October 10, 2014

Notes on predictive modeling, October 10, 2014

As planned, I’m getting more active in predictive modeling. Anyhow …

1. I still believe most of what I said in a July, 2013 predictive modeling catch-all post. However, I haven’t heard as much subsequently about Ayasdi as I had expected to.

2. The most controversial part of that post was probably the claim:

I think the predictive modeling state of the art has become:

  • Cluster in some way.
  • Model separately on each cluster.

In particular:

3. Nutonian is now a client. I just had my first meeting with them this week. To a first approximation, they’re somewhat like KXEN (sophisticated math, non-linear models, ease of modeling, quasi-automagic feature selection), but with differences that start: Read more

October 5, 2014

Spark vs. Tez, revisited

I’m on record as noting and agreeing with an industry near-consensus that Spark, rather than Tez, will be the replacement for Hadoop MapReduce. I presumed that Hortonworks, which is pushing Tez, disagreed. But Shaun Connolly of Hortonworks suggested a more nuanced view. Specifically, Shaun tweeted thoughts including:

Tez vs Spark = Apples vs Oranges.

Spark is general-purpose engine with elegant APIs for app devs creating modern data-driven apps, analytics, and ML algos.

Tez is a framework for expressing purpose-built YARN-based DAGs; its APIs are for ISVs & engine/tool builders who embed it

[For example], Hive embeds Tez to convert its SQL needs into purpose-built DAGs expressed optimally and leveraging YARN

That said, I haven’t yet had a chance to understand what advantages Tez might have over Spark in the use cases that Shaun relegates it to.

Related link

September 28, 2014

Some stuff on my mind, September 28, 2014

1. I wish I had some good, practical ideas about how to make a political difference around privacy and surveillance. Nothing else we discuss here is remotely as important. I presumably can contribute an opinion piece to, more or less, the technology publication(s) of my choice; that can have a small bit of impact. But I’d love to do better than that. Ideas, anybody?

2. A few thoughts on cloud, colocation, etc.:

3. As for the analytic DBMS industry: Read more

September 7, 2014

An idealized log management and analysis system — from whom?

I’ve talked with many companies recently that believe they are:

At best, I think such competitive claims are overwrought. Still, it’s a genuinely important subject and opportunity, so let’s consider what a great log management and analysis system might look like.

Much of this discussion could apply to machine-generated data in general. But right now I think more players are doing product management with an explicit conception either of log management or event-series analytics, so for this post I’ll share that focus too.

A short answer might be “Splunk, but with more analytic functionality and more scalable performance, at lower cost, plus numerous coupons for free pizza.” A more constructive and bottoms-up approach might start with:  Read more

June 8, 2014

Optimism, pessimism, and fatalism — fault-tolerance, Part 2

The pessimist thinks the glass is half-empty.
The optimist thinks the glass is half-full.
The engineer thinks the glass was poorly designed.

Most of what I wrote in Part 1 of this post was already true 15 years ago. But much gets added in the modern era, considering that:

And so there’s been innovation in numerous cluster-related subjects, two of which are:

Distributed database consistency

When a distributed database lives up to the same consistency standards as a single-node one, distributed query is straightforward. Performance may be an issue, however, which is why we have seen a lot of:

But in workloads with low-latency writes, living up to those standards is hard. The 1980s approach to distributed writing was two-phase commit (2PC), which may be summarized as:  Read more

May 6, 2014

Notes and comments, May 6, 2014

After visiting California recently, I made a flurry of posts, several of which generated considerable discussion.

Here is a catch-all post to complete the set.  Read more

April 30, 2014

Spark on fire

Spark is on the rise, to an even greater degree than I thought last month.

*Yes, my fingerprints are showing again.

The most official description of what Spark now contains is probably the “Spark ecosystem” diagram from Databricks. However, at the time of this writing it is slightly out of date, as per some email from Databricks CEO Ion Stoica (quoted with permission):

… but if I were to redraw it, SparkSQL will replace Shark, and Shark will eventually become a thin layer above SparkSQL and below BlinkDB.

With this change, all the modules on top of Spark (i.e., SparkStreaming, SparkSQL, GraphX, and MLlib) are part of the Spark distribution. You can think of these modules as libraries that come with Spark.

Read more

March 17, 2014

Notes and comments, March 17, 2014

I have ever more business-advice posts up on Strategic Messaging. Recent subjects include pricing and stealth-mode marketing. Other stuff I’ve been up to includes:

The Spark buzz keeps increasing; almost everybody I talk with expects Spark to win big, probably across several use cases.

Disclosure: I’ll soon be in a substantial client relationship with Databricks, hoping to improve their stealth-mode marketing. :D

The “real-time analytics” gold rush I called out last year continues. A large fraction of the vendors I talk with have some variant of “real-time analytics” as a central message.

Basho had a major change in leadership. A Twitter exchange ensued. :) Joab Jackson offered a more sober — figuratively and literally — take.

Hadapt laid off its sales and marketing folks, and perhaps some engineers as well. In a nutshell, Hadapt’s approach to SQL-on-Hadoop wasn’t selling vs. the many alternatives, and Hadapt is doubling down on poly-structured data*/schema-on-need.

*While Hadapt doesn’t to my knowledge use the term “poly-structured data”, some other vendors do. And so I may start using it more myself, at least when the poly-structured/multi-structured distinction actually seems significant.

WibiData is partnering with DataStax, WibiData is of course pleased to get access to Cassandra’s user base, which gave me the opportunity to ask why they thought Cassandra had beaten HBase in those accounts. The answer was performance and availability, while Cassandra’s traditional lead in geo-distribution wasn’t mentioned at all.

Disclosure: My fingerprints are all over that deal.

In other news, WibiData has had some executive departures as well, but seems to be staying the course on its strategy. I continue to think that WibiData has a really interesting vision about how to do large-data-volume interactive computing, and anybody in that space would do well to talk with them or at least look into the open source projects WibiData sponsors.

I encountered another apparently-popular machine-learning term — bandit model. It seems to be glorified A/B testing, and it seems to be popular. I think the point is that it tries to optimize for just how much you invest in testing unproven (for good or bad) alternatives.

I had an awkward set of interactions with Gooddata, including my longest conversations with them since 2009. Gooddata is in the early days of trying to offer an all-things-to-all-people analytic stack via SaaS (Software as a Service). I gather that Hadoop, Vertica, PostgreSQL (a cheaper Vertica alternative), Spark, Shark (as a faster version of Hive) and Cassandra (under the covers) are all in the mix — but please don’t hold me to those details.

I continue to think that computing is moving to a combination of appliances, clusters, and clouds. That said, I recently bought a new gaming-class computer, and spent many hours gaming on it just yesterday.* I.e., there’s room for general-purpose workstations as well. But otherwise, I’m not hearing anything that contradicts my core point.

*The last beta weekend for The Elder Scrolls Online; I loved Morrowind.

February 2, 2014

Some stuff I’m thinking about (early 2014)

From time to time I like to do “what I’m working on” posts. From my recent blogging, you probably already know that includes:

Other stuff on my mind includes but is not limited to:

1. Certain categories of buying organizations are inherently leading-edge.

Fine. But what really intrigues me is when more ordinary enterprises also put leading-edge technologies into production. I pester everybody for examples of that.

Read more

Next Page →

Feed: DBMS (database management system), DW (data warehousing), BI (business intelligence), and analytics technology Subscribe to the Monash Research feed via RSS or email:

Login

Search our blogs and white papers

Monash Research blogs

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.