MapReduce

Analysis of implementations of and issues associated with the parallel programming framework MapReduce. Related subjects include:

May 6, 2014

Notes and comments, May 6, 2014

After visiting California recently, I made a flurry of posts, several of which generated considerable discussion.

Here is a catch-all post to complete the set.  Read more

April 30, 2014

Spark on fire

Spark is on the rise, to an even greater degree than I thought last month.

*Yes, my fingerprints are showing again.

The most official description of what Spark now contains is probably the “Spark ecosystem” diagram from Databricks. However, at the time of this writing it is slightly out of date, as per some email from Databricks CEO Ion Stoica (quoted with permission):

… but if I were to redraw it, SparkSQL will replace Shark, and Shark will eventually become a thin layer above SparkSQL and below BlinkDB.

With this change, all the modules on top of Spark (i.e., SparkStreaming, SparkSQL, GraphX, and MLlib) are part of the Spark distribution. You can think of these modules as libraries that come with Spark.

Read more

February 9, 2014

Distinctions in SQL/Hadoop integration

Ever more products try to integrate SQL with Hadoop, and discussions of them seem confused, in line with Monash’s First Law of Commercial Semantics. So let’s draw some distinctions, starting with (and these overlap):

In particular:

Let’s go to some examples. Read more

November 19, 2013

How Revolution Analytics parallelizes R

I talked tonight with Lee Edlefsen, Chief Scientist of Revolution Analytics, and now think I understand Revolution’s parallel R much better than I did before.

There are four primary ways that people try to parallelize predictive modeling:

One confusing aspect of this discussion is that it could reference several heavily-overlapping but not identical categories of algorithms, including:

  1. External memory algorithms, which operates on datasets too big to fit in main memory, by — for starters — reading in and working on a part of the data at a time. Lee observes that these are almost always parallelizable.
  2. What Revolution markets as External Memory Algorithms, which are those external memory algorithms it has gotten around to implementing so far. These are all parallelized. They are also all in the category of …
  3. … algorithms that can be parallelized by:
    • Operating on data in parts.
    • Getting intermediate results.
    • Combining them in some way for a final result.
  4. Algorithms of the previous category, where the way of combining them specifically is in the form of summation, such as those discussed in the famous paper Map-Reduce for Machine Learning on Multicore. Not all of Revolution’s current parallel algorithms fall into this group.

To be clear, all Revolution’s parallel algorithms are in Category #2 by definition and Category #3 in practice. However, they aren’t all in Category #4.

Read more

August 25, 2013

Cloudera Hadoop strategy and usage notes

When we scheduled a call to talk about Sentry, Cloudera’s Charles Zedlewski and I found time to discuss other stuff as well. One interesting part of our discussion was around the processing “frameworks” Cloudera sees as most important.

HBase was artificially omitted from this “frameworks” discussion because Cloudera sees it as a little bit more of a “storage” system than a processing one.

Another good subject was offloading work to Hadoop, in a couple different senses of “offload”: Read more

August 8, 2013

Curt Monash on video

I made a remarkably rumpled video appearance yesterday with SiliconAngle honchos John Furrier and Dave Vellante. (Excuses include <3 hours sleep, and then a scrambling reaction to a schedule change.) Topics covered included, with approximate timechecks:

Edit: Some of my remarks were transcribed.

Related links

August 6, 2013

Hortonworks, Hadoop, Stinger and Hive

I chatted yesterday with the Hortonworks gang. The main subject was Hortonworks’ approach to SQL-on-Hadoop — commonly called Stinger —  but at my request we cycled through a bunch of other topics as well. Company-specific notes include:

Our deployment and use case discussions were a little confused, because a key part of Hortonworks’ strategy is to support and encourage the idea of combining use cases and workloads on a single cluster. But I did hear:

*By the way — Teradata seems serious about pushing the UDA as a core message.

Ecosystem notes, in Hortonworks’ perception, included:

I also asked specifically about OpenStack. Hortonworks is a member of the OpenStack project, contributes nontrivially to Swift and other subprojects, and sees Rackspace as an important partner. But despite all that, I think strong Hadoop/OpenStack integration is something for the indefinite future.

Hortonworks’ views about Hadoop 2.0 start from the premise that its goal is to support running a multitude of workloads on a single cluster. (See, for example, what I previously posted about Tez and YARN.) Timing notes for Hadoop 2.0 include:

Frankly, I think Cloudera’s earlier and necessarily incremental Hadoop 2 rollout was a better choice than Hortonworks’ later big bang, even though the core-mission aspect of Hadoop 2.0 is what was least ready. HDFS (Hadoop Distributed File System) performance, NameNode failover and so on were well worth having, and it’s more than a year between Cloudera starting supporting them and when Hortonworks is offering Hadoop 2.0.

Hortonworks’ approach to doing SQL-on-Hadoop can be summarized simply as “Make Hive into as good an analytic RDBMS as possible, all in open source”. Key elements include:  Read more

June 23, 2013

Impala and Parquet

I visited Cloudera Friday for, among other things, a chat about Impala with Marcel Kornacker and colleagues. Highlights included:

Data gets into Parquet via batch jobs only — one reason it’s important that Impala run against multiple file formats — but background format conversion is another roadmap item. A single table can be split across multiple formats — e.g., the freshest data could be in HBase, with the rest is in Parquet.

Read more

June 6, 2013

Dave DeWitt responds to Daniel Abadi

A few days ago I posted Daniel Abadi’s thoughts in a discussion of Hadapt, Microsoft PDW (Parallel Data Warehouse)/PolyBase, Pivotal/Greenplum Hawq, and other SQL-Hadoop combinations. This is Dave DeWitt’s response. Emphasis mine.

Read more

June 2, 2013

SQL-Hadoop architectures compared

The genesis of this post is:

I love my life.

Per Daniel (emphasis mine): Read more

← Previous PageNext Page →

Feed: DBMS (database management system), DW (data warehousing), BI (business intelligence), and analytics technology Subscribe to the Monash Research feed via RSS or email:

Login

Search our blogs and white papers

Monash Research blogs

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.