November 19, 2013

How Revolution Analytics parallelizes R

I talked tonight with Lee Edlefsen, Chief Scientist of Revolution Analytics, and now think I understand Revolution’s parallel R much better than I did before.

There are four primary ways that people try to parallelize predictive modeling:

One confusing aspect of this discussion is that it could reference several heavily-overlapping but not identical categories of algorithms, including:

  1. External memory algorithms, which operate on datasets too big to fit in main memory by, for starters, reading in and working on a part of the data at a time. Lee observes that these are almost always parallelizable.
  2. What Revolution markets as External Memory Algorithms, which are those external memory algorithms it has gotten around to implementing so far. These are all parallelized. They are also all in the category of …
  3. … algorithms that can be parallelized by:
    • Operating on data in parts.
    • Getting intermediate results.
    • Combining them in some way for a final result.
  4. Algorithms of the previous category where the specific way of combining intermediate results is summation, such as those discussed in the famous paper Map-Reduce for Machine Learning on Multicore. Not all of Revolution's current parallel algorithms fall into this group.

To be clear, all Revolution’s parallel algorithms are in Category #2 by definition and Category #3 in practice. However, they aren’t all in Category #4.
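For a concrete illustration of Category #4, here is a minimal sketch in R (mine, not Revolution's ScaleR/rx code) of fitting least-squares regression by accumulating per-chunk cross-products, where combining the intermediate results is literally just summation:

    # A minimal sketch (not Revolution's ScaleR code) of a "combine by summation"
    # algorithm: ordinary least squares built from per-chunk cross-products.
    set.seed(1)
    n <- 100000
    X <- cbind(1, matrix(rnorm(n * 2), ncol = 2))    # intercept plus 2 predictors
    y <- as.vector(X %*% c(5, 2, -3) + rnorm(n))

    chunk_of <- split(seq_len(n), rep(1:10, length.out = n))   # 10 pretend nodes

    # Each chunk contributes only X'X (3x3) and X'y (3x1); combining = adding.
    xtx <- Reduce(`+`, lapply(chunk_of, function(i) crossprod(X[i, , drop = FALSE])))
    xty <- Reduce(`+`, lapply(chunk_of, function(i) crossprod(X[i, , drop = FALSE], y[i])))

    beta_hat <- solve(xtx, xty)    # agrees with lm.fit(X, y)$coefficients

Only the small cross-product matrices move between workers, so the communication cost stays the same no matter how much data each chunk holds.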

The canonical example of how to parallelize an algorithm via intermediate results is taking the mean of a large set of numbers. Specifically:

  • Split the numbers into chunks.
  • For each chunk, compute the sum and the count; those are the intermediate results.
  • Add up the per-chunk sums, add up the per-chunk counts, and divide the former by the latter.
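In base R, a hedged sketch of that recipe (my illustration, not Revolution's actual rx* API) looks roughly like this:

    # A minimal sketch of the chunked mean: each chunk reports (sum, count),
    # and those intermediate results combine by simple addition.
    chunk_stats   <- function(chunk) list(sum = sum(chunk), count = length(chunk))
    combine_stats <- function(a, b) list(sum = a$sum + b$sum, count = a$count + b$count)

    x <- runif(1e6)                                        # stand-in for a big dataset
    chunks <- split(x, rep(1:10, length.out = length(x)))  # 10 pretend chunks/nodes

    parts <- lapply(chunks, chunk_stats)                   # each could run on a different node
    total <- Reduce(combine_stats, parts)
    stopifnot(all.equal(total$sum / total$count, mean(x))) # matches the serial mean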

Unfortunately, it's hard to articulate a precise characterization of this class of parallelizable algorithms. That said:

I asked Lee about algorithms that were inherently difficult to parallelize in this style, and he expressed optimism that some other approach would usually work in practice. In particular, we had a lively discussion about finding the exact median, or more generally finding n-tiles and the whole “empirical distribution”. Lee said that, for example, it is extremely fast to bin billions of values into 10,000 buckets. Further, he suggested it is very fast in general to do the operation for integer values, and hence also for any values with a reasonably limited number of significant digits.
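To make the binning idea concrete, here is a hedged sketch (mine, not Lee's actual procedure) in which per-chunk bucket counts are the intermediate results, they combine by addition, and an approximate median is read off the cumulative counts; for simplicity it assumes the values lie in [0, 1]:

    # A sketch (not Lee's actual procedure) of approximate n-tiles via binning:
    # per-chunk bucket counts are the intermediate results; they combine by addition.
    n_buckets <- 10000
    breaks <- seq(0, 1, length.out = n_buckets + 1)   # assumes values lie in [0, 1]

    bucket_counts <- function(chunk)
      tabulate(findInterval(chunk, breaks, rightmost.closed = TRUE), nbins = n_buckets)

    x <- runif(1e6)
    chunks <- split(x, rep(1:8, length.out = length(x)))
    counts <- Reduce(`+`, lapply(chunks, bucket_counts))   # combining = summation

    # Approximate median: midpoint of the bucket where the cumulative count crosses 50%.
    b <- which(cumsum(counts) >= sum(counts) / 2)[1]
    approx_median <- (breaks[b] + breaks[b + 1]) / 2

Any other quantile can be read off the same histogram, and in principle one could then re-bin just the bucket containing the target value to narrow in on it exactly.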

As should be clear from this discussion, Revolution’s parallel algorithms are indeed parallel for any reasonable kind of distribution of work. Although they were shipped first for multi-core single-server and MPI environments, the recent ports to Teradata and generic Hadoop MapReduce seem to have been fairly straightforward. Revolution seems to have good modularity between the algorithms themselves, the intermediate data passing, and the original algorithm launch, and hence makes strong claims of R code portability — but the list of exceptions in “portable except for ____” did seem to lengthen a bit each time we returned to the subject.
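To illustrate the kind of modularity being claimed, here is a hedged sketch (mine, not Revolution's actual architecture): the per-chunk and combine functions never change, and only the engine that distributes the work does, which is roughly what portability across backends requires.

    # A sketch of the modularity idea (not Revolution's actual architecture):
    # the per-chunk and combine functions stay the same; only the engine that
    # distributes the work changes.
    run_serial <- function(chunks, f, combine) Reduce(combine, lapply(chunks, f))
    run_forked <- function(chunks, f, combine)          # Unix-alikes; use parLapply on Windows
      Reduce(combine, parallel::mclapply(chunks, f, mc.cores = 2))

    chunks  <- split(runif(1e5), rep(1:4, length.out = 1e5))
    f       <- function(chunk) c(sum = sum(chunk), count = length(chunk))
    combine <- function(a, b) a + b

    stopifnot(all.equal(run_serial(chunks, f, combine),
                        run_forked(chunks, f, combine)))  # same answer either way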

Finally, notes on Revolution’s Teradata implementation include:

while notes on Revolution’s initial Hadoop implementation start:
