I talked tonight with Lee Edlefsen, Chief Scientist of Revolution Analytics, and now think I understand Revolution’s parallel R much better than I did before.
There are four primary ways that people try to parallelize predictive modeling:
- They can run the same algorithm on different parts of a dataset on different nodes, then return all the results, and claim they’ve parallelized. This is trivial and not really a solution. It is also the last-ditch fallback position for those who parallelize more seriously.
- They can generate intermediate results from different parts of a dataset on different nodes, then generate and return a single final result. This is what Revolution does.
- They can parallelize the linear algebra that underlies so many algorithms. Netezza and Greenplum tried this, but I don’t think it worked out very well in either case. Lee cited a saying in statistical computing “If you’re using matrices, you’re doing it wrong”; he thinks shortcuts and workarounds are almost always the better way to go.
- They can jack up the speed of inter-node communication, perhaps via MPI (Messaging Passing Interface), so that full parallelization isn’t needed. That’s SAS’ main approach.
One confusing aspect of this discussion is that it could reference several heavily-overlapping but not identical categories of algorithms, including:
- External memory algorithms, which operates on datasets too big to fit in main memory, by — for starters — reading in and working on a part of the data at a time. Lee observes that these are almost always parallelizable.
- What Revolution markets as External Memory Algorithms, which are those external memory algorithms it has gotten around to implementing so far. These are all parallelized. They are also all in the category of …
- … algorithms that can be parallelized by:
- Operating on data in parts.
- Getting intermediate results.
- Combining them in some way for a final result.
- Algorithms of the previous category, where the way of combining them specifically is in the form of summation, such as those discussed in the famous paper Map-Reduce for Machine Learning on Multicore. Not all of Revolution’s current parallel algorithms fall into this group.
To be clear, all Revolution’s parallel algorithms are in Category #2 by definition and Category #3 in practice. However, they aren’t all in Category #4.
The canonical example of how to parallelize an algorithm via intermediate results is taking the mean of a large set of numbers. Specifically:
- For each subset of data, you both count the entries and sum the values.
- Then to combine those intermediate results:
- You sum the sums.
- You also sum the counts.
- You divide the former result by the latter.
Unfortunately, it’s hard to clearly articulate a precise description of these parallelizable algorithms. That said:
- What you want is for the end result to be identical irrespective of how the data is split up. (Duh!)
- Lee suggested that it is sufficient but not necessary that the way of combining the intermediate results be both commutative and associative.
- To date, all of Revolution’s algorithms are — you guessed it! — commutative and associative.
I asked Lee about algorithms that were inherently difficult to parallelize in this style, and he expressed optimism that some other approach would usually work in practice. In particular, we had a lively discussion about finding the exact median, or more generally finding n-tiles and the whole “empirical distribution”. Lee said that, for example, it is extremely fast to bin billions of values into 10,000 buckets. Further, he suggested it is very fast in general to do the operation for integer values, and hence also for any values with a reasonably limited number of significant digits.
As should be clear from this discussion, Revolution’s parallel algorithms are indeed parallel for any reasonable kind of distribution of work. Although they were shipped first for multi-core single-server and MPI environments, the recent ports to Teradata and generic Hadoop MapReduce seem to have been fairly straightforward. Revolution seems to have good modularity between the algorithms themselves, the intermediate data passing, and the original algorithm launch, and hence makes strong claims of R code portability — but the list of exceptions in “portable except for ____” did seem to lengthen a bit each time we returned to the subject.
Finally, notes on Revolution’s Teradata implementation include:
- There’s a master process (external stored procedure) which then generates SQL and invokes table operators.
- The whole thing runs in protected mode (i.e. out-of-process). Lee thinks that there’s only a small performance penalty vs. in-process.
- (For some reason I found this amusing) When you send an R job to Teradata, the R code itself is shipped via ODBC.
while notes on Revolution’s initial Hadoop implementation start:
- One way it talks to data in HDFS (Hadoop Distributed File System) is through LibHDFS. The other, when available, is ODBC.
- It uses generic MapReduce. Faster alternatives may be implemented down the road.
- Teradata is seeing interest in in-database R. (September, 2013)