April 6, 2011
So can logistic regression be parallelized or not?
A core point in SAS’ pitch for its new MPI (Message-Passing Interface) in-memory technology seems to be that logistic regression is really important, and that shared-nothing MPP doesn’t let you parallelize it. The Mahout/Hadoop folks also seem to despair of parallelizing logistic regression.
On the other hand, Aster Data said it had parallelized logistic regression a year ago. (Slides 6-7 from a mid-2010 Aster deck may be clearer.) I’m guessing Fuzzy Logix might make a similar claim, although I’m not really sure.
What gives?
Categories: Aster Data, Hadoop, Parallelization, Predictive modeling and advanced analytics, SAS Institute
Comments
7 Responses to “So can logistic regression be parallelized or not?”

The standard estimation technique is to model the log of the odds as a linear function of the predictors:
https://secure.wikimedia.org/wikipedia/en/wiki/Logistic_regression#Formal_mathematical_specification
If the maximum-likelihood estimates can be determined by Iteratively Reweighted Least Squares (IRLS), I don’t see why each weighted least squares (WLS) iteration couldn’t be parallelized.
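To make that concrete, here is a minimal sketch of IRLS in which each iteration only needs per-partition sums of X'WX and X'Wz. It is an illustration under assumptions, not anyone's shipping implementation: the function names (`partial_stats`, `solve`, `irls`) are hypothetical, the "partitions" are plain Python lists standing in for nodes, and the first feature of every row is assumed to be a constant 1.0 for the intercept.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def partial_stats(rows, beta):
    """Per-partition step: this partition's contribution to X'WX and X'Wz."""
    d = len(beta)
    xtwx = [[0.0] * d for _ in range(d)]
    xtwz = [0.0] * d
    for x, y in rows:
        eta = sum(b * xi for b, xi in zip(beta, x))
        p = sigmoid(eta)
        w = max(p * (1.0 - p), 1e-10)   # IRLS weight, clamped away from zero
        z = eta + (y - p) / w           # working response
        for i in range(d):
            xtwz[i] += x[i] * w * z
            for j in range(d):
                xtwx[i][j] += x[i] * w * x[j]
    return xtwx, xtwz


def solve(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def irls(partitions, dim, iterations=25):
    """Fit coefficients; only small dim x dim summaries leave each partition."""
    beta = [0.0] * dim
    for _ in range(iterations):
        xtwx = [[0.0] * dim for _ in range(dim)]
        xtwz = [0.0] * dim
        for rows in partitions:         # each call could run on a separate node
            a, b = partial_stats(rows, beta)
            for i in range(dim):
                xtwz[i] += b[i]
                for j in range(dim):
                    xtwx[i][j] += a[i][j]
        beta = solve(xtwx, xtwz)        # cheap central step
    return beta
```

The point of the sketch is that the data never moves: each partition ships back a dim-by-dim matrix and a dim-length vector per iteration, and the result should not depend on how the rows are split, apart from floating-point rounding.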
Mahout/Hadoop’s choice of SGD instead of IRLS appears to be the problem there. SGD does allow for incremental updating, which could be important for some uses.
Many “non-parallelizable” algorithms are parallelized by finding an approximation that for practical purposes is just as good. Mathematically they are not equivalent, and there may be corner cases that are not handled as well. Nevertheless, for many practical examples… nobody cares.
By analogy, I’ve always been a fan of piecewise linear techniques, vs. more careful curve fits. By many technical measures they are coarse, inelegant, etc. Practically… they rock.
Curt,
Among the different algorithms and approximations for logistic regression, some definitely can be parallelized in a shared-nothing MPP system even while others are not suitable for parallelization there. In the Mahout case the algorithm used is stochastic gradient descent, which as they mention is inherently not suited to parallelization in a shared-nothing MPP system. However, other algorithms such as batch gradient descent and Newton’s method (also referred to as Newton-Raphson) are parallelizable.
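As a rough illustration of why the batch methods parallelize, here is a minimal sketch of batch gradient descent in which every partition computes its own gradient contribution and only those small vectors are summed. This is not Aster Data's or Mahout's code; the function names, the learning rate, and the use of Python lists as stand-ins for shared-nothing nodes are all assumptions made for the example.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def partition_gradient(rows, beta):
    """Log-likelihood gradient contributed by one partition's rows."""
    grad = [0.0] * len(beta)
    for x, y in rows:
        p = sigmoid(sum(b * xi for b, xi in zip(beta, x)))
        for i, xi in enumerate(x):
            grad[i] += (y - p) * xi
    return grad


def batch_gradient_descent(partitions, dim, rate=0.1, iterations=5000):
    """Descend the mean negative log-likelihood using summed partial gradients."""
    beta = [0.0] * dim
    n = sum(len(rows) for rows in partitions)
    for _ in range(iterations):
        total = [0.0] * dim
        for rows in partitions:         # independent work, one call per node
            g = partition_gradient(rows, beta)
            total = [t + gi for t, gi in zip(total, g)]
        # Every row has been seen before the single update for this pass.
        beta = [b + rate * t / n for b, t in zip(beta, total)]
    return beta
```

The contrast with stochastic gradient descent is in the update: here the coefficients change once per full pass, so the per-row work within a pass has no ordering dependency, whereas SGD updates after each row and every step depends on the one before it.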
As you mention, Aster Data’s logistic regression function is parallelized. We initially released a version last year based on the batch gradient descent method and have been busy expanding the algorithms available since then. Of note is that our logistic regression implementation is designed to handle not only cases that fit in memory but also cases that are larger than available memory.
–Jon
Curt,
My first response was to ask “is there nothing that MPI can’t do?” but a quick search turned up this paper, in which the authors use MapReduce to solve this problem as well:
( http://www.siam.org/proceedings/datamining/2009/dm09_107_singhs.pdf )
MPI has the advantage that shared-memory communicators can be used as well as network communicators, so problems of granularity-related efficiency can be better addressed.
The more interesting subject that you touched on is what we call “orthogonal parallelism”; this is where, instead of just splitting records up over the various CPUs in the shared-nothing cluster, large objects and computations that may relate to a single field can be orthogonally parallelised over the cluster as well.
For example: a large database contains many EHRs that include fields containing large images or RNA or DNA sequences. Although a conventional MPP system would split the records over the cluster, each large DNA object (for example) would still exist entirely on only 1 node.
A better solution would be to split the large objects up by storing these fields in parallel.
Additionally, orthogonally parallel computations such as the parallel hammer algorithm can be used to analyse these large DNA fields in parallel.
Version 1.1 of DeepCloud MPP will permit orthogonally parallel UDFs for this purpose.
A big part of logistic modeling can be done in parallel. In fact, it can be done using separate SAS sessions on a multi-core machine as well.
In fact, I used to run parallel logistic regressions in the SAS System circa 2004 (by changing one or two variables and redoing proc logistic). Essentially I was running two or three logistic regressions with one or two variables changed in order to finalize the model, estimate VIF, etc.
Fitting the parameters, estimating deviation from actual, and scoring the model can all be parallelized, in my opinion.
Mathematics on a computer, when dealing with all but the most trivial computations, is approximation to begin with.
As long as the results remain inside the safety range required by the specific use, I really appreciate a parallel implementation that lets me scale out.
[...] Meanwhile, in another interview I heard about, SAS emphasized retailers. Indeed, that’s what spawned my recent post about logistic regression. [...]