April 6, 2011

So can logistic regression be parallelized or not?

A core point in SAS’ pitch for its new MPI (Message-Passing Interface) in-memory technology seems to be that logistic regression is really important, and that shared-nothing MPP doesn’t let you parallelize it. The Mahout/Hadoop folks also seem to despair of parallelizing logistic regression.

On the other hand, Aster Data said it had parallelized logistic regression a year ago. (Slides 6-7 from a mid-2010 Aster deck may be clearer.) I’m guessing Fuzzy Logix might make a similar claim, although I’m not really sure.

What gives?

Comments

8 Responses to “So can logistic regression be parallelized or not?”

  1. John M. Wildenthal on April 6th, 2011 11:46 am

    The standard estimation technique is to model the log of the odds as a linear function of the predictors:

    https://secure.wikimedia.org/wikipedia/en/wiki/Logistic_regression#Formal_mathematical_specification

    If ML Estimates can be determined by Iteratively Reweighted Least Squares, I don’t see why each WLS iteration couldn’t be parallelized.

    Mahout/Hadoop’s choice of SGD instead of IRLS appears to be the problem there. SGD does allow for incremental updating, which could be important for some uses.
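As a rough illustration of the point in comment 1, here is a minimal sketch (not SAS's, Aster's, or Mahout's implementation) of IRLS in which each weighted-least-squares step is built from per-partition sums. The toy data, the two-way partitioning, and all function names are assumptions made for the example. Only the small matrices X^T W X and X^T W z have to be combined across partitions in each iteration.

```python
import math

def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def partition_stats(rows, beta):
    """Per-partition work: this partition's share of X^T W X and X^T W z."""
    k = len(beta)
    xtwx = [[0.0] * k for _ in range(k)]
    xtwz = [0.0] * k
    for x, y in rows:
        eta = sum(b * xi for b, xi in zip(beta, x))
        p = sigmoid(eta)
        w = max(p * (1.0 - p), 1e-10)   # IRLS weight, clipped away from zero
        z = eta + (y - p) / w           # working response
        for i in range(k):
            xtwz[i] += x[i] * w * z
            for j in range(k):
                xtwx[i][j] += x[i] * w * x[j]
    return xtwx, xtwz

def solve(a, b):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= f * m[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def irls(partitions, k, iterations=25):
    beta = [0.0] * k
    for _ in range(iterations):
        # Each call below touches only one partition, so it could run on its own node.
        stats = [partition_stats(rows, beta) for rows in partitions]
        # Combine step: add up the small k-by-k and k-by-1 partial results.
        xtwx = [[sum(s[0][i][j] for s in stats) for j in range(k)] for i in range(k)]
        xtwz = [sum(s[1][i] for s in stats) for i in range(k)]
        beta = solve(xtwx, xtwz)
    return beta

# Hypothetical toy data: rows are ((intercept, feature), label), with overlapping classes.
values = [-3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5]
labels = [0, 0, 1, 0, 1, 0, 1, 1]
data = [((1.0, v), y) for v, y in zip(values, labels)]

beta_split = irls([data[:4], data[4:]], k=2)   # two "nodes"
beta_whole = irls([data], k=2)                 # one "node"
```

Under these assumptions the split and unsplit fits agree up to floating-point rounding, since the combined sums are the same quantities either way. In a real cluster the combine step would be a reduction over the network rather than a Python list comprehension.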

  2. Mike Beckerle on April 7th, 2011 6:34 am

    Many “non-parallelizable” algorithms are parallelized by finding an approximation that for practical purposes is just as good. Mathematically they are not equivalent, and there may be corner cases that are not handled as well. Nevertheless, for many practical examples … nobody cares.

    By analogy, I’ve always been a fan of piecewise linear techniques, vs. more careful curve fits. By many technical measures they are coarse, inelegant, etc. Practically … they rock.

  3. Jon Bock on April 7th, 2011 2:51 pm

    Curt,

    Among the different algorithms and approximations for logistic regression, some definitely can be parallelized in a shared-nothing MPP system even while others are not suitable for parallelization there. In the Mahout case the algorithm used is stochastic gradient descent, which as they mention is inherently not suited to parallelization in a shared-nothing MPP system. However, other algorithms such as batch gradient descent and Newton’s method (also referred to as Newton-Raphson) are parallelizable.

    As you mention, Aster Data’s logistic regression function is parallelized. We initially released a version last year based on the batch gradient descent method and have been busy expanding the algorithms available since then. Of note is that our logistic regression implementation is designed not only for cases that fit in memory but also for cases that are larger than available memory.

    –Jon
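The distinction drawn in comment 3 can be sketched as follows. This is an illustrative toy, not Aster Data's implementation, and the data, step size, and function names are assumptions. In batch gradient descent, all rows use the same coefficient vector within an iteration, so the gradient is a plain sum of per-partition gradients. In SGD, each update depends on the coefficients produced by the previous row, which is what makes it awkward to spread across shared-nothing nodes.

```python
import math

def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def partial_gradient(rows, beta):
    """Per-node work: log-likelihood gradient over the local rows only."""
    g = [0.0] * len(beta)
    for x, y in rows:
        p = sigmoid(sum(b * xi for b, xi in zip(beta, x)))
        for i, xi in enumerate(x):
            g[i] += (y - p) * xi
    return g

def batch_gradient_fit(partitions, k, step=0.5, iterations=2000):
    """Batch gradient ascent on the mean log-likelihood (descent on its negative)."""
    beta = [0.0] * k
    n = sum(len(rows) for rows in partitions)
    for _ in range(iterations):
        # Every partition sees the same beta, so these calls are independent.
        partials = [partial_gradient(rows, beta) for rows in partitions]
        total = [sum(g[i] for g in partials) for i in range(k)]
        beta = [b + step * t / n for b, t in zip(beta, total)]
    return beta

# Hypothetical toy data: rows are ((intercept, centered feature), label).
values = [-3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5]
labels = [0, 0, 1, 0, 1, 0, 1, 1]
data = [((1.0, v), y) for v, y in zip(values, labels)]
partitions = [data[:3], data[3:6], data[6:]]   # three "nodes"

# The sum of per-partition gradients matches the gradient over all rows.
beta0 = [0.1, -0.2]
whole = partial_gradient(data, beta0)
parts = [partial_gradient(rows, beta0) for rows in partitions]
summed = [sum(g[i] for g in parts) for i in range(2)]

beta_fit = batch_gradient_fit(partitions, k=2)
final_gradient = partial_gradient(data, beta_fit)
```

The step size of 0.5 is tuned to this small, centered data set; other data would generally need a different step or a line search.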

  4. Randolph on April 7th, 2011 10:02 pm

    Curt,
    My first response was to ask “is there nothing that MPI can’t do?” but a quick search turned up this paper, where the authors use MapReduce to solve this problem as well:
    ( http://www.siam.org/proceedings/datamining/2009/dm09_107_singhs.pdf )

    MPI has the advantage that shared-memory communicators can be used as well as network communicators, so problems of granularity-related efficiency can be better addressed.

    The more interesting subject that you touched on is what we call “orthogonal parallelism”; this is where, instead of just splitting records up over various CPUs in the shared-nothing cluster, large objects and computations that may relate to a single field can be orthogonally parallelised over the cluster as well.

    For example: a large database contains many EHRs that include fields containing large images or RNA or DNA sequences. Although a conventional MPP system would split the records over the cluster, each large DNA object (for example) would still exist entirely on only 1 node.
    A better solution would be to split the large objects up by storing these fields in parallel.

    Additionally, orthogonally parallel computations such as the parallel hammer algorithm can be used to analyse these large DNA fields in parallel.

    Version 1.1 of DeepCloud MPP will permit orthogonally parallel UDFs for this purpose.

  5. Ajay Ohri on April 8th, 2011 1:48 am

    A big part of logistic modeling can be done in parallel. In fact, it can be done using separate SAS sessions on a multi-core machine as well.
    In fact, I used to run parallel logistic regressions in SAS System circa 2004 (by changing one or two variables and redoing proc logistic). Essentially, I was running two or three logistic regressions with one or two variables changed in order to finalize the model, and estimating VIF, etc.
    Fitting the parameters, estimating deviation from actual, and scoring the model can all be parallelized, in my opinion.

  6. Marco Ullasci on April 8th, 2011 3:14 am

    Mathematics on a computer, when dealing with all but the most trivial computations, is approximation to begin with.
    As long as the results remain inside the safety range required by the specific use, I really appreciate a parallel implementation that lets me scale out.

  7. Application areas for SAS HPA | DBMS 2 : DataBase Management System Services on April 22nd, 2011 3:54 am

    [...] Meanwhile, in another interview I heard about, SAS emphasized retailers. Indeed, that’s what spawned my recent post about logistic regression. [...]

