[scikit-learn] NearestNeighbors without replacement

Tue Apr 3 09:58:25 EDT 2018

Hi Dr. Varoquaux,

It seems like the SciPy function only assigns one row to one column. I need
to assign 20 controls to each case. Does the linear_sum_assignment
function, since it assigns unique pairs, depend on the order of the rows
and columns? If so, perhaps I could shuffle and then combine the pairs
together until each case has 20 unique controls.

Any thoughts on this are greatly appreciated.

Best,

Randy

On Tue, Apr 3, 2018 at 8:57 AM, Gael Varoquaux <
gael.varoquaux at normalesup.org> wrote:

> Matching to minimize a cost is known as the linear assignment problem,
> can be solved in n^3 cost, and is implemented in scikit-learn in
> sklearn.utils.linear_assignment_.linear_assignment or in recent versions
> of scipy as scipy.optimize.linear_sum_assignment
>
> Of course, this problem will require much more coding (you need to build
> your pairwise cost matrix) and much more computing cost (n^3 instead of
> n^2) than a standard nearest-neighbor.
>
> Gaël
>
> On Mon, Apr 02, 2018 at 01:47:51PM -0400, Randy Ellis wrote:
> > Hi Jake,
>
> > Thanks for the reply. Yes, trying this out resulted from looking for
> ways in
> > python to implement propensity score matching. I found a package,
> pscore_match
> > (http://www.kellieottoboni.com/pscore_match/), but the matching was
> really
> > terrible. Specifically, I'm matching based on age, race, gender, HIV
> status,
> > hepatitis C status, and sickle-cell disease status. Using
> NearestNeighbors for
> > matching performed WAY better, I was so surprised at how well every
> factor was
> > matched for. The only issue is that it uses replacement.
>
> > Here's what I'm currently testing. I need each case to match to 20
> controls, so
> > since NearestNeighbors uses replacement, I'm matching each case to many
> > controls (15000), taking all of the distances for all of the pairs, and
> > retaining only the smallest distances for each control. Since many
> controls are
> > re-used (since the algorithm uses replacement), the hope is that enough
> > controls are matched to many different cases so that each case ends up
> being
> > matched to 20 unique controls. Does this method make sense??
>
> > Best,
>
> > Randy
>
> > On Sun, Apr 1, 2018 at 10:13 PM, Jacob Vanderplas <
> jakevdp at cs.washington.edu>
> > wrote:
>
> >     On Sun, Apr 1, 2018 at 6:36 PM, Randy Ellis <randalljellis at gmail.com
> >
> >     wrote:
>
> >         Hello to the Scikit-learn community!
>
> >         I am doing case-control matching for an electronic health records
> >         study. My question is, is it possible to run Sklearn's
> NearestNeighbors
> >         function without replacement? As in, match the treated group to
> the
> >         untreated group without re-using any of the untreated group data
> >         points? If so, how? By default, it uses replacement. I know this
> >         because I tested it on some data of mine.
>
> >         The code I used is in the confirmed answer here: https://
> >         stats.stackexchange.com/questions/206832/matched-pai
> >         rs-in-python-propensity-score-matching
>
> >         Thanks so much in advance,
>
>
> >     No, pairwise matching without replacement is not implemented within
> >     scikit-learn's nearest neighbors routines.
>
> >     It seems like an algorithm you would have to think carefully about
> because
> >     the number of potential pairs grows exponentially with the number of
> >     points, and I don't think it's true that choosing the nearest
> available
> >     neighbor of points in sequence will guarantee you to find the optimal
> >     configuration. You'd also have to carefully define what you mean by
> >     "optimal"... are you seeking to minimize the sum of all distances?
> The sum
> >     of squared distances? The maximum distance? The results would change
> >     depending on the metric you define. And you'd probably have to
> figure out
> >     some way to reduce the exponential search space in order to
> calculate the
> >     result in a reasonable amount of time for your data.
>
> >     You might look into the literature on propensity score matching; I
> think
> >     that's one area where this kind of neighbors-without-replacement
> algorithm
> >     is often used.
>
> >     Best,
> >        Jake
> >
> --
>     Gael Varoquaux
>     Senior Researcher, INRIA Parietal
>     NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
>     Phone:  ++ 33-1-69-08-79-68
>     http://gael-varoquaux.info            http://twitter.com/GaelVaroquaux
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>

-- 
*Randall J. Ellis, B.S.*
PhD Student, Biomedical Science, Mount Sinai
Special Volunteer, http://www.michaelideslab.org/, NIDA IRP
Cell: (954)-260-9891
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20180403/60fa48dc/attachment-0001.html>