[scikit-learn] NearestNeighbors without replacement

Gael Varoquaux gael.varoquaux at normalesup.org
Tue Apr 3 08:57:07 EDT 2018

Matching to minimize a cost is known as the linear assignment problem,
can be solved in n^3 cost, and is implemented in scikit-learn in
sklearn.utils.linear_assignment_.linear_assignment or in recent versions
of scipy as scipy.optimize.linear_sum_assignment

Of course, this problem will require much more coding (you need to build
your pairwise cost matrix) and much more computing cost (n^3 instead of
n^2) than a standard nearest-neighbor.


On Mon, Apr 02, 2018 at 01:47:51PM -0400, Randy Ellis wrote:
> Hi Jake,

> Thanks for the reply. Yes, trying this out resulted from looking for ways in
> python to implement propensity score matching. I found a package, pscore_match
> (http://www.kellieottoboni.com/pscore_match/), but the matching was really
> terrible. Specifically, I'm matching based on age, race, gender, HIV status,
> hepatitis C status, and sickle-cell disease status. Using NearestNeighbors for
> matching performed WAY better, I was so surprised at how well every factor was
> matched for. The only issue is that it uses replacement. 

> Here's what I'm currently testing. I need each case to match to 20 controls, so
> since NearestNeighbors uses replacement, I'm matching each case to many
> controls (15000), taking all of the distances for all of the pairs, and
> retaining only the smallest distances for each control. Since many controls are
> re-used (since the algorithm uses replacement), the hope is that enough
> controls are matched to many different cases so that each case ends up being
> matched to 20 unique controls. Does this method make sense??

> Best,

> Randy  

> On Sun, Apr 1, 2018 at 10:13 PM, Jacob Vanderplas <jakevdp at cs.washington.edu>
> wrote:

>     On Sun, Apr 1, 2018 at 6:36 PM, Randy Ellis <randalljellis at gmail.com>
>     wrote:

>         Hello to the Scikit-learn community!

>         I am doing case-control matching for an electronic health records
>         study. My question is, is it possible to run Sklearn's NearestNeighbors
>         function without replacement? As in, match the treated group to the
>         untreated group without re-using any of the untreated group data
>         points? If so, how? By default, it uses replacement. I know this
>         because I tested it on some data of mine.

>         The code I used is in the confirmed answer here: https://
>         stats.stackexchange.com/questions/206832/matched-pai
>         rs-in-python-propensity-score-matching

>         Thanks so much in advance,

>     No, pairwise matching without replacement is not implemented within
>     scikit-learn's nearest neighbors routines.

>     It seems like an algorithm you would have to think carefully about because
>     the number of potential pairs grows exponentially with the number of
>     points, and I don't think it's true that choosing the nearest available
>     neighbor of points in sequence will guarantee you to find the optimal
>     configuration. You'd also have to carefully define what you mean by
>     "optimal"... are you seeking to minimize the sum of all distances? The sum
>     of squared distances? The maximum distance? The results would change
>     depending on the metric you define. And you'd probably have to figure out
>     some way to reduce the exponential search space in order to calculate the
>     result in a reasonable amount of time for your data.

>     You might look into the literature on propensity score matching; I think
>     that's one area where this kind of neighbors-without-replacement algorithm
>     is often used.

>     Best,
>        Jake
    Gael Varoquaux
    Senior Researcher, INRIA Parietal
    NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
    Phone:  ++ 33-1-69-08-79-68
    http://gael-varoquaux.info            http://twitter.com/GaelVaroquaux

More information about the scikit-learn mailing list