[scikit-learn] NearestNeighbors without replacement
gael.varoquaux at normalesup.org
Tue Apr 3 08:57:07 EDT 2018
Matching to minimize a cost is known as the linear assignment problem,
can be solved in n^3 cost, and is implemented in scikit-learn in
sklearn.utils.linear_assignment_.linear_assignment or in recent versions
of scipy as scipy.optimize.linear_sum_assignment
Of course, this problem will require much more coding (you need to build
your pairwise cost matrix) and much more computing cost (n^3 instead of
n^2) than a standard nearest-neighbor.
On Mon, Apr 02, 2018 at 01:47:51PM -0400, Randy Ellis wrote:
> Hi Jake,
> Thanks for the reply. Yes, trying this out resulted from looking for ways in
> python to implement propensity score matching. I found a package, pscore_match
> (http://www.kellieottoboni.com/pscore_match/), but the matching was really
> terrible. Specifically, I'm matching based on age, race, gender, HIV status,
> hepatitis C status, and sickle-cell disease status. Using NearestNeighbors for
> matching performed WAY better, I was so surprised at how well every factor was
> matched for. The only issue is that it uses replacement.
> Here's what I'm currently testing. I need each case to match to 20 controls, so
> since NearestNeighbors uses replacement, I'm matching each case to many
> controls (15000), taking all of the distances for all of the pairs, and
> retaining only the smallest distances for each control. Since many controls are
> re-used (since the algorithm uses replacement), the hope is that enough
> controls are matched to many different cases so that each case ends up being
> matched to 20 unique controls. Does this method make sense??
> On Sun, Apr 1, 2018 at 10:13 PM, Jacob Vanderplas <jakevdp at cs.washington.edu>
> On Sun, Apr 1, 2018 at 6:36 PM, Randy Ellis <randalljellis at gmail.com>
> Hello to the Scikit-learn community!
> I am doing case-control matching for an electronic health records
> study. My question is, is it possible to run Sklearn's NearestNeighbors
> function without replacement? As in, match the treated group to the
> untreated group without re-using any of the untreated group data
> points? If so, how? By default, it uses replacement. I know this
> because I tested it on some data of mine.
> The code I used is in the confirmed answer here: https://
> Thanks so much in advance,
> No, pairwise matching without replacement is not implemented within
> scikit-learn's nearest neighbors routines.
> It seems like an algorithm you would have to think carefully about because
> the number of potential pairs grows exponentially with the number of
> points, and I don't think it's true that choosing the nearest available
> neighbor of points in sequence will guarantee you to find the optimal
> configuration. You'd also have to carefully define what you mean by
> "optimal"... are you seeking to minimize the sum of all distances? The sum
> of squared distances? The maximum distance? The results would change
> depending on the metric you define. And you'd probably have to figure out
> some way to reduce the exponential search space in order to calculate the
> result in a reasonable amount of time for your data.
> You might look into the literature on propensity score matching; I think
> that's one area where this kind of neighbors-without-replacement algorithm
> is often used.
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
More information about the scikit-learn