[scikit-learn] NearestNeighbors without replacement
randalljellis at gmail.com
Tue Apr 3 09:00:32 EDT 2018
Thanks Dr. Varoquax, it’s awesome you’re on this list, I’m a fan of your
Will look into this strategy.
On Tue, Apr 3, 2018 at 8:57 AM Gael Varoquaux <gael.varoquaux at normalesup.org>
> Matching to minimize a cost is known as the linear assignment problem,
> can be solved in n^3 cost, and is implemented in scikit-learn in
> sklearn.utils.linear_assignment_.linear_assignment or in recent versions
> of scipy as scipy.optimize.linear_sum_assignment
> Of course, this problem will require much more coding (you need to build
> your pairwise cost matrix) and much more computing cost (n^3 instead of
> n^2) than a standard nearest-neighbor.
> On Mon, Apr 02, 2018 at 01:47:51PM -0400, Randy Ellis wrote:
> > Hi Jake,
> > Thanks for the reply. Yes, trying this out resulted from looking for
> ways in
> > python to implement propensity score matching. I found a package,
> > (http://www.kellieottoboni.com/pscore_match/), but the matching was
> > terrible. Specifically, I'm matching based on age, race, gender, HIV
> > hepatitis C status, and sickle-cell disease status. Using
> NearestNeighbors for
> > matching performed WAY better, I was so surprised at how well every
> factor was
> > matched for. The only issue is that it uses replacement.
> > Here's what I'm currently testing. I need each case to match to 20
> controls, so
> > since NearestNeighbors uses replacement, I'm matching each case to many
> > controls (15000), taking all of the distances for all of the pairs, and
> > retaining only the smallest distances for each control. Since many
> controls are
> > re-used (since the algorithm uses replacement), the hope is that enough
> > controls are matched to many different cases so that each case ends up
> > matched to 20 unique controls. Does this method make sense??
> > Best,
> > Randy
> > On Sun, Apr 1, 2018 at 10:13 PM, Jacob Vanderplas <
> jakevdp at cs.washington.edu>
> > wrote:
> > On Sun, Apr 1, 2018 at 6:36 PM, Randy Ellis <randalljellis at gmail.com
> > wrote:
> > Hello to the Scikit-learn community!
> > I am doing case-control matching for an electronic health records
> > study. My question is, is it possible to run Sklearn's
> > function without replacement? As in, match the treated group to
> > untreated group without re-using any of the untreated group data
> > points? If so, how? By default, it uses replacement. I know this
> > because I tested it on some data of mine.
> > The code I used is in the confirmed answer here: https://
> > stats.stackexchange.com/questions/206832/matched-pai
> > rs-in-python-propensity-score-matching
> > Thanks so much in advance,
> > No, pairwise matching without replacement is not implemented within
> > scikit-learn's nearest neighbors routines.
> > It seems like an algorithm you would have to think carefully about
> > the number of potential pairs grows exponentially with the number of
> > points, and I don't think it's true that choosing the nearest
> > neighbor of points in sequence will guarantee you to find the optimal
> > configuration. You'd also have to carefully define what you mean by
> > "optimal"... are you seeking to minimize the sum of all distances?
> The sum
> > of squared distances? The maximum distance? The results would change
> > depending on the metric you define. And you'd probably have to
> figure out
> > some way to reduce the exponential search space in order to
> calculate the
> > result in a reasonable amount of time for your data.
> > You might look into the literature on propensity score matching; I
> > that's one area where this kind of neighbors-without-replacement
> > is often used.
> > Best,
> > Jake
> Gael Varoquaux
> Senior Researcher, INRIA Parietal
> NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
> Phone: ++ 33-1-69-08-79-68
> http://gael-varoquaux.info http://twitter.com/GaelVaroquaux
> scikit-learn mailing list
> scikit-learn at python.org
*Randall J. Ellis, B.S.*
PhD Student, Biomedical Science, Mount Sinai
Special Volunteer, http://www.michaelideslab.org/, NIDA IRP
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the scikit-learn