[scikit-learn] NearestNeighbors without replacement
Randy Ellis
randalljellis at gmail.com
Tue Apr 3 09:58:25 EDT 2018
Hi Dr. Varoquaux,
It seems like the SciPy function only assigns one row to one column. I need
to assign 20 controls to each case. Does the linear_sum_assignment
function, since it assigns unique pairs, depend on the order of the rows
and columns? If so, perhaps I could shuffle and then combine the pairs
together until each case has 20 unique controls.
Any thoughts on this are greatly appreciated.
Best,
Randy
On Tue, Apr 3, 2018 at 8:57 AM, Gael Varoquaux <
gael.varoquaux at normalesup.org> wrote:
> Matching to minimize a cost is known as the linear assignment problem,
> can be solved in n^3 cost, and is implemented in scikit-learn in
> sklearn.utils.linear_assignment_.linear_assignment or in recent versions
> of scipy as scipy.optimize.linear_sum_assignment
>
> Of course, this problem will require much more coding (you need to build
> your pairwise cost matrix) and much more computing cost (n^3 instead of
> n^2) than a standard nearest-neighbor.
>
> Gaël
>
> On Mon, Apr 02, 2018 at 01:47:51PM -0400, Randy Ellis wrote:
> > Hi Jake,
>
> > Thanks for the reply. Yes, trying this out resulted from looking for
> ways in
> > python to implement propensity score matching. I found a package,
> pscore_match
> > (http://www.kellieottoboni.com/pscore_match/), but the matching was
> really
> > terrible. Specifically, I'm matching based on age, race, gender, HIV
> status,
> > hepatitis C status, and sickle-cell disease status. Using
> NearestNeighbors for
> > matching performed WAY better, I was so surprised at how well every
> factor was
> > matched for. The only issue is that it uses replacement.
>
> > Here's what I'm currently testing. I need each case to match to 20
> controls, so
> > since NearestNeighbors uses replacement, I'm matching each case to many
> > controls (15000), taking all of the distances for all of the pairs, and
> > retaining only the smallest distances for each control. Since many
> controls are
> > re-used (since the algorithm uses replacement), the hope is that enough
> > controls are matched to many different cases so that each case ends up
> being
> > matched to 20 unique controls. Does this method make sense??
>
> > Best,
>
> > Randy
>
> > On Sun, Apr 1, 2018 at 10:13 PM, Jacob Vanderplas <
> jakevdp at cs.washington.edu>
> > wrote:
>
> > On Sun, Apr 1, 2018 at 6:36 PM, Randy Ellis <randalljellis at gmail.com
> >
> > wrote:
>
> > Hello to the Scikit-learn community!
>
> > I am doing case-control matching for an electronic health records
> > study. My question is, is it possible to run Sklearn's
> NearestNeighbors
> > function without replacement? As in, match the treated group to
> the
> > untreated group without re-using any of the untreated group data
> > points? If so, how? By default, it uses replacement. I know this
> > because I tested it on some data of mine.
>
> > The code I used is in the confirmed answer here: https://
> > stats.stackexchange.com/questions/206832/matched-pai
> > rs-in-python-propensity-score-matching
>
> > Thanks so much in advance,
>
>
> > No, pairwise matching without replacement is not implemented within
> > scikit-learn's nearest neighbors routines.
>
> > It seems like an algorithm you would have to think carefully about
> because
> > the number of potential pairs grows exponentially with the number of
> > points, and I don't think it's true that choosing the nearest
> available
> > neighbor of points in sequence will guarantee you to find the optimal
> > configuration. You'd also have to carefully define what you mean by
> > "optimal"... are you seeking to minimize the sum of all distances?
> The sum
> > of squared distances? The maximum distance? The results would change
> > depending on the metric you define. And you'd probably have to
> figure out
> > some way to reduce the exponential search space in order to
> calculate the
> > result in a reasonable amount of time for your data.
>
> > You might look into the literature on propensity score matching; I
> think
> > that's one area where this kind of neighbors-without-replacement
> algorithm
> > is often used.
>
> > Best,
> > Jake
> >
> --
> Gael Varoquaux
> Senior Researcher, INRIA Parietal
> NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
> Phone: ++ 33-1-69-08-79-68
> http://gael-varoquaux.info http://twitter.com/GaelVaroquaux
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
--
*Randall J. Ellis, B.S.*
PhD Student, Biomedical Science, Mount Sinai
Special Volunteer, http://www.michaelideslab.org/, NIDA IRP
Cell: (954)-260-9891
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20180403/60fa48dc/attachment-0001.html>
More information about the scikit-learn
mailing list