<div dir="ltr">Hi Dr. Varoquaux,Â <div><br></div><div>It seems like the SciPy function only assigns one row to one column. I need to assign 20 controls to each case. Does the linear_sum_assignment function, since it assigns unique pairs, depend on the order of the rows and columns? If so, perhaps I could shuffle and then combine the pairs together until each case has 20 unique controls.</div><div><br></div><div>Any thoughts on this are greatly appreciated.</div><div><br></div><div>Best,</div><div><br></div><div>RandyÂ </div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Apr 3, 2018 at 8:57 AM, Gael Varoquaux <span dir="ltr"><<a href="mailto:gael.varoquaux@normalesup.org" target="_blank">gael.varoquaux@normalesup.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Matching to minimize a cost is known as the linear assignment problem,<br>

can be solved in n^3 cost, and is implemented in scikit-learn in<br>

sklearn.utils.linear_<wbr>assignment_.linear_assignment or in recent versions<br>

of scipy as scipy.optimize.linear_sum_<wbr>assignment<br>

<br>

Of course, this problem will require much more coding (you need to build<br>

your pairwise cost matrix) and much more computing cost (n^3 instead of<br>

n^2) than a standard nearest-neighbor.<br>

<br>

GaÃ«l<br>

<div class="HOEnZb"><div class="h5"><br>

On Mon, Apr 02, 2018 at 01:47:51PM -0400, Randy Ellis wrote:<br>

> Hi Jake,<br>

<br>

> Thanks for the reply. Yes, trying this out resulted from looking for ways in<br>

> python to implement propensity score matching. I found a package, pscore_match<br>

> (<a href="http://www.kellieottoboni.com/pscore_match/" rel="noreferrer" target="_blank">http://www.kellieottoboni.<wbr>com/pscore_match/</a>), but the matching was really<br>

> terrible. Specifically, I'm matching based on age, race, gender, HIV status,<br>

> hepatitis C status, and sickle-cell disease status. Using NearestNeighbors for<br>

> matching performed WAY better, I was so surprised at how well every factor was<br>

> matched for. The only issue is that it uses replacement.Â <br>

<br>

> Here's what I'm currently testing. I need each case to match to 20 controls, so<br>

> since NearestNeighbors uses replacement, I'm matching each case to many<br>

> controls (15000), taking all of the distances for all of the pairs, and<br>

> retaining only the smallest distances for each control. Since many controls are<br>

> re-used (since the algorithm uses replacement), the hope is that enough<br>

> controls are matched to many different cases so that each case ends up being<br>

> matched to 20 unique controls. Does this method make sense??<br>

<br>

> Best,<br>

<br>

> RandyÂ Â <br>

<br>

> On Sun, Apr 1, 2018 at 10:13 PM, Jacob Vanderplas <<a href="mailto:jakevdp@cs.washington.edu">jakevdp@cs.washington.edu</a>><br>

> wrote:<br>

<br>

>Â  Â  Â On Sun, Apr 1, 2018 at 6:36 PM, Randy Ellis <<a href="mailto:randalljellis@gmail.com">randalljellis@gmail.com</a>><br>

>Â  Â  Â wrote:<br>

<br>

>Â  Â  Â  Â  Â Hello to the Scikit-learn community!<br>

<br>

>Â  Â  Â  Â  Â I am doing case-control matching for an electronic health records<br>

>Â  Â  Â  Â  Â study. My question is, is it possible to run Sklearn's NearestNeighbors<br>

>Â  Â  Â  Â  Â function without replacement? As in, match the treated group to the<br>

>Â  Â  Â  Â  Â untreated group without re-using any of the untreated group data<br>

>Â  Â  Â  Â  Â points? If so, how? By default, it uses replacement. I know this<br>

>Â  Â  Â  Â  Â because I tested it on some data of mine.<br>

<br>

>Â  Â  Â  Â  Â The code I used is in the confirmed answer here: https://<br>

>Â  Â  Â  Â  Â <a href="http://stats.stackexchange.com/questions/206832/matched-pai" rel="noreferrer" target="_blank">stats.stackexchange.com/<wbr>questions/206832/matched-pai</a><br>

>Â  Â  Â  Â  Â rs-in-python-propensity-score-<wbr>matching<br>

<br>

>Â  Â  Â  Â  Â Thanks so much in advance,<br>

<br>

<br>

>Â  Â  Â No, pairwise matching without replacement is not implemented within<br>

>Â  Â  Â scikit-learn's nearest neighbors routines.<br>

<br>

>Â  Â  Â It seems like an algorithm you would have to think carefully about because<br>

>Â  Â  Â the number of potential pairs grows exponentially with the number of<br>

>Â  Â  Â points, and I don't think it's true that choosing the nearest available<br>

>Â  Â  Â neighbor of points in sequence will guarantee you to find the optimal<br>

>Â  Â  Â configuration. You'd also have to carefully define what you mean by<br>

>Â  Â  Â "optimal"... are you seeking to minimize the sum of all distances? The sum<br>

>Â  Â  Â of squared distances? The maximum distance? The results would change<br>

>Â  Â  Â depending on the metric you define. And you'd probably have to figure out<br>

>Â  Â  Â some way to reduce the exponential search space in order to calculate the<br>

>Â  Â  Â result in a reasonable amount of time for your data.<br>

<br>

>Â  Â  Â You might look into the literature on propensity score matching; I think<br>

>Â  Â  Â that's one area where this kind of neighbors-without-replacement algorithm<br>

>Â  Â  Â is often used.<br>

<br>

>Â  Â  Â Best,<br>

>Â  Â  Â Â  Â Jake<br>

>Â  Â  Â Â <br>

--<br>

</div></div><span class="HOEnZb"><font color="#888888">Â  Â  Gael Varoquaux<br>

Â  Â  Senior Researcher, INRIA Parietal<br>

Â  Â  NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France<br>

Â  Â  Phone:Â  <a href="tel:%2B%2B%2033-1-69-08-79-68" value="+33169087968">++ 33-1-69-08-79-68</a><br>

Â  Â  <a href="http://gael-varoquaux.info" rel="noreferrer" target="_blank">http://gael-varoquaux.info</a>Â  Â  Â  Â  Â  Â  <a href="http://twitter.com/GaelVaroquaux" rel="noreferrer" target="_blank">http://twitter.com/<wbr>GaelVaroquaux</a><br>

</font></span><div class="HOEnZb"><div class="h5">______________________________<wbr>_________________<br>

scikit-learn mailing list<br>

<a href="mailto:scikit-learn@python.org">scikit-learn@python.org</a><br>

<a href="https://mail.python.org/mailman/listinfo/scikit-learn" rel="noreferrer" target="_blank">https://mail.python.org/<wbr>mailman/listinfo/scikit-learn</a><br>

</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div style="color:rgb(0,0,0);font-family:Tahoma;font-size:13px"><div style="font-family:arial,sans-serif"><font face="arial, helvetica, sans-serif"><span style="font-size:13.3333px"><b>Randall J. Ellis, B.S.</b></span></font></div><div style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:12.8px"><font face="arial, helvetica, sans-serif"><font style="color:rgb(0,0,0);font-size:13px">PhD Student, Biomedical Science, Mount Sinai<br style="font-size:13.3333px"></font><span style="color:rgb(0,0,0);font-size:13.3333px">Special Volunteer,Â </span><font color="#000000"><span style="font-size:13.3333px"><a href="http://www.michaelideslab.org/" target="_blank">http://www.michaelideslab.org/</a>, NIDA IRP</span></font></font></div><div style="font-family:arial,sans-serif"><span style="font-size:13.3333px"><font face="arial, helvetica, sans-serif">Cell: (954)-260-9891</font></span></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div>

</div>