<div dir="ltr">Hi Dr. Varoquaux, <div><br></div><div>It seems like the SciPy function only assigns one row to one column. I need to assign 20 controls to each case. Does the linear_sum_assignment function, since it assigns unique pairs, depend on the order of the rows and columns? If so, perhaps I could shuffle and then combine the pairs together until each case has 20 unique controls.</div><div><br></div><div>Any thoughts on this are greatly appreciated.</div><div><br></div><div>Best,</div><div><br></div><div>Randy </div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Apr 3, 2018 at 8:57 AM, Gael Varoquaux <span dir="ltr"><<a href="mailto:gael.varoquaux@normalesup.org" target="_blank">gael.varoquaux@normalesup.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Matching to minimize a cost is known as the linear assignment problem,<br>
can be solved in n^3 cost, and is implemented in scikit-learn in<br>
sklearn.utils.linear_<wbr>assignment_.linear_assignment or in recent versions<br>
of scipy as scipy.optimize.linear_sum_<wbr>assignment<br>
<br>
Of course, this problem will require much more coding (you need to build<br>
your pairwise cost matrix) and much more computing cost (n^3 instead of<br>
n^2) than a standard nearest-neighbor.<br>
<br>
Gaël<br>
<div class="HOEnZb"><div class="h5"><br>
On Mon, Apr 02, 2018 at 01:47:51PM -0400, Randy Ellis wrote:<br>
> Hi Jake,<br>
<br>
> Thanks for the reply. Yes, trying this out resulted from looking for ways in<br>
> python to implement propensity score matching. I found a package, pscore_match<br>
> (<a href="http://www.kellieottoboni.com/pscore_match/" rel="noreferrer" target="_blank">http://www.kellieottoboni.<wbr>com/pscore_match/</a>), but the matching was really<br>
> terrible. Specifically, I'm matching based on age, race, gender, HIV status,<br>
> hepatitis C status, and sickle-cell disease status. Using NearestNeighbors for<br>
> matching performed WAY better, I was so surprised at how well every factor was<br>
> matched for. The only issue is that it uses replacement. <br>
<br>
> Here's what I'm currently testing. I need each case to match to 20 controls, so<br>
> since NearestNeighbors uses replacement, I'm matching each case to many<br>
> controls (15000), taking all of the distances for all of the pairs, and<br>
> retaining only the smallest distances for each control. Since many controls are<br>
> re-used (since the algorithm uses replacement), the hope is that enough<br>
> controls are matched to many different cases so that each case ends up being<br>
> matched to 20 unique controls. Does this method make sense??<br>
<br>
> Best,<br>
<br>
> Randy  <br>
<br>
> On Sun, Apr 1, 2018 at 10:13 PM, Jacob Vanderplas <<a href="mailto:jakevdp@cs.washington.edu">jakevdp@cs.washington.edu</a>><br>
> wrote:<br>
<br>
>Â Â Â On Sun, Apr 1, 2018 at 6:36 PM, Randy Ellis <<a href="mailto:randalljellis@gmail.com">randalljellis@gmail.com</a>><br>
>Â Â Â wrote:<br>
<br>
>Â Â Â Â Â Hello to the Scikit-learn community!<br>
<br>
>Â Â Â Â Â I am doing case-control matching for an electronic health records<br>
>Â Â Â Â Â study. My question is, is it possible to run Sklearn's NearestNeighbors<br>
>Â Â Â Â Â function without replacement? As in, match the treated group to the<br>
>Â Â Â Â Â untreated group without re-using any of the untreated group data<br>
>Â Â Â Â Â points? If so, how? By default, it uses replacement. I know this<br>
>Â Â Â Â Â because I tested it on some data of mine.<br>
<br>
>Â Â Â Â Â The code I used is in the confirmed answer here: https://<br>
>Â Â Â Â Â <a href="http://stats.stackexchange.com/questions/206832/matched-pai" rel="noreferrer" target="_blank">stats.stackexchange.com/<wbr>questions/206832/matched-pai</a><br>
>Â Â Â Â Â rs-in-python-propensity-score-<wbr>matching<br>
<br>
>Â Â Â Â Â Thanks so much in advance,<br>
<br>
<br>
>Â Â Â No, pairwise matching without replacement is not implemented within<br>
>Â Â Â scikit-learn's nearest neighbors routines.<br>
<br>
>Â Â Â It seems like an algorithm you would have to think carefully about because<br>
>Â Â Â the number of potential pairs grows exponentially with the number of<br>
>Â Â Â points, and I don't think it's true that choosing the nearest available<br>
>Â Â Â neighbor of points in sequence will guarantee you to find the optimal<br>
>Â Â Â configuration. You'd also have to carefully define what you mean by<br>
>Â Â Â "optimal"... are you seeking to minimize the sum of all distances? The sum<br>
>Â Â Â of squared distances? The maximum distance? The results would change<br>
>Â Â Â depending on the metric you define. And you'd probably have to figure out<br>
>Â Â Â some way to reduce the exponential search space in order to calculate the<br>
>Â Â Â result in a reasonable amount of time for your data.<br>
<br>
>Â Â Â You might look into the literature on propensity score matching; I think<br>
>Â Â Â that's one area where this kind of neighbors-without-replacement algorithm<br>
>Â Â Â is often used.<br>
<br>
>Â Â Â Best,<br>
>Â Â Â Â Â Jake<br>
>Â Â Â Â <br>
--<br>
</div></div><span class="HOEnZb"><font color="#888888">Â Â Gael Varoquaux<br>
  Senior Researcher, INRIA Parietal<br>
  NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France<br>
  Phone: <a href="tel:%2B%2B%2033-1-69-08-79-68" value="+33169087968">++ 33-1-69-08-79-68</a><br>
  <a href="http://gael-varoquaux.info" rel="noreferrer" target="_blank">http://gael-varoquaux.info</a>      <a href="http://twitter.com/GaelVaroquaux" rel="noreferrer" target="_blank">http://twitter.com/<wbr>GaelVaroquaux</a><br>
</font></span><div class="HOEnZb"><div class="h5">______________________________<wbr>_________________<br>
scikit-learn mailing list<br>
<a href="mailto:scikit-learn@python.org">scikit-learn@python.org</a><br>
<a href="https://mail.python.org/mailman/listinfo/scikit-learn" rel="noreferrer" target="_blank">https://mail.python.org/<wbr>mailman/listinfo/scikit-learn</a><br>
</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div style="color:rgb(0,0,0);font-family:Tahoma;font-size:13px"><div style="font-family:arial,sans-serif"><font face="arial, helvetica, sans-serif"><span style="font-size:13.3333px"><b>Randall J. Ellis, B.S.</b></span></font></div><div style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:12.8px"><font face="arial, helvetica, sans-serif"><font style="color:rgb(0,0,0);font-size:13px">PhD Student, Biomedical Science, Mount Sinai<br style="font-size:13.3333px"></font><span style="color:rgb(0,0,0);font-size:13.3333px">Special Volunteer, </span><font color="#000000"><span style="font-size:13.3333px"><a href="http://www.michaelideslab.org/" target="_blank">http://www.michaelideslab.org/</a>, NIDA IRP</span></font></font></div><div style="font-family:arial,sans-serif"><span style="font-size:13.3333px"><font face="arial, helvetica, sans-serif">Cell: (954)-260-9891</font></span></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div>
</div>