[scikit-learn] NearestNeighbors without replacement
Randy Ellis
randalljellis at gmail.com
Mon Apr 2 14:18:28 EDT 2018
Hi Jake,
Thank you for the feedback. Yeah, when working without replacement, certain
cases are going to get more appropriate matches than others. I proposed
using replacement and compensating for the re-use of controls with
frequency weighting, but you gotta do what your PI tells you sometimes! :P
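
For reference, here is a rough sketch of what matching with replacement plus frequency weighting could look like. The data, the variable names, and the weighting convention (each control weighted by the number of times it is used, divided by the number of controls per case) are assumptions for illustration, not something settled in this thread.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical covariate matrices; real data would be the encoded
# matching variables (age, race, gender, ...).
rng = np.random.default_rng(0)
cases = rng.normal(size=(5, 3))
controls = rng.normal(size=(50, 3))
k = 4  # controls per case (20 in the study described here)

# Plain k-nearest-neighbor matching: the same control may be returned
# for several cases, i.e. matching WITH replacement.
nn = NearestNeighbors(n_neighbors=k).fit(controls)
dist, idx = nn.kneighbors(cases)  # idx[i] holds the k nearest controls of case i

# Number of times each control was selected across all cases.
reuse = np.bincount(idx.ravel(), minlength=len(controls))

# Assumed convention: a control used m times gets weight m / k, so the
# control weights sum to the number of cases.
control_weight = reuse / k
```

Controls with weight 0 were never matched and would simply drop out of a weighted analysis.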
Best,
Randy
On Mon, Apr 2, 2018 at 2:15 PM, Jacob Vanderplas <jakevdp at cs.washington.edu>
wrote:
> Hi Randy,
> I think that approach is probably a good heuristic, but it will not
> necessarily find the optimal result. That said, if you don't care about
> having guarantees that you're finding the optimal pairing, but only that
> you can find a reasonable set of pairs, it will probably work out fine.
> Jake
>
> Jake VanderPlas
> Senior Data Science Fellow
> Director of Open Software
> University of Washington eScience Institute
>
> On Mon, Apr 2, 2018 at 10:47 AM, Randy Ellis <randalljellis at gmail.com>
> wrote:
>
>> Hi Jake,
>>
>> Thanks for the reply. Yes, I tried this while looking for ways to
>> implement propensity score matching in Python. I found a package,
>> pscore_match (http://www.kellieottoboni.com/pscore_match/), but the
>> matching was really terrible. Specifically, I'm matching based on age,
>> race, gender, HIV status, hepatitis C status, and sickle-cell disease
>> status. Using NearestNeighbors for matching performed WAY better; I was
>> so surprised at how well every factor was matched. The only issue is
>> that it uses replacement.
>>
>> Here's what I'm currently testing. I need each case to match to 20
>> controls, so since NearestNeighbors uses replacement, I'm matching each
>> case to many controls (15000), taking the distances for all of the
>> pairs, and retaining only the smallest distance for each control. Since
>> many controls are re-used, the hope is that enough controls are matched
>> to many different cases so that each case ends up being matched to 20
>> unique controls. Does this method make sense?
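
A minimal sketch of the heuristic described above, under one reading of it: over-query neighbors for every case, keep each control only for the case it is closest to, then keep up to k controls per case. The function name `match_keep_closest_case`, its parameters, and the toy data are hypothetical. Nothing here guarantees that every case ends up with k controls.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def match_keep_closest_case(cases, controls, k, n_candidates):
    """Heuristic matching without replacement (hypothetical helper)."""
    # Over-query: each case gets n_candidates neighbors, with replacement.
    nn = NearestNeighbors(n_neighbors=n_candidates).fit(controls)
    dist, idx = nn.kneighbors(cases)

    # Flatten to one (case, control, distance) triple per candidate pair.
    case_ids = np.repeat(np.arange(len(cases)), n_candidates)
    ctrl_ids = idx.ravel()
    d = dist.ravel()

    # Sort pairs by distance; np.unique then reports the first (closest)
    # occurrence of every control, so each control is kept exactly once.
    order = np.argsort(d, kind="stable")
    _, first = np.unique(ctrl_ids[order], return_index=True)
    keep = order[first]

    # Hand out the retained controls, closest pairs first, up to k per case.
    matches = {i: [] for i in range(len(cases))}
    for j in keep[np.argsort(d[keep], kind="stable")]:
        case = int(case_ids[j])
        if len(matches[case]) < k:
            matches[case].append(int(ctrl_ids[j]))
    return matches

# Toy data standing in for the real covariates.
rng = np.random.default_rng(0)
cases = rng.normal(size=(5, 3))
controls = rng.normal(size=(200, 3))
matches = match_keep_closest_case(cases, controls, k=3, n_candidates=50)
```

Checking `len(matches[i])` for every case shows whether the over-query was large enough; cases in dense regions of the control pool tend to fill up first.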
>>
>> Best,
>>
>> Randy
>>
>> On Sun, Apr 1, 2018 at 10:13 PM, Jacob Vanderplas <
>> jakevdp at cs.washington.edu> wrote:
>>
>>> On Sun, Apr 1, 2018 at 6:36 PM, Randy Ellis <randalljellis at gmail.com>
>>> wrote:
>>>
>>>> Hello to the Scikit-learn community!
>>>>
>>>> I am doing case-control matching for an electronic health records
>>>> study. Is it possible to run scikit-learn's NearestNeighbors without
>>>> replacement, i.e., to match the treated group to the untreated group
>>>> without re-using any of the untreated data points? If so, how? By
>>>> default it matches with replacement; I confirmed this by testing it on
>>>> some of my own data.
>>>>
>>>> The code I used is in the confirmed answer here:
>>>> https://stats.stackexchange.com/questions/206832/matched-pairs-in-python-propensity-score-matching
>>>>
>>>> Thanks so much in advance,
>>>>
>>>
>>> No, pairwise matching without replacement is not implemented within
>>> scikit-learn's nearest neighbors routines.
>>>
>>> It seems like an algorithm you would have to think carefully about,
>>> because the number of potential pairings grows combinatorially with the
>>> number of points, and I don't think it's true that choosing the nearest
>>> available neighbor of each point in sequence is guaranteed to find the
>>> optimal configuration. You'd also have to carefully define what you mean
>>> by "optimal"... are you seeking to minimize the sum of all distances? The
>>> sum of squared distances? The maximum distance? The results would change
>>> depending on the metric you define. And you'd probably have to figure out
>>> some way to reduce the combinatorial search space in order to calculate
>>> the result in a reasonable amount of time for your data.
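
To make the point about greedy matching concrete, here is a small sketch comparing nearest-available matching with an optimal matching. It assumes "optimal" means the minimum sum of distances; for that objective the task is the linear assignment problem, which SciPy's `linear_sum_assignment` solves without enumerating pairings. This is a technique swapped in for illustration, not something proposed in the thread, and the helper names and toy points are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def optimal_match(cases, controls, k=1):
    """Match k distinct controls to each case, minimizing total distance."""
    cost = cdist(cases, controls)        # Euclidean distances
    cost_k = np.repeat(cost, k, axis=0)  # k copies of each case row
    rows, cols = linear_sum_assignment(cost_k)
    return rows // k, cols, cost_k[rows, cols].sum()

def greedy_match(cases, controls):
    """Give each case, in order, its nearest still-available control."""
    cost = cdist(cases, controls)
    available = np.ones(len(controls), dtype=bool)
    chosen = []
    for i in range(len(cases)):
        # argmin over the available subset, mapped back to a control index
        j = np.flatnonzero(available)[np.argmin(cost[i, available])]
        available[j] = False
        chosen.append(j)
    return np.array(chosen), cost[np.arange(len(cases)), chosen].sum()

# Toy 1-D example where greedy is worse: case 0 grabs the control that
# case 1 needed much more.
cases = np.array([[0.0], [1.0]])
controls = np.array([[0.9], [-1.0]])

case_ids, ctrl_ids, optimal_total = optimal_match(cases, controls)
greedy_ctrl, greedy_total = greedy_match(cases, controls)
```

A different objective, such as the sum of squared distances, can be handled by transforming the cost matrix (for example `cost ** 2`) before the assignment step. Minimizing the maximum distance is a different problem that this solver does not address directly.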
>>>
>>> You might look into the literature on propensity score matching; I think
>>> that's one area where this kind of neighbors-without-replacement algorithm
>>> is often used.
>>>
>>> Best,
>>> Jake
>>>
>>>
>>>>
>>>> --
>>>> *Randall J. Ellis, B.S.*
>>>> PhD Student, Biomedical Science, Mount Sinai
>>>> Special Volunteer, http://www.michaelideslab.org/, NIDA IRP
>>>> Cell: (954)-260-9891
>>>>
>>>> _______________________________________________
>>>> scikit-learn mailing list
>>>> scikit-learn at python.org
>>>> https://mail.python.org/mailman/listinfo/scikit-learn
--
*Randall J. Ellis, B.S.*
PhD Student, Biomedical Science, Mount Sinai
Special Volunteer, http://www.michaelideslab.org/, NIDA IRP
Cell: (954)-260-9891