Hello to the Scikit-learn community!

I am doing case-control matching for an electronic health records study. My question is, is it possible to run Sklearn's NearestNeighbors function without replacement? As in, match the treated group to the untreated group without re-using any of the untreated group data points? If so, how? By default, it uses replacement. I know this because I tested it on some data of mine.

The code I used is in the confirmed answer here: https://stats.stackexchange.com/questions/206832/matched-pairs-in-python-propensity-score-matching

Thanks so much in advance,

--
*Randall J. Ellis, B.S.*
PhD Student, Biomedical Science, Mount Sinai
Special Volunteer, http://www.michaelideslab.org/, NIDA IRP
Cell: (954)-260-9891
On Sun, Apr 1, 2018 at 6:36 PM, Randy Ellis <randalljellis@gmail.com> wrote:
[...]
No, pairwise matching without replacement is not implemented within scikit-learn's nearest neighbors routines.

It seems like an algorithm you would have to think carefully about because the number of potential pairs grows exponentially with the number of points, and I don't think it's true that choosing the nearest available neighbor of points in sequence will guarantee that you find the optimal configuration. You'd also have to carefully define what you mean by "optimal"... are you seeking to minimize the sum of all distances? The sum of squared distances? The maximum distance? The results would change depending on the metric you define. And you'd probably have to figure out some way to reduce the exponential search space in order to calculate the result in a reasonable amount of time for your data.

You might look into the literature on propensity score matching; I think that's one area where this kind of neighbors-without-replacement algorithm is often used.

Best,
Jake
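To make the point about sequential matching concrete, here is a minimal sketch on a toy 1-D example. The data values and the helper names (`greedy_match`, `brute_force_match`, `total`) are invented for illustration, and "optimal" is assumed to mean the smallest sum of absolute distances. The brute-force search over permutations is only feasible for tiny inputs.

```python
from itertools import permutations

import numpy as np

# Toy covariate values, invented for illustration.
cases = np.array([1.0, 2.0])
controls = np.array([1.9, 0.0])

# cost[i, j] = |case i - control j|
cost = np.abs(cases[:, None] - controls[None, :])


def greedy_match(cost):
    """Match each case, in row order, to its nearest still-unused control."""
    available = set(range(cost.shape[1]))
    pairs = []
    for i in range(cost.shape[0]):
        j = min(available, key=lambda c: cost[i, c])
        available.remove(j)
        pairs.append((i, j))
    return pairs


def brute_force_match(cost):
    """Try every one-to-one pairing and keep the one with the smallest total."""
    n_cases, n_controls = cost.shape
    best = min(
        permutations(range(n_controls), n_cases),
        key=lambda p: sum(cost[i, j] for i, j in enumerate(p)),
    )
    return list(enumerate(best))


def total(cost, pairs):
    """Sum of distances over the matched pairs."""
    return float(sum(cost[i, j] for i, j in pairs))


greedy_pairs = greedy_match(cost)
optimal_pairs = brute_force_match(cost)
greedy_total = total(cost, greedy_pairs)
optimal_total = total(cost, optimal_pairs)
print(greedy_pairs, greedy_total)
print(optimal_pairs, optimal_total)
```

In this toy case the first case grabs the control that the second case needed much more, so the sequential result has a larger total distance than the best pairing.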
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
Hi Jake,

Thanks for the reply. Yes, trying this out resulted from looking for ways in Python to implement propensity score matching. I found a package, pscore_match (http://www.kellieottoboni.com/pscore_match/), but the matching was really terrible. Specifically, I'm matching based on age, race, gender, HIV status, hepatitis C status, and sickle-cell disease status. Using NearestNeighbors for matching performed WAY better; I was so surprised at how well every factor was matched. The only issue is that it uses replacement.

Here's what I'm currently testing. I need each case to match to 20 controls, so since NearestNeighbors uses replacement, I'm matching each case to many controls (15000), taking all of the distances for all of the pairs, and retaining only the smallest distance for each control. Since many controls are re-used, the hope is that enough controls are matched to many different cases so that each case ends up being matched to 20 unique controls. Does this method make sense?

Best,
Randy

On Sun, Apr 1, 2018 at 10:13 PM, Jacob Vanderplas <jakevdp@cs.washington.edu> wrote:
[...]
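The over-query heuristic described in this message could be sketched roughly as follows. This is one possible reading of it, not the code actually used in the study: the function name `match_unique_controls`, its parameters, and the random data are all hypothetical. Each case asks for `n_candidates` neighbors, each control is kept only for the case it is closest to, and each case then keeps at most `n_matches` of its surviving controls.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def match_unique_controls(X_cases, X_controls, n_matches, n_candidates):
    """Heuristic matching in which every control is used at most once."""
    n_cases = X_cases.shape[0]
    nn = NearestNeighbors(n_neighbors=n_candidates).fit(X_controls)
    dist, idx = nn.kneighbors(X_cases)  # both have shape (n_cases, n_candidates)

    # Flatten to one row per candidate (case, control, distance) pair.
    case_ids = np.repeat(np.arange(n_cases), n_candidates)
    control_ids = idx.ravel()
    dists = dist.ravel()

    # Sort pairs by distance; the first time a control appears is its closest case.
    order = np.argsort(dists, kind="stable")
    _, first = np.unique(control_ids[order], return_index=True)
    keep = order[first]

    # Hand out the surviving pairs, closest first, up to n_matches per case.
    matches = {i: [] for i in range(n_cases)}
    for p in keep[np.argsort(dists[keep], kind="stable")]:
        case = int(case_ids[p])
        if len(matches[case]) < n_matches:
            matches[case].append(int(control_ids[p]))
    return matches


# Random stand-in data; the real covariates are not part of the thread.
rng = np.random.default_rng(0)
X_cases = rng.normal(size=(5, 2))
X_controls = rng.normal(size=(200, 2))
matches = match_unique_controls(X_cases, X_controls, n_matches=3, n_candidates=50)
```

Nothing in this sketch guarantees that every case ends up with the full `n_matches` controls: a case whose candidates are all closer to some other case can come up short, which is worth checking on the real data.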
Hi Randy,

I think that approach is probably a good heuristic, but it will not necessarily find the optimal result. That said, if you don't care about having guarantees that you're finding the optimal pairing, but only that you can find a reasonable set of pairs, it will probably work out fine.

Jake

Jake VanderPlas
Senior Data Science Fellow
Director of Open Software
University of Washington eScience Institute

On Mon, Apr 2, 2018 at 10:47 AM, Randy Ellis <randalljellis@gmail.com> wrote:
[...]
Hi Jake,

Thank you for the feedback. Yeah, working without replacement, certain cases are going to get more appropriate matches than others. I proposed the idea of using replacement and compensating for the re-use of controls with frequency weighting, but you gotta do what your PI tells you sometimes! :P

Best,
Randy

On Mon, Apr 2, 2018 at 2:15 PM, Jacob Vanderplas <jakevdp@cs.washington.edu> wrote:
[...]
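For comparison, the alternative mentioned in this message (matching with replacement, then weighting re-used controls) might look something like the sketch below. The thread does not say how the weights would enter the downstream analysis, so this only shows one assumed definition: a control's frequency weight is the number of cases that selected it. The data and variable names are invented.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Random stand-in data; the real covariates are not part of the thread.
rng = np.random.default_rng(0)
X_cases = rng.normal(size=(10, 3))
X_controls = rng.normal(size=(40, 3))
n_matches = 4

# Plain nearest-neighbor matching: controls may be shared between cases.
nn = NearestNeighbors(n_neighbors=n_matches).fit(X_controls)
_, idx = nn.kneighbors(X_cases)  # idx[i] holds the nearest controls of case i

# Frequency weight of a control = how many cases picked it (0 if never used).
weights = np.bincount(idx.ravel(), minlength=X_controls.shape[0])
used_controls = np.flatnonzero(weights)
```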
Matching to minimize a cost is known as the linear assignment problem. It can be solved at n^3 cost, and is implemented in scikit-learn in sklearn.utils.linear_assignment_.linear_assignment or in recent versions of scipy as scipy.optimize.linear_sum_assignment.

Of course, this problem will require much more coding (you need to build your pairwise cost matrix) and much more computing cost (n^3 instead of n^2) than a standard nearest-neighbor.

Gaël

On Mon, Apr 02, 2018 at 01:47:51PM -0400, Randy Ellis wrote:
[...]
-- Gael Varoquaux Senior Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux
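A minimal sketch of the linear assignment approach suggested in this message, using only the SciPy function named there. The random data, the Euclidean metric (the `cdist` default), and the variable names are assumptions for illustration; the cost to minimize is taken to be the sum of pairwise distances.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# Random stand-in data; the real covariates are not part of the thread.
rng = np.random.default_rng(0)
X_cases = rng.normal(size=(5, 3))
X_controls = rng.normal(size=(30, 3))

# Pairwise cost matrix: cost[i, j] is the distance from case i to control j.
cost = cdist(X_cases, X_controls)

# One control per case, no control used twice, minimum total distance.
case_idx, control_idx = linear_sum_assignment(cost)
total_distance = cost[case_idx, control_idx].sum()
```

The cost matrix is rectangular here (fewer cases than controls), in which case every case gets a control and the leftover controls stay unmatched.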
Thanks Dr. Varoquaux, it’s awesome you’re on this list, I’m a fan of your work! Will look into this strategy.

Best,
Randy

On Tue, Apr 3, 2018 at 8:57 AM Gael Varoquaux <gael.varoquaux@normalesup.org> wrote:
[...]
Hi Dr. Varoquaux,

It seems like the SciPy function only assigns one row to one column. I need to assign 20 controls to each case. Does the linear_sum_assignment function, since it assigns unique pairs, depend on the order of the rows and columns? If so, perhaps I could shuffle and then combine the pairs together until each case has 20 unique controls.

Any thoughts on this are greatly appreciated.

Best,
Randy

On Tue, Apr 3, 2018 at 8:57 AM, Gael Varoquaux <gael.varoquaux@normalesup.org> wrote:
[...]
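One way to get several controls per case out of a one-to-one solver, sketched here as an assumption rather than something proposed in the thread, is to repeat each case's row of the cost matrix k times before solving. Each copy is then assigned a different control, so every case receives k distinct controls. The function name `match_k_controls` and the random data are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist


def match_k_controls(X_cases, X_controls, k):
    """Assign k distinct controls to every case, minimizing total distance."""
    n_cases = X_cases.shape[0]
    if n_cases * k > X_controls.shape[0]:
        raise ValueError("not enough controls for k unique matches per case")

    cost = cdist(X_cases, X_controls)

    # Row r of the expanded matrix is a copy of the row for case r // k.
    expanded = np.repeat(cost, k, axis=0)
    rows, cols = linear_sum_assignment(expanded)

    matches = [[] for _ in range(n_cases)]
    for r, c in zip(rows, cols):
        matches[r // k].append(int(c))
    return matches


# Random stand-in data; the real covariates are not part of the thread.
rng = np.random.default_rng(0)
X_cases = rng.normal(size=(4, 2))
X_controls = rng.normal(size=(30, 2))
matches = match_k_controls(X_cases, X_controls, k=3)
```

On the ordering question: the minimum total cost does not depend on the order of rows or columns, although when several pairings tie for that minimum the particular pairs returned can differ. Note that the expanded matrix has k times as many rows, so with 20 controls per case and many thousands of controls its memory and run time may become the limiting factor.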
participants (3)
- Gael Varoquaux
- Jacob Vanderplas
- Randy Ellis