Re: [scikit-learn] scikit-learn Digest, Vol 6, Issue 40
Hi David,

When applying train_test_split to the sample space, we have a single row per subject. I am looking for another function like train_test_split that can deal with pairs of rows (one pair per subject) without leading to a biased accuracy estimate. We are studying memory: for each subject we have one row of features for successful memory encoding and a second row for unsuccessful memory encoding. The target is 1 for successful and 0 for unsuccessful encoding, respectively. How do you recommend splitting this data set in order to get a reasonable, unbiased accuracy?

Thanks,
Afarin

________________________________________
From: scikit-learn <scikit-learn-bounces+afarin.famili=utsouthwestern.edu@python.org> on behalf of scikit-learn-request@python.org <scikit-learn-request@python.org>
Sent: Monday, September 26, 2016 2:43 PM
To: scikit-learn@python.org
Subject: scikit-learn Digest, Vol 6, Issue 40

Send scikit-learn mailing list submissions to scikit-learn@python.org

To subscribe or unsubscribe via the World Wide Web, visit https://mail.python.org/mailman/listinfo/scikit-learn or, via email, send a message with subject or body 'help' to scikit-learn-request@python.org

You can reach the person managing the list at scikit-learn-owner@python.org

When replying, please edit your Subject line so it is more specific than "Re: Contents of scikit-learn digest..."

Today's Topics:

1. header intact (Afarin Famili)
2. Is there a built-in function for pairs of data? (Afarin Famili)
3. Re: Is there a built-in function for pairs of data? (Pedro Pazzini)
4. Re: Is there a built-in function for pairs of data? (David Nicholson)
5. Large computation time for homogeneous data with agglomerative clustering (Md. Khairullah)

----------------------------------------------------------------------

Message: 1
Date: Mon, 26 Sep 2016 18:03:27 +0000
From: Afarin Famili <Afarin.Famili@UTSouthwestern.edu>
To: "scikit-learn@python.org" <scikit-learn@python.org>
Subject: [scikit-learn] header intact
Message-ID: <1474913007611.80841@UTSouthwestern.edu>
Content-Type: text/plain; charset="iso-8859-1"

?

________________________________
UT Southwestern Medical Center
The future of medicine, today.
What if you split the data pairwise (i.e. X_success, X_fail, etc.) with subjects matched by row index, then run train_test_split on each one with the same random_state?

Naoya Kanai
Sent from https://polymail.io/

On Mon, Sep 26, 2016 at 2:06 PM Afarin Famili <Afarin.Famili@utsouthwestern.edu> wrote:
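Naoya's pairwise-split idea could be sketched as follows. This is a minimal, hypothetical example: `X_success` and `X_fail` are illustrative arrays with one row per subject, matched by row index. Because `train_test_split` shuffles with a permutation that depends only on the number of rows and the seed, reusing the same `random_state` selects the same subjects in both splits, so no subject ends up with one row in train and the other in test.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: one row per subject in each array, matched by index.
rng = np.random.RandomState(0)
n_subjects, n_features = 20, 5
X_success = rng.randn(n_subjects, n_features)  # row i = subject i, successful encoding
X_fail = rng.randn(n_subjects, n_features)     # row i = subject i, unsuccessful encoding

# The same random_state yields the same permutation for both arrays,
# so the selected subjects match across the two splits.
Xs_tr, Xs_te = train_test_split(X_success, test_size=0.25, random_state=42)
Xf_tr, Xf_te = train_test_split(X_fail, test_size=0.25, random_state=42)

# Stack the matched halves with labels 1 (success) / 0 (fail).
X_train = np.vstack([Xs_tr, Xf_tr])
y_train = np.concatenate([np.ones(len(Xs_tr)), np.zeros(len(Xf_tr))])
X_test = np.vstack([Xs_te, Xf_te])
y_test = np.concatenate([np.ones(len(Xs_te)), np.zeros(len(Xf_te))])
```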
Hi Afarin,

You appear to be talking about a situation in which your dataset is divided into subsets where the data are highly correlated (but perhaps conditionally independent given the subject / group identifier). In scikit-learn 0.18 these might be called "grouped cross validation" strategies. See http://scikit-learn.org/dev/modules/cross_validation.html#cross-validation-i... (In earlier versions of scikit-learn, you will find the corresponding CV objects as LabelKFold, LeaveOneLabelOut, etc., but we decided to rename them for clarity when redesigning the CV objects and moving them to the new sklearn.model_selection subpackage.)

I hope that helps.

Joel

On 27 September 2016 at 07:06, Afarin Famili <Afarin.Famili@utsouthwestern.edu> wrote:
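Joel's pointer to grouped cross-validation could look like the following sketch (assuming scikit-learn 0.18+; the data are illustrative, with two rows per subject sharing one group id so that both rows always land on the same side of every split):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical data: two rows per subject, one successful (y=1), one not (y=0).
rng = np.random.RandomState(0)
n_subjects, n_features = 10, 4
X = rng.randn(2 * n_subjects, n_features)
y = np.tile([1, 0], n_subjects)
groups = np.repeat(np.arange(n_subjects), 2)  # subject id for every row

# GroupKFold keeps all rows of a subject together in each fold.
gkf = GroupKFold(n_splits=5)
folds = list(gkf.split(X, y, groups=groups))
for train_idx, test_idx in folds:
    # No subject appears on both sides of any split.
    assert not set(groups[train_idx]) & set(groups[test_idx])
```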
It's not really clear to me what you want to achieve. What do you mean by "does not lead to a biased accuracy"?

On 09/26/2016 05:06 PM, Afarin Famili wrote:
Afarin, can you please describe your full data set? Perhaps you are making a mistake in how you are setting up the data. My understanding of what Afarin is saying is that for each person there is one row for successes and one row for failures (though I cannot understand why only two rows; I would expect multiple rows for different feature configurations). So what Afarin wants to do is split by person rather than by row?

Sean

On Wed, Sep 28, 2016 at 5:26 PM, Andreas Mueller <t3kcit@gmail.com> wrote:
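Sean's idea of splitting by person rather than by row can be done by hand in a few lines (a hypothetical sketch with illustrative data; in scikit-learn 0.18+, GroupShuffleSplit achieves the same thing in one call): hold out whole subjects first, then take all of their rows.

```python
import numpy as np

# Hypothetical data: two rows per subject (success then failure).
rng = np.random.RandomState(0)
n_subjects = 12
X = rng.randn(2 * n_subjects, 3)
y = np.tile([1, 0], n_subjects)               # 1 = success, 0 = failure
groups = np.repeat(np.arange(n_subjects), 2)  # subject id for every row

# Pick held-out *subjects*, not rows, then select all of their rows.
test_subjects = rng.choice(n_subjects, size=3, replace=False)
test_mask = np.isin(groups, test_subjects)
X_train, y_train = X[~test_mask], y[~test_mask]
X_test, y_test = X[test_mask], y[test_mask]

# Every held-out subject contributes both rows; no subject is split
# across train and test.
assert set(groups[test_mask]).isdisjoint(groups[~test_mask])
```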
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
participants (5)
- Afarin Famili
- Andreas Mueller
- Joel Nothman
- Naoya Kanai
- Sean Violante