<div dir="ltr">Hi Afarin,<div><br></div><div>You appear to be describing a dataset that is divided into subsets within which the data are highly correlated (but perhaps conditionally independent given the subject / group identifier). In Scikit-learn 0.18 the splitters that handle this are called "grouped cross-validation" strategies: they keep all samples from one group on the same side of each split. See <a href="http://scikit-learn.org/dev/modules/cross_validation.html#cross-validation-iterators-for-grouped-data">http://scikit-learn.org/dev/modules/cross_validation.html#cross-validation-iterators-for-grouped-data</a>.</div><div><br></div><div>(In earlier versions of Scikit-learn, you will find the corresponding CV objects as LabelKFold, LeaveOneLabelOut, etc. We renamed them to GroupKFold, LeaveOneGroupOut, etc. for clarity when redesigning the CV objects and moving them to the new sklearn.model_selection subpackage.)</div><div><br></div><div>I hope that helps.</div><div><br></div><div>Joel</div></div><div class="gmail_extra"><br><div class="gmail_quote">On 27 September 2016 at 07:06, Afarin Famili <span dir="ltr"><<a href="mailto:Afarin.Famili@utsouthwestern.edu" target="_blank">Afarin.Famili@utsouthwestern.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi David,<br>
<br>
When applying train_test_split to the sample space, we have a single row per subject. I am looking for another function like train_test_split that can deal with pairs of rows (one pair per subject) without leading to a biased accuracy estimate. We are studying memory and have one row of features for successful memory encoding and a second row for unsuccessful memory encoding for each subject. Our target is 1 for successful and 0 for unsuccessful encoding.<br>
How do you recommend I split this data set in order to get a reasonable/unbiased accuracy?<br>
<br>
Thanks,<br>
Afarin<br>
<br>
<br>
<br>
______________________________<wbr>__________<br>
From: scikit-learn <scikit-learn-bounces+afarin.<wbr>famili=<a href="mailto:utsouthwestern.edu@python.org">utsouthwestern.edu@<wbr>python.org</a>> on behalf of <a href="mailto:scikit-learn-request@python.org">scikit-learn-request@python.<wbr>org</a> <<a href="mailto:scikit-learn-request@python.org">scikit-learn-request@python.<wbr>org</a>><br>
Sent: Monday, September 26, 2016 2:43 PM<br>
To: <a href="mailto:scikit-learn@python.org">scikit-learn@python.org</a><br>
Subject: scikit-learn Digest, Vol 6, Issue 40<br>
<br>
<br>
Today's Topics:<br>
<br>
   1. header intact (Afarin Famili)<br>
   2. Is there a built-in function for pairs of data? (Afarin Famili)<br>
   3. Re: Is there a built-in function for pairs of data?<br>
      (Pedro Pazzini)<br>
   4. Re: Is there a built-in function for pairs of data?<br>
      (David Nicholson)<br>
   5. Large computation time for homogeneous data with<br>
      agglomerative clustering (Md. Khairullah)<br>
<br>
<br>
------------------------------<wbr>------------------------------<wbr>----------<br>
<br>
Message: 1<br>
Date: Mon, 26 Sep 2016 18:03:27 +0000<br>
From: Afarin Famili <Afarin.Famili@UTSouthwestern.<wbr>edu><br>
To: "<a href="mailto:scikit-learn@python.org">scikit-learn@python.org</a>" <<a href="mailto:scikit-learn@python.org">scikit-learn@python.org</a>><br>
Subject: [scikit-learn] header intact<br>
Message-ID: <1474913007611.80841@<wbr>UTSouthwestern.edu><br>
Content-Type: text/plain; charset="iso-8859-1"<br>
<br>
?<br>
<br>
<br>
<br>
______________________________<wbr>__<br>
<br>
UT Southwestern<br>
<br>
<br>
Medical Center<br>
<br>
<br>
<br>
The future of medicine, today.<br>
<br>
<br>
------------------------------<br>
<br>
Message: 2<br>
Date: Mon, 26 Sep 2016 18:06:49 +0000<br>
From: Afarin Famili <Afarin.Famili@UTSouthwestern.<wbr>edu><br>
To: "<a href="mailto:scikit-learn@python.org">scikit-learn@python.org</a>" <<a href="mailto:scikit-learn@python.org">scikit-learn@python.org</a>><br>
Subject: [scikit-learn] Is there a built-in function for pairs of<br>
        data?<br>
Message-ID: <1474913209751.36283@<wbr>UTSouthwestern.edu><br>
Content-Type: text/plain; charset="iso-8859-1"<br>
<br>
<br>
Dear Scikit-learn team,<br>
<br>
<br>
We need to deal with pairs of data in our classification task. I was wondering if there is already a built-in function in Scikit-learn that can partition the pairs of data into train and test sets?<br>
<br>
<br>
Regards,<br>
<br>
Afarin<br>
<br>
<br>
<br>
<br>
------------------------------<br>
<br>
Message: 3<br>
Date: Mon, 26 Sep 2016 15:47:26 -0300<br>
From: Pedro Pazzini <<a href="mailto:pedropazzini@gmail.com">pedropazzini@gmail.com</a>><br>
To: Scikit-learn user and developer mailing list<br>
        <<a href="mailto:scikit-learn@python.org">scikit-learn@python.org</a>><br>
Subject: Re: [scikit-learn] Is there a built-in function for pairs of<br>
        data?<br>
Message-ID:<br>
        <CAAY8FkB2LjnegwFbn=<a href="mailto:gSOawLBcBQ3dnYa6BxDxN6-cvLT1RsfA@mail.gmail.com">gSOawLBcBQ<wbr>3dnYa6BxDxN6-cvLT1RsfA@mail.<wbr>gmail.com</a>><br>
Content-Type: text/plain; charset="utf-8"<br>
<br>
Like this?:<br>
<a href="http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html" rel="noreferrer" target="_blank">http://scikit-learn.org/<wbr>stable/modules/generated/<wbr>sklearn.cross_validation.<wbr>train_test_split.html</a><br>
<br>
2016-09-26 15:06 GMT-03:00 Afarin Famili <<a href="mailto:Afarin.Famili@utsouthwestern.edu">Afarin.Famili@utsouthwestern.<wbr>edu</a>>:<br>
<br>
><br>
> Dear Scikit-learn team,<br>
><br>
><br>
> We need to deal with pairs of data in our classification task. I was<br>
> wondering if there is already a built-in function in Scikit-learn that can<br>
> partition the pairs of data into train and test sets?<br>
><br>
><br>
> Regards,<br>
><br>
> Afarin<br>
><br>
><br>
><br>
> ______________________________<wbr>_________________<br>
> scikit-learn mailing list<br>
> <a href="mailto:scikit-learn@python.org">scikit-learn@python.org</a><br>
> <a href="https://mail.python.org/mailman/listinfo/scikit-learn" rel="noreferrer" target="_blank">https://mail.python.org/<wbr>mailman/listinfo/scikit-learn</a><br>
><br>
><br>
<br>
------------------------------<br>
<br>
Message: 4<br>
Date: Mon, 26 Sep 2016 14:53:05 -0400<br>
From: David Nicholson <<a href="mailto:nicholdav@gmail.com">nicholdav@gmail.com</a>><br>
To: Scikit-learn user and developer mailing list<br>
        <<a href="mailto:scikit-learn@python.org">scikit-learn@python.org</a>><br>
Subject: Re: [scikit-learn] Is there a built-in function for pairs of<br>
        data?<br>
Message-ID:<br>
        <<a href="mailto:CAMabFbXamB5KzQY9_WU%2B8BFxpSECbs2fSiQqad18zi9zmOjvVQ@mail.gmail.com">CAMabFbXamB5KzQY9_WU+<wbr>8BFxpSECbs2fSiQqad18zi9zmOjvVQ<wbr>@mail.gmail.com</a>><br>
Content-Type: text/plain; charset="utf-8"<br>
<br>
Do you mean like train_test_split?<br>
<a href="http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html" rel="noreferrer" target="_blank">http://scikit-learn.org/<wbr>stable/modules/generated/<wbr>sklearn.cross_validation.<wbr>train_test_split.html</a><br>
<br>
<br>
------------------------------<br>
<br>
Message: 5<br>
Date: Mon, 26 Sep 2016 21:43:05 +0200<br>
From: "Md. Khairullah" <<a href="mailto:md.khairullah@gmail.com">md.khairullah@gmail.com</a>><br>
To: <a href="mailto:scikit-learn@python.org">scikit-learn@python.org</a><br>
Subject: [scikit-learn] Large computation time for homogeneous data<br>
        with agglomerative clustering<br>
Message-ID:<br>
        <<a href="mailto:CA%2BxrTcKMkwSN2Y7jFg12nEx-Ch_V5bw7eLhG5UO39wN%2BebBozg@mail.gmail.com">CA+xrTcKMkwSN2Y7jFg12nEx-Ch_<wbr>V5bw7eLhG5UO39wN+ebBozg@mail.<wbr>gmail.com</a>><br>
Content-Type: text/plain; charset="utf-8"<br>
<br>
Dear Scikit-learners,<br>
This is my first post here and I hope you experts can help me a lot.<br>
<br>
We are using agglomerative clustering with Ward's linkage and a<br>
connectivity constraint. The data set has around 205,000 samples (each a single<br>
scalar feature). The data set is dynamic (in time) and we need to apply<br>
clustering at different times throughout the process. Initially all data are 0<br>
and they increase gradually. In other words, in the early stage the data are<br>
more homogeneous and the heterogeneity among the data increases gradually.<br>
If the clustering is applied at the final stage (most heterogeneous data,<br>
but of course having patterns/clusters) requesting 20 clusters, it takes<br>
only 61s of CPU time. But if clustering is run at an early stage (more<br>
homogeneous data, though not all 0, and of course there are<br>
patterns/clusters in the data) with the same settings, the time rises to<br>
1h 5m. The CPU time falls between these two if the data come from an<br>
in-between time stamp. I also tried the other linkage options, but<br>
the situation does not improve. My understanding is that the homogeneity is<br>
playing a role.<br>
<br>
Have you experienced this too? What solution do you suggest?<br>
<br>
Thanks in advance for your attention and help.<br>
<br>
--<br>
Best regards<br>
<br>
Md. Khairullah<br>
PhD Student, KU Leuven<br>
Numerical Analysis and Applied Mathematics Section<br>
Celestijnenlaan 200a - box 2402<br>
3001 Leuven<br>
room: 03.18<br>
tel. +32 16 37 39 66<br>
fax +32 16 3 27996<br>
<br>
------------------------------<br>
<br>
End of scikit-learn Digest, Vol 6, Issue 40<br>
******************************<wbr>*************<br>
<br>
</blockquote></div><br></div>
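<div>For the paired-rows case in this thread, the grouped splitters in sklearn.model_selection (Scikit-learn 0.18 or later) might be used roughly as follows. This is only a sketch under assumptions: the data are random placeholders, and the names X, y, groups and n_subjects are made up for illustration. The one essential ingredient is a groups array that gives the subject identifier for every row.</div>

```python
import numpy as np
from sklearn.model_selection import GroupKFold, GroupShuffleSplit

# Hypothetical data: 6 subjects, 2 rows each (successful / unsuccessful encoding).
n_subjects = 6
rng = np.random.RandomState(0)
X = rng.randn(2 * n_subjects, 4)              # 4 placeholder features per row
y = np.tile([1, 0], n_subjects)               # 1 = successful, 0 = unsuccessful
groups = np.repeat(np.arange(n_subjects), 2)  # subject id for each row

# Single train/test split that keeps both rows of a subject together.
# An integer test_size is the number of *groups* (subjects) held out.
splitter = GroupShuffleSplit(n_splits=1, test_size=2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups))
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# Subjects present on both sides of the split (should be none).
shared = set(groups[train_idx]) & set(groups[test_idx])

# Grouped K-fold: every subject is held out in exactly one fold.
cv = GroupKFold(n_splits=3)
folds = list(cv.split(X, y, groups))
```

<div>Because both rows of a subject always land on the same side, the test accuracy is not inflated by the model having already seen the other row of a test subject. The same groups array can be passed to cross_val_score through its groups parameter together with cv=GroupKFold(...).</div>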