<div dir="ltr">Hi Afarin,<div><br></div><div>You appear to be describing a situation in which your dataset is divided into subsets within which the data are highly correlated (but perhaps conditionally independent given the subject / group identifier). In scikit-learn 0.18 the relevant tools are the "grouped cross-validation" strategies (GroupKFold, LeaveOneGroupOut, etc.), which keep all samples from one group on the same side of each split. See <a href="http://scikit-learn.org/dev/modules/cross_validation.html#cross-validation-iterators-for-grouped-data">http://scikit-learn.org/dev/modules/cross_validation.html#cross-validation-iterators-for-grouped-data</a>.</div><div><br></div><div>(In earlier versions of scikit-learn, you will find the corresponding CV objects as LabelKFold, LeaveOneLabelOut, etc., but we decided to rename them for clarity when redesigning the CV objects and moving them to the new sklearn.model_selection subpackage.)</div><div><br></div><div>I hope that helps.</div><div><br></div><div>Joel</div></div><div class="gmail_extra"><br><div class="gmail_quote">On 27 September 2016 at 07:06, Afarin Famili <span dir="ltr"><<a href="mailto:Afarin.Famili@utsouthwestern.edu" target="_blank">Afarin.Famili@utsouthwestern.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi David,<br>

<br>

When applying train_test_split to the sample space, each subject is assumed to have a single row. I am looking for a similar function that can deal with pairs of rows (one pair per subject) and does not lead to a biased accuracy estimate. We are studying memory and have one row of features for successful memory encoding and a second row for unsuccessful memory encoding for each subject. The targets are 1 for successful and 0 for unsuccessful encoding, respectively.<br>

How do you recommend I split this data set in order to get a reasonable/unbiased accuracy?<br>

<br>

Thanks,<br>

Afarin<br>

<br>

<br>

<br>

______________________________<wbr>__________<br>

From: scikit-learn <scikit-learn-bounces+afarin.<wbr>famili=<a href="mailto:utsouthwestern.edu@python.org">utsouthwestern.edu@<wbr>python.org</a>> on behalf of <a href="mailto:scikit-learn-request@python.org">scikit-learn-request@python.<wbr>org</a> <<a href="mailto:scikit-learn-request@python.org">scikit-learn-request@python.<wbr>org</a>><br>

Sent: Monday, September 26, 2016 2:43 PM<br>

To: <a href="mailto:scikit-learn@python.org">scikit-learn@python.org</a><br>

Subject: scikit-learn Digest, Vol 6, Issue 40<br>

<br>

Today's Topics:<br>

<br>

   1. header intact (Afarin Famili)<br>

   2. Is there a built-in function for pairs of data? (Afarin Famili)<br>

   3. Re: Is there a built-in function for pairs of data?<br>

      (Pedro Pazzini)<br>

   4. Re: Is there a built-in function for pairs of data?<br>

      (David Nicholson)<br>

   5. Large computation time for homogeneous data with<br>

      agglomerative clustering (Md. Khairullah)<br>

<br>

<br>

------------------------------<wbr>------------------------------<wbr>----------<br>

<br>

Message: 1<br>

Date: Mon, 26 Sep 2016 18:03:27 +0000<br>

From: Afarin Famili <Afarin.Famili@UTSouthwestern.<wbr>edu><br>

To: "<a href="mailto:scikit-learn@python.org">scikit-learn@python.org</a>" <<a href="mailto:scikit-learn@python.org">scikit-learn@python.org</a>><br>

Subject: [scikit-learn] header intact<br>

Message-ID: <1474913007611.80841@<wbr>UTSouthwestern.edu><br>

Content-Type: text/plain; charset="iso-8859-1"<br>

<br>

?<br>

<br>

<br>

<br>

______________________________<wbr>__<br>

<br>

UT Southwestern<br>

<br>

<br>

Medical Center<br>

<br>

<br>

<br>

The future of medicine, today.<br>

<br>

-------------- next part --------------<br>

An HTML attachment was scrubbed...<br>

URL: <<a href="http://mail.python.org/pipermail/scikit-learn/attachments/20160926/92efd185/attachment-0001.html" rel="noreferrer" target="_blank">http://mail.python.org/<wbr>pipermail/scikit-learn/<wbr>attachments/20160926/92efd185/<wbr>attachment-0001.html</a>><br>

<br>

------------------------------<br>

<br>

Message: 2<br>

Date: Mon, 26 Sep 2016 18:06:49 +0000<br>

From: Afarin Famili <Afarin.Famili@UTSouthwestern.<wbr>edu><br>

To: "<a href="mailto:scikit-learn@python.org">scikit-learn@python.org</a>" <<a href="mailto:scikit-learn@python.org">scikit-learn@python.org</a>><br>

Subject: [scikit-learn] Is there a built-in function for pairs of<br>

        data?<br>

Message-ID: <1474913209751.36283@<wbr>UTSouthwestern.edu><br>

Content-Type: text/plain; charset="iso-8859-1"<br>

<br>

<br>

Dear Scikit-learn team,<br>

<br>

<br>

We need to deal with pairs of data in our classification task. I was wondering if there is already a built-in function in Scikit-learn that can partition the pairs of data into train and test sets?<br>

<br>

<br>

Regards,<br>

<br>

Afarin<br>

<br>

------------------------------<br>

<br>

Message: 3<br>

Date: Mon, 26 Sep 2016 15:47:26 -0300<br>

From: Pedro Pazzini <<a href="mailto:pedropazzini@gmail.com">pedropazzini@gmail.com</a>><br>

To: Scikit-learn user and developer mailing list<br>

        <<a href="mailto:scikit-learn@python.org">scikit-learn@python.org</a>><br>

Subject: Re: [scikit-learn] Is there a built-in function for pairs of<br>

        data?<br>

Message-ID:<br>

        <CAAY8FkB2LjnegwFbn=<a href="mailto:gSOawLBcBQ3dnYa6BxDxN6-cvLT1RsfA@mail.gmail.com">gSOawLBcBQ<wbr>3dnYa6BxDxN6-cvLT1RsfA@mail.<wbr>gmail.com</a>><br>

Content-Type: text/plain; charset="utf-8"<br>

<br>

Like this?:<br>

<a href="http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html" rel="noreferrer" target="_blank">http://scikit-learn.org/<wbr>stable/modules/generated/<wbr>sklearn.cross_validation.<wbr>train_test_split.html</a><br>

<br>

<br>

------------------------------<br>

<br>

Message: 4<br>

Date: Mon, 26 Sep 2016 14:53:05 -0400<br>

From: David Nicholson <<a href="mailto:nicholdav@gmail.com">nicholdav@gmail.com</a>><br>

To: Scikit-learn user and developer mailing list<br>

        <<a href="mailto:scikit-learn@python.org">scikit-learn@python.org</a>><br>

Subject: Re: [scikit-learn] Is there a built-in function for pairs of<br>

        data?<br>

Message-ID:<br>

        <<a href="mailto:CAMabFbXamB5KzQY9_WU%2B8BFxpSECbs2fSiQqad18zi9zmOjvVQ@mail.gmail.com">CAMabFbXamB5KzQY9_WU+<wbr>8BFxpSECbs2fSiQqad18zi9zmOjvVQ<wbr>@mail.gmail.com</a>><br>

Content-Type: text/plain; charset="utf-8"<br>

<br>

Do you mean like train_test_split?<br>

<a href="http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html" rel="noreferrer" target="_blank">http://scikit-learn.org/<wbr>stable/modules/generated/<wbr>sklearn.cross_validation.<wbr>train_test_split.html</a><br>

<br>

<br>

------------------------------<br>

<br>

Message: 5<br>

Date: Mon, 26 Sep 2016 21:43:05 +0200<br>

From: "Md. Khairullah" <<a href="mailto:md.khairullah@gmail.com">md.khairullah@gmail.com</a>><br>

To: <a href="mailto:scikit-learn@python.org">scikit-learn@python.org</a><br>

Subject: [scikit-learn] Large computation time for homogeneous data<br>

        with agglomerative clustering<br>

Message-ID:<br>

        <<a href="mailto:CA%2BxrTcKMkwSN2Y7jFg12nEx-Ch_V5bw7eLhG5UO39wN%2BebBozg@mail.gmail.com">CA+xrTcKMkwSN2Y7jFg12nEx-Ch_<wbr>V5bw7eLhG5UO39wN+ebBozg@mail.<wbr>gmail.com</a>><br>

Content-Type: text/plain; charset="utf-8"<br>

<br>

Dear Scikit-learners,<br>

This is my first post here and I hope you experts can help me a lot.<br>

<br>

We are using agglomerative clustering with Ward's linkage and a<br>

connectivity constraint. The data size is around 205,000 (each is a single<br>

scalar feature). The data set is dynamic (in time) and we need to apply<br>

clustering at different times throughout the process. Initially all data is 0<br>

and they increase gradually. Alternatively, in the early stage the data is<br>

more homogeneous and the heterogeneity among the data increases gradually.<br>

If the clustering is applied at the final stage (most heterogeneous data,<br>

but of course having patterns/clusters) requesting 20 clusters, it takes<br>

only 61s of CPU time. But, if clustering is run in an early stage (more<br>

homogeneous data but all are not 0 and of course there are<br>

patterns/clusters in the data) with the same settings the time rises up to<br>

1h 5m. The CPU time is in-between of these two if the data come from an<br>

in-between time stamp. I also tried the other linkage options, but<br>

the situation does not improve. My understanding is that the homogeneity is<br>

playing the role.<br>

<br>

Have you experienced this too? What solution do you suggest?<br>

<br>

Thanks in advance for your attention and help.<br>

<br>

--<br>

Best regards<br>

<br>

Md. Khairullah<br>

PhD Student, KU Leuven<br>

Numerical Analysis and Applied Mathematics Section<br>

Celestijnenlaan 200a - box 2402<br>

3001 Leuven<br>

room: 03.18<br>

tel. +32 16 37 39 66<br>

fax +32 16 3 27996<br>

<br>

------------------------------<br>

<br>

Subject: Digest Footer<br>

<br>

______________________________<wbr>_________________<br>

scikit-learn mailing list<br>

<a href="mailto:scikit-learn@python.org">scikit-learn@python.org</a><br>

<a href="https://mail.python.org/mailman/listinfo/scikit-learn" rel="noreferrer" target="_blank">https://mail.python.org/<wbr>mailman/listinfo/scikit-learn</a><br>

<br>

<br>

------------------------------<br>

<br>

End of scikit-learn Digest, Vol 6, Issue 40<br>

******************************<wbr>*************<br>

<br>

</blockquote></div><br></div>
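<div>For the paired-rows case discussed in this thread, a grouped split might look roughly like the sketch below. It assumes scikit-learn 0.18 or later (where sklearn.model_selection provides GroupShuffleSplit and GroupKFold); the data, the number of subjects and the variable names are made up for illustration and are not taken from the original poster's study.</div>

```python
import numpy as np
from sklearn.model_selection import GroupKFold, GroupShuffleSplit

# Hypothetical toy data: 10 subjects, 2 rows each (successful / unsuccessful encoding).
n_subjects = 10
rng = np.random.RandomState(0)
X = rng.randn(2 * n_subjects, 5)              # 5 made-up features per row
y = np.tile([1, 0], n_subjects)               # 1 = successful, 0 = unsuccessful
groups = np.repeat(np.arange(n_subjects), 2)  # subject id for every row

# Single train/test split that never separates the two rows of a subject.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

# Subjects that appear on both sides of the split (should be none).
shared = set(groups[train_idx]) & set(groups[test_idx])

# Grouped K-fold: every subject lands in exactly one test fold.
folds = list(GroupKFold(n_splits=5).split(X, y, groups=groups))
```

<div>Because the split is made on subject identifiers rather than on rows, both rows of a pair always end up on the same side, which is the property a plain train_test_split does not guarantee.</div>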