From sarah.zaranek at gmail.com Sun Feb 4 23:10:55 2018 From: sarah.zaranek at gmail.com (Sarah Wait Zaranek) Date: Sun, 4 Feb 2018 23:10:55 -0500 Subject: [scikit-learn] One-hot encoding Message-ID: Hello - I was just wondering if there was a way to improve performance on the one-hot encoder. Or, is there any plans to do so in the future? I am working with a matrix that will ultimately have 20 million categorical variables, and my bottleneck is the one-hot encoder. Let me know if this isn't the place to inquire. My code is very simple when using the encoder, but I cut and pasted it here for completeness. enc = OneHotEncoder(sparse=True) Xtrain = enc.fit_transform(tiledata) Thanks, Sarah -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sun Feb 4 23:27:37 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Mon, 5 Feb 2018 15:27:37 +1100 Subject: [scikit-learn] One-hot encoding In-Reply-To: References: Message-ID: 20 million categories, or 20 million categorical variables? OneHotEncoder is pretty efficient if you specify n_values. On 5 February 2018 at 15:10, Sarah Wait Zaranek wrote: > Hello - > > I was just wondering if there was a way to improve performance on the > one-hot encoder. Or, is there any plans to do so in the future? I am > working with a matrix that will ultimately have 20 million categorical > variables, and my bottleneck is the one-hot encoder. > > Let me know if this isn't the place to inquire. My code is very simple > when using the encoder, but I cut and pasted it here for completeness. > > enc = OneHotEncoder(sparse=True) > Xtrain = enc.fit_transform(tiledata) > > > Thanks, > Sarah > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sun Feb 4 23:30:19 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Mon, 5 Feb 2018 15:30:19 +1100 Subject: [scikit-learn] One-hot encoding In-Reply-To: References: Message-ID: You will also benefit from assume_finite (see http://scikit-learn.org/stable/modules/generated/sklearn.config_context.html ) -------------- next part -------------- An HTML attachment was scrubbed... URL: From sarah.zaranek at gmail.com Sun Feb 4 23:31:21 2018 From: sarah.zaranek at gmail.com (Sarah Wait Zaranek) Date: Sun, 4 Feb 2018 23:31:21 -0500 Subject: [scikit-learn] One-hot encoding In-Reply-To: References: Message-ID: Hi Joel - 20 million categorical variables. It comes from segmenting the genome into 20 million parts. Genomes are big :) For n_values, I am a bit confused. Is the input the same as the output for n values. Originally, I thought it was just the number of levels per column, but it seems like it is more like the highest value of the levels (in terms of integers). Cheers, Sarah On Sun, Feb 4, 2018 at 11:27 PM, Joel Nothman wrote: > 20 million categories, or 20 million categorical variables? > > OneHotEncoder is pretty efficient if you specify n_values. > > On 5 February 2018 at 15:10, Sarah Wait Zaranek > wrote: > >> Hello - >> >> I was just wondering if there was a way to improve performance on the >> one-hot encoder. Or, is there any plans to do so in the future? I am >> working with a matrix that will ultimately have 20 million categorical >> variables, and my bottleneck is the one-hot encoder. 
>> >> Let me know if this isn't the place to inquire. My code is very simple >> when using the encoder, but I cut and pasted it here for completeness. >> >> enc = OneHotEncoder(sparse=True) >> Xtrain = enc.fit_transform(tiledata) >> >> >> Thanks, >> Sarah >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sarah.zaranek at gmail.com Sun Feb 4 23:32:12 2018 From: sarah.zaranek at gmail.com (Sarah Wait Zaranek) Date: Sun, 4 Feb 2018 23:32:12 -0500 Subject: [scikit-learn] One-hot encoding In-Reply-To: References: Message-ID: ?Sorry - your second message popped up when I was writing my response. I will look at this as well. Thanks for being so speedy! Cheers, Sarah? On Sun, Feb 4, 2018 at 11:30 PM, Joel Nothman wrote: > You will also benefit from assume_finite (see http://scikit-learn.org/ > stable/modules/generated/sklearn.config_context.html) > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Mon Feb 5 00:02:38 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Mon, 5 Feb 2018 16:02:38 +1100 Subject: [scikit-learn] One-hot encoding In-Reply-To: References: Message-ID: If each input column is encoded as a value from 0 to the (number of possible values for that column - 1) then n_values for that column should be the highest value + 1, which is also the number of levels per column. Does that make sense? Actually, I've realised there's a somewhat slow and unnecessary bit of code in the one-hot encoder: where the COO matrix is converted to CSR. I suspect this was done because most of our ML algorithms perform better on CSR, or else to maintain backwards compatibility with an earlier implementation. -------------- next part -------------- An HTML attachment was scrubbed... URL: From sarah.zaranek at gmail.com Mon Feb 5 00:25:57 2018 From: sarah.zaranek at gmail.com (Sarah Wait Zaranek) Date: Mon, 5 Feb 2018 00:25:57 -0500 Subject: [scikit-learn] One-hot encoding In-Reply-To: References: Message-ID: Hi Joel - Conceptually, that makes sense. But when I assign n_values, I can't make it match the result when you don't specify them. See below. I used the number of unique levels per column. 
>>> enc = OneHotEncoder(sparse=False) >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]]) >>> test array([[0., 0., 1., 1., 0., 0., 0., 0., 1.], [0., 1., 0., 0., 1., 1., 0., 0., 0.], [1., 0., 0., 0., 1., 0., 1., 0., 0.], [0., 1., 0., 1., 0., 0., 0., 1., 0.]]) >>> enc = OneHotEncoder(sparse=False,n_values=[3,2,4]) >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]]) >>> test array([[0., 0., 0., 1., 0., 0., 0., 1., 1.], [0., 1., 0., 0., 0., 2., 0., 0., 0.], [1., 0., 0., 0., 0., 1., 1., 0., 0.], [0., 1., 0., 1., 0., 0., 0., 1., 0.]]) Cheers, Sarah Cheers, Sarah On Mon, Feb 5, 2018 at 12:02 AM, Joel Nothman wrote: > If each input column is encoded as a value from 0 to the (number of > possible values for that column - 1) then n_values for that column should > be the highest value + 1, which is also the number of levels per column. > Does that make sense? > > Actually, I've realised there's a somewhat slow and unnecessary bit of > code in the one-hot encoder: where the COO matrix is converted to CSR. I > suspect this was done because most of our ML algorithms perform better on > CSR, or else to maintain backwards compatibility with an earlier > implementation. > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sarah.zaranek at gmail.com Mon Feb 5 00:31:19 2018 From: sarah.zaranek at gmail.com (Sarah Wait Zaranek) Date: Mon, 5 Feb 2018 00:31:19 -0500 Subject: [scikit-learn] One-hot encoding In-Reply-To: References: Message-ID: If I use the n+1 approach, then I get the correct matrix, except with the columns of zeros: >>> test array([[0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.], [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0.], [0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0.]]) On Mon, Feb 5, 2018 at 12:25 AM, Sarah Wait Zaranek wrote: > Hi Joel - > > Conceptually, that makes sense. But when I assign n_values, I can't make > it match the result when you don't specify them. See below. I used the > number of unique levels per column. > > >>> enc = OneHotEncoder(sparse=False) > >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]]) > >>> test > array([[0., 0., 1., 1., 0., 0., 0., 0., 1.], > [0., 1., 0., 0., 1., 1., 0., 0., 0.], > [1., 0., 0., 0., 1., 0., 1., 0., 0.], > [0., 1., 0., 1., 0., 0., 0., 1., 0.]]) > >>> enc = OneHotEncoder(sparse=False,n_values=[3,2,4]) > >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]]) > >>> test > array([[0., 0., 0., 1., 0., 0., 0., 1., 1.], > [0., 1., 0., 0., 0., 2., 0., 0., 0.], > [1., 0., 0., 0., 0., 1., 1., 0., 0.], > [0., 1., 0., 1., 0., 0., 0., 1., 0.]]) > > Cheers, > Sarah > > Cheers, > Sarah > > On Mon, Feb 5, 2018 at 12:02 AM, Joel Nothman > wrote: > >> If each input column is encoded as a value from 0 to the (number of >> possible values for that column - 1) then n_values for that column should >> be the highest value + 1, which is also the number of levels per column. >> Does that make sense? >> >> Actually, I've realised there's a somewhat slow and unnecessary bit of >> code in the one-hot encoder: where the COO matrix is converted to CSR. 
I >> suspect this was done because most of our ML algorithms perform better on >> CSR, or else to maintain backwards compatibility with an earlier >> implementation. >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Mon Feb 5 00:56:25 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Mon, 5 Feb 2018 16:56:25 +1100 Subject: [scikit-learn] One-hot encoding In-Reply-To: References: Message-ID: If you specify n_values=[list_of_vals_for_column1, list_of_vals_for_column2], you should be able to engineer it to how you want. On 5 February 2018 at 16:31, Sarah Wait Zaranek wrote: > If I use the n+1 approach, then I get the correct matrix, except with the > columns of zeros: > > >>> test > array([[0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1.], > [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.], > [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0.], > [0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0.]]) > > > On Mon, Feb 5, 2018 at 12:25 AM, Sarah Wait Zaranek < > sarah.zaranek at gmail.com> wrote: > >> Hi Joel - >> >> Conceptually, that makes sense. But when I assign n_values, I can't make >> it match the result when you don't specify them. See below. I used the >> number of unique levels per column. >> >> >>> enc = OneHotEncoder(sparse=False) >> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]]) >> >>> test >> array([[0., 0., 1., 1., 0., 0., 0., 0., 1.], >> [0., 1., 0., 0., 1., 1., 0., 0., 0.], >> [1., 0., 0., 0., 1., 0., 1., 0., 0.], >> [0., 1., 0., 1., 0., 0., 0., 1., 0.]]) >> >>> enc = OneHotEncoder(sparse=False,n_values=[3,2,4]) >> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]]) >> >>> test >> array([[0., 0., 0., 1., 0., 0., 0., 1., 1.], >> [0., 1., 0., 0., 0., 2., 0., 0., 0.], >> [1., 0., 0., 0., 0., 1., 1., 0., 0.], >> [0., 1., 0., 1., 0., 0., 0., 1., 0.]]) >> >> Cheers, >> Sarah >> >> Cheers, >> Sarah >> >> On Mon, Feb 5, 2018 at 12:02 AM, Joel Nothman >> wrote: >> >>> If each input column is encoded as a value from 0 to the (number of >>> possible values for that column - 1) then n_values for that column should >>> be the highest value + 1, which is also the number of levels per column. >>> Does that make sense? >>> >>> Actually, I've realised there's a somewhat slow and unnecessary bit of >>> code in the one-hot encoder: where the COO matrix is converted to CSR. I >>> suspect this was done because most of our ML algorithms perform better on >>> CSR, or else to maintain backwards compatibility with an earlier >>> implementation. >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sarah.zaranek at gmail.com Mon Feb 5 01:05:11 2018 From: sarah.zaranek at gmail.com (Sarah Wait Zaranek) Date: Mon, 5 Feb 2018 01:05:11 -0500 Subject: [scikit-learn] One-hot encoding In-Reply-To: References: Message-ID: Great. Thank you for all your help. 
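For anyone reading the archive later, here is a small sketch of what finally clicked for me on the toy example above (scikit-learn 0.19-style API; n_values set to the highest integer code + 1 per column, i.e. [8, 3, 4] here). It reproduces the 15-column "n+1 approach" result, with all-zero columns for the codes that never occur, whereas n_values='auto' appears to simply drop those unused columns:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])

# highest code + 1 for each column: 8, 3 and 4
enc = OneHotEncoder(sparse=False, n_values=[8, 3, 4])
encoded = enc.fit_transform(X)
print(encoded.shape)  # (4, 15) -- keeps all-zero columns for codes never seen in X
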
Cheers, Sarah On Mon, Feb 5, 2018 at 12:56 AM, Joel Nothman wrote: > If you specify n_values=[list_of_vals_for_column1, > list_of_vals_for_column2], you should be able to engineer it to how you > want. > > On 5 February 2018 at 16:31, Sarah Wait Zaranek > wrote: > >> If I use the n+1 approach, then I get the correct matrix, except with the >> columns of zeros: >> >> >>> test >> array([[0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1.], >> [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.], >> [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0.], >> [0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0.]]) >> >> >> On Mon, Feb 5, 2018 at 12:25 AM, Sarah Wait Zaranek < >> sarah.zaranek at gmail.com> wrote: >> >>> Hi Joel - >>> >>> Conceptually, that makes sense. But when I assign n_values, I can't >>> make it match the result when you don't specify them. See below. I used >>> the number of unique levels per column. >>> >>> >>> enc = OneHotEncoder(sparse=False) >>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, >>> 2]]) >>> >>> test >>> array([[0., 0., 1., 1., 0., 0., 0., 0., 1.], >>> [0., 1., 0., 0., 1., 1., 0., 0., 0.], >>> [1., 0., 0., 0., 1., 0., 1., 0., 0.], >>> [0., 1., 0., 1., 0., 0., 0., 1., 0.]]) >>> >>> enc = OneHotEncoder(sparse=False,n_values=[3,2,4]) >>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, >>> 2]]) >>> >>> test >>> array([[0., 0., 0., 1., 0., 0., 0., 1., 1.], >>> [0., 1., 0., 0., 0., 2., 0., 0., 0.], >>> [1., 0., 0., 0., 0., 1., 1., 0., 0.], >>> [0., 1., 0., 1., 0., 0., 0., 1., 0.]]) >>> >>> Cheers, >>> Sarah >>> >>> Cheers, >>> Sarah >>> >>> On Mon, Feb 5, 2018 at 12:02 AM, Joel Nothman >>> wrote: >>> >>>> If each input column is encoded as a value from 0 to the (number of >>>> possible values for that column - 1) then n_values for that column should >>>> be the highest value + 1, which is also the number of levels per column. >>>> Does that make sense? >>>> >>>> Actually, I've realised there's a somewhat slow and unnecessary bit of >>>> code in the one-hot encoder: where the COO matrix is converted to CSR. I >>>> suspect this was done because most of our ML algorithms perform better on >>>> CSR, or else to maintain backwards compatibility with an earlier >>>> implementation. >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From y.mazari at gmail.com Mon Feb 5 10:31:26 2018 From: y.mazari at gmail.com (Yacine MAZARI) Date: Tue, 6 Feb 2018 00:31:26 +0900 Subject: [scikit-learn] Need Help with Failing Travis/Appveyor Build Message-ID: Hello, I added some additional unit tests to this PR , which required me to do > from numpy.testing import assert_array_compare > This caused the build on both Travis and Appveyor to fail with > ImportError: cannot import name assert_array_compare > I don't have this issue on my locale (Ubuntu 16.04, Python 2.7). Any hints? Thank you. 
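In case it is useful, the workaround I am considering is simply guarding the import (just a sketch; I am assuming the name is not exported by the older numpy pinned on CI, and the fallback below is a crude stand-in of my own, not a numpy function):

try:
    from numpy.testing import assert_array_compare
except ImportError:
    import numpy as np

    def assert_array_compare(comparison, x, y, err_msg=''):
        # minimal stand-in: apply the comparison elementwise and require it to hold
        if not np.all(comparison(np.asarray(x), np.asarray(y))):
            raise AssertionError(err_msg or 'array comparison failed')
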
-------------- next part -------------- An HTML attachment was scrubbed... URL: From raphaelrcampos at yahoo.com.br Mon Feb 5 11:02:54 2018 From: raphaelrcampos at yahoo.com.br (RAPHAEL RODRIGUES CAMPOS) Date: Mon, 5 Feb 2018 16:02:54 +0000 (UTC) Subject: [scikit-learn] Weighted feature sampling in Random Forest References: <1390243594.3420665.1517846574948.ref@mail.yahoo.com> Message-ID: <1390243594.3420665.1517846574948@mail.yahoo.com> Hello Scikit community, I want the Random Forest to be able to draw the candidate features for split using weighted sampling. Basically, I want to reproduce this paper's results https://pdfs.semanticscholar.org/9b2f/84d85e5b6979bf375a2d4b15f7526597fc70.pdf I wonder whether I only need to change the function on the following line: https://github.com/scikit-learn/scikit-learn/blob/6919a22c8023f4c2f30849c7ce05de745b6d6c1a/sklearn/tree/_splitter.pyx#L1287, and permutate the weight of the features as well. Or it would be more complicated. Could someone shed a light on this? Thanks in advance. -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Mon Feb 5 16:28:50 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 6 Feb 2018 08:28:50 +1100 Subject: [scikit-learn] Need Help with Failing Travis/Appveyor Build In-Reply-To: References: Message-ID: I assume it is not available in all supported versions of numpy. but I can't imagine you need it if we have not used it before! On 6 Feb 2018 2:32 am, "Yacine MAZARI" wrote: > Hello, > > I added some additional unit tests to this PR > , which required > me to do > >> from numpy.testing import assert_array_compare >> > > This caused the build on both Travis > > and Appveyor > > to fail with > >> ImportError: cannot import name assert_array_compare >> > > I don't have this issue on my locale (Ubuntu 16.04, Python 2.7). > > Any hints? > > Thank you. > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sarah.zaranek at gmail.com Mon Feb 5 21:24:46 2018 From: sarah.zaranek at gmail.com (Sarah Wait Zaranek) Date: Mon, 5 Feb 2018 21:24:46 -0500 Subject: [scikit-learn] One-hot encoding In-Reply-To: References: Message-ID: Hi Joel - I am also seeing a huge overhead in memory for calling the onehot-encoder. I have hacked it by running it splitting by matrix into 4-5 smaller matrices (by columns) and then concatenating the results. But, I am seeing upwards of 100 Gigs overhead. Should I file a bug report? Or is this to be expected. Cheers, Sarah On Mon, Feb 5, 2018 at 1:05 AM, Sarah Wait Zaranek wrote: > Great. Thank you for all your help. > > Cheers, > Sarah > > On Mon, Feb 5, 2018 at 12:56 AM, Joel Nothman > wrote: > >> If you specify n_values=[list_of_vals_for_column1, >> list_of_vals_for_column2], you should be able to engineer it to how you >> want. 
>> >> On 5 February 2018 at 16:31, Sarah Wait Zaranek >> wrote: >> >>> If I use the n+1 approach, then I get the correct matrix, except with >>> the columns of zeros: >>> >>> >>> test >>> array([[0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1.], >>> [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.], >>> [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0.], >>> [0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0.]]) >>> >>> >>> On Mon, Feb 5, 2018 at 12:25 AM, Sarah Wait Zaranek < >>> sarah.zaranek at gmail.com> wrote: >>> >>>> Hi Joel - >>>> >>>> Conceptually, that makes sense. But when I assign n_values, I can't >>>> make it match the result when you don't specify them. See below. I used >>>> the number of unique levels per column. >>>> >>>> >>> enc = OneHotEncoder(sparse=False) >>>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, >>>> 2]]) >>>> >>> test >>>> array([[0., 0., 1., 1., 0., 0., 0., 0., 1.], >>>> [0., 1., 0., 0., 1., 1., 0., 0., 0.], >>>> [1., 0., 0., 0., 1., 0., 1., 0., 0.], >>>> [0., 1., 0., 1., 0., 0., 0., 1., 0.]]) >>>> >>> enc = OneHotEncoder(sparse=False,n_values=[3,2,4]) >>>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, >>>> 2]]) >>>> >>> test >>>> array([[0., 0., 0., 1., 0., 0., 0., 1., 1.], >>>> [0., 1., 0., 0., 0., 2., 0., 0., 0.], >>>> [1., 0., 0., 0., 0., 1., 1., 0., 0.], >>>> [0., 1., 0., 1., 0., 0., 0., 1., 0.]]) >>>> >>>> Cheers, >>>> Sarah >>>> >>>> Cheers, >>>> Sarah >>>> >>>> On Mon, Feb 5, 2018 at 12:02 AM, Joel Nothman >>>> wrote: >>>> >>>>> If each input column is encoded as a value from 0 to the (number of >>>>> possible values for that column - 1) then n_values for that column should >>>>> be the highest value + 1, which is also the number of levels per column. >>>>> Does that make sense? >>>>> >>>>> Actually, I've realised there's a somewhat slow and unnecessary bit of >>>>> code in the one-hot encoder: where the COO matrix is converted to CSR. I >>>>> suspect this was done because most of our ML algorithms perform better on >>>>> CSR, or else to maintain backwards compatibility with an earlier >>>>> implementation. >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Mon Feb 5 21:50:55 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 6 Feb 2018 13:50:55 +1100 Subject: [scikit-learn] One-hot encoding In-Reply-To: References: Message-ID: OneHotEncoder will not magically reduce the size of your input. It will necessarily increase the memory of the input data as long as we are storing the results in scipy.sparse matrices. The sparse representation will be less expensive than the dense representation, but it won't be less expensive than the input. On 6 February 2018 at 13:24, Sarah Wait Zaranek wrote: > Hi Joel - > > I am also seeing a huge overhead in memory for calling the > onehot-encoder. 
I have hacked it by running it splitting by matrix into > 4-5 smaller matrices (by columns) and then concatenating the results. But, > I am seeing upwards of 100 Gigs overhead. Should I file a bug report? Or > is this to be expected. > > Cheers, > Sarah > > On Mon, Feb 5, 2018 at 1:05 AM, Sarah Wait Zaranek < > sarah.zaranek at gmail.com> wrote: > >> Great. Thank you for all your help. >> >> Cheers, >> Sarah >> >> On Mon, Feb 5, 2018 at 12:56 AM, Joel Nothman >> wrote: >> >>> If you specify n_values=[list_of_vals_for_column1, >>> list_of_vals_for_column2], you should be able to engineer it to how you >>> want. >>> >>> On 5 February 2018 at 16:31, Sarah Wait Zaranek >> > wrote: >>> >>>> If I use the n+1 approach, then I get the correct matrix, except with >>>> the columns of zeros: >>>> >>>> >>> test >>>> array([[0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1.], >>>> [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.], >>>> [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0.], >>>> [0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0.]]) >>>> >>>> >>>> On Mon, Feb 5, 2018 at 12:25 AM, Sarah Wait Zaranek < >>>> sarah.zaranek at gmail.com> wrote: >>>> >>>>> Hi Joel - >>>>> >>>>> Conceptually, that makes sense. But when I assign n_values, I can't >>>>> make it match the result when you don't specify them. See below. I used >>>>> the number of unique levels per column. >>>>> >>>>> >>> enc = OneHotEncoder(sparse=False) >>>>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, >>>>> 2]]) >>>>> >>> test >>>>> array([[0., 0., 1., 1., 0., 0., 0., 0., 1.], >>>>> [0., 1., 0., 0., 1., 1., 0., 0., 0.], >>>>> [1., 0., 0., 0., 1., 0., 1., 0., 0.], >>>>> [0., 1., 0., 1., 0., 0., 0., 1., 0.]]) >>>>> >>> enc = OneHotEncoder(sparse=False,n_values=[3,2,4]) >>>>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, >>>>> 2]]) >>>>> >>> test >>>>> array([[0., 0., 0., 1., 0., 0., 0., 1., 1.], >>>>> [0., 1., 0., 0., 0., 2., 0., 0., 0.], >>>>> [1., 0., 0., 0., 0., 1., 1., 0., 0.], >>>>> [0., 1., 0., 1., 0., 0., 0., 1., 0.]]) >>>>> >>>>> Cheers, >>>>> Sarah >>>>> >>>>> Cheers, >>>>> Sarah >>>>> >>>>> On Mon, Feb 5, 2018 at 12:02 AM, Joel Nothman >>>>> wrote: >>>>> >>>>>> If each input column is encoded as a value from 0 to the (number of >>>>>> possible values for that column - 1) then n_values for that column should >>>>>> be the highest value + 1, which is also the number of levels per column. >>>>>> Does that make sense? >>>>>> >>>>>> Actually, I've realised there's a somewhat slow and unnecessary bit >>>>>> of code in the one-hot encoder: where the COO matrix is converted to CSR. I >>>>>> suspect this was done because most of our ML algorithms perform better on >>>>>> CSR, or else to maintain backwards compatibility with an earlier >>>>>> implementation. 
>>>>>> >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> >>>>>> >>>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sarah.zaranek at gmail.com Mon Feb 5 21:53:21 2018 From: sarah.zaranek at gmail.com (Sarah Wait Zaranek) Date: Mon, 5 Feb 2018 21:53:21 -0500 Subject: [scikit-learn] One-hot encoding In-Reply-To: References: Message-ID: Yes, of course. What I mean is the I start out with 19 Gigs (initial matrix size) or so, it balloons to 100 Gigs *within the encoder function* and returns 28 Gigs (sparse one-hot matrix size). These numbers aren't exact, but you can see my point. Cheers, Sarah On Mon, Feb 5, 2018 at 9:50 PM, Joel Nothman wrote: > OneHotEncoder will not magically reduce the size of your input. It will > necessarily increase the memory of the input data as long as we are storing > the results in scipy.sparse matrices. The sparse representation will be > less expensive than the dense representation, but it won't be less > expensive than the input. > > On 6 February 2018 at 13:24, Sarah Wait Zaranek > wrote: > >> Hi Joel - >> >> I am also seeing a huge overhead in memory for calling the >> onehot-encoder. I have hacked it by running it splitting by matrix into >> 4-5 smaller matrices (by columns) and then concatenating the results. But, >> I am seeing upwards of 100 Gigs overhead. Should I file a bug report? Or >> is this to be expected. >> >> Cheers, >> Sarah >> >> On Mon, Feb 5, 2018 at 1:05 AM, Sarah Wait Zaranek < >> sarah.zaranek at gmail.com> wrote: >> >>> Great. Thank you for all your help. >>> >>> Cheers, >>> Sarah >>> >>> On Mon, Feb 5, 2018 at 12:56 AM, Joel Nothman >>> wrote: >>> >>>> If you specify n_values=[list_of_vals_for_column1, >>>> list_of_vals_for_column2], you should be able to engineer it to how you >>>> want. >>>> >>>> On 5 February 2018 at 16:31, Sarah Wait Zaranek < >>>> sarah.zaranek at gmail.com> wrote: >>>> >>>>> If I use the n+1 approach, then I get the correct matrix, except with >>>>> the columns of zeros: >>>>> >>>>> >>> test >>>>> array([[0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1.], >>>>> [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.], >>>>> [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0.], >>>>> [0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0.]]) >>>>> >>>>> >>>>> On Mon, Feb 5, 2018 at 12:25 AM, Sarah Wait Zaranek < >>>>> sarah.zaranek at gmail.com> wrote: >>>>> >>>>>> Hi Joel - >>>>>> >>>>>> Conceptually, that makes sense. But when I assign n_values, I can't >>>>>> make it match the result when you don't specify them. See below. I used >>>>>> the number of unique levels per column. 
>>>>>> >>>>>> >>> enc = OneHotEncoder(sparse=False) >>>>>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, >>>>>> 2]]) >>>>>> >>> test >>>>>> array([[0., 0., 1., 1., 0., 0., 0., 0., 1.], >>>>>> [0., 1., 0., 0., 1., 1., 0., 0., 0.], >>>>>> [1., 0., 0., 0., 1., 0., 1., 0., 0.], >>>>>> [0., 1., 0., 1., 0., 0., 0., 1., 0.]]) >>>>>> >>> enc = OneHotEncoder(sparse=False,n_values=[3,2,4]) >>>>>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, >>>>>> 2]]) >>>>>> >>> test >>>>>> array([[0., 0., 0., 1., 0., 0., 0., 1., 1.], >>>>>> [0., 1., 0., 0., 0., 2., 0., 0., 0.], >>>>>> [1., 0., 0., 0., 0., 1., 1., 0., 0.], >>>>>> [0., 1., 0., 1., 0., 0., 0., 1., 0.]]) >>>>>> >>>>>> Cheers, >>>>>> Sarah >>>>>> >>>>>> Cheers, >>>>>> Sarah >>>>>> >>>>>> On Mon, Feb 5, 2018 at 12:02 AM, Joel Nothman >>>>> > wrote: >>>>>> >>>>>>> If each input column is encoded as a value from 0 to the (number of >>>>>>> possible values for that column - 1) then n_values for that column should >>>>>>> be the highest value + 1, which is also the number of levels per column. >>>>>>> Does that make sense? >>>>>>> >>>>>>> Actually, I've realised there's a somewhat slow and unnecessary bit >>>>>>> of code in the one-hot encoder: where the COO matrix is converted to CSR. I >>>>>>> suspect this was done because most of our ML algorithms perform better on >>>>>>> CSR, or else to maintain backwards compatibility with an earlier >>>>>>> implementation. >>>>>>> >>>>>>> _______________________________________________ >>>>>>> scikit-learn mailing list >>>>>>> scikit-learn at python.org >>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>> >>>>>>> >>>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Mon Feb 5 22:33:45 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 6 Feb 2018 14:33:45 +1100 Subject: [scikit-learn] One-hot encoding In-Reply-To: References: Message-ID: Yes, the output CSR representation requires: 1 (dtype) value per entry 1 int32 per entry 1 int32 per row The intermediate COO representation requires: 1 (dtype) value per entry 2 int32 per entry So as long as the transformation from COO to CSR is done over the whole data, it will occupy roughly 5x the input size, which is exactly what you are experienciong. The CategoricalEncoder currently available in the development version of scikit-learn does not have this problem, but might be slower due to handling non-integer categories. It will also possibly disappear and be merged into OneHotEncoder soon (see PR #10523). On 6 February 2018 at 13:53, Sarah Wait Zaranek wrote: > Yes, of course. 
What I mean is the I start out with 19 Gigs (initial > matrix size) or so, it balloons to 100 Gigs *within the encoder function* > and returns 28 Gigs (sparse one-hot matrix size). These numbers aren't > exact, but you can see my point. > > Cheers, > Sarah > > On Mon, Feb 5, 2018 at 9:50 PM, Joel Nothman > wrote: > >> OneHotEncoder will not magically reduce the size of your input. It will >> necessarily increase the memory of the input data as long as we are storing >> the results in scipy.sparse matrices. The sparse representation will be >> less expensive than the dense representation, but it won't be less >> expensive than the input. >> >> On 6 February 2018 at 13:24, Sarah Wait Zaranek >> wrote: >> >>> Hi Joel - >>> >>> I am also seeing a huge overhead in memory for calling the >>> onehot-encoder. I have hacked it by running it splitting by matrix into >>> 4-5 smaller matrices (by columns) and then concatenating the results. But, >>> I am seeing upwards of 100 Gigs overhead. Should I file a bug report? Or >>> is this to be expected. >>> >>> Cheers, >>> Sarah >>> >>> On Mon, Feb 5, 2018 at 1:05 AM, Sarah Wait Zaranek < >>> sarah.zaranek at gmail.com> wrote: >>> >>>> Great. Thank you for all your help. >>>> >>>> Cheers, >>>> Sarah >>>> >>>> On Mon, Feb 5, 2018 at 12:56 AM, Joel Nothman >>>> wrote: >>>> >>>>> If you specify n_values=[list_of_vals_for_column1, >>>>> list_of_vals_for_column2], you should be able to engineer it to how you >>>>> want. >>>>> >>>>> On 5 February 2018 at 16:31, Sarah Wait Zaranek < >>>>> sarah.zaranek at gmail.com> wrote: >>>>> >>>>>> If I use the n+1 approach, then I get the correct matrix, except with >>>>>> the columns of zeros: >>>>>> >>>>>> >>> test >>>>>> array([[0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1.], >>>>>> [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.], >>>>>> [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0.], >>>>>> [0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0.]]) >>>>>> >>>>>> >>>>>> On Mon, Feb 5, 2018 at 12:25 AM, Sarah Wait Zaranek < >>>>>> sarah.zaranek at gmail.com> wrote: >>>>>> >>>>>>> Hi Joel - >>>>>>> >>>>>>> Conceptually, that makes sense. But when I assign n_values, I can't >>>>>>> make it match the result when you don't specify them. See below. I used >>>>>>> the number of unique levels per column. >>>>>>> >>>>>>> >>> enc = OneHotEncoder(sparse=False) >>>>>>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, >>>>>>> 0, 2]]) >>>>>>> >>> test >>>>>>> array([[0., 0., 1., 1., 0., 0., 0., 0., 1.], >>>>>>> [0., 1., 0., 0., 1., 1., 0., 0., 0.], >>>>>>> [1., 0., 0., 0., 1., 0., 1., 0., 0.], >>>>>>> [0., 1., 0., 1., 0., 0., 0., 1., 0.]]) >>>>>>> >>> enc = OneHotEncoder(sparse=False,n_values=[3,2,4]) >>>>>>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, >>>>>>> 0, 2]]) >>>>>>> >>> test >>>>>>> array([[0., 0., 0., 1., 0., 0., 0., 1., 1.], >>>>>>> [0., 1., 0., 0., 0., 2., 0., 0., 0.], >>>>>>> [1., 0., 0., 0., 0., 1., 1., 0., 0.], >>>>>>> [0., 1., 0., 1., 0., 0., 0., 1., 0.]]) >>>>>>> >>>>>>> Cheers, >>>>>>> Sarah >>>>>>> >>>>>>> Cheers, >>>>>>> Sarah >>>>>>> >>>>>>> On Mon, Feb 5, 2018 at 12:02 AM, Joel Nothman < >>>>>>> joel.nothman at gmail.com> wrote: >>>>>>> >>>>>>>> If each input column is encoded as a value from 0 to the (number of >>>>>>>> possible values for that column - 1) then n_values for that column should >>>>>>>> be the highest value + 1, which is also the number of levels per column. >>>>>>>> Does that make sense? 
>>>>>>>> >>>>>>>> Actually, I've realised there's a somewhat slow and unnecessary bit >>>>>>>> of code in the one-hot encoder: where the COO matrix is converted to CSR. I >>>>>>>> suspect this was done because most of our ML algorithms perform better on >>>>>>>> CSR, or else to maintain backwards compatibility with an earlier >>>>>>>> implementation. >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> scikit-learn mailing list >>>>>>>> scikit-learn at python.org >>>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> >>>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sarah.zaranek at gmail.com Mon Feb 5 22:46:16 2018 From: sarah.zaranek at gmail.com (Sarah Wait Zaranek) Date: Mon, 5 Feb 2018 22:46:16 -0500 Subject: [scikit-learn] One-hot encoding In-Reply-To: References: Message-ID: Thanks, this makes sense. I will try using the CategoricalEncoder to see the difference. It wouldn't be such a big deal if my input matrix wasn't so large. Thanks again for all your help. Cheers, Sarah On Mon, Feb 5, 2018 at 10:33 PM, Joel Nothman wrote: > Yes, the output CSR representation requires: > 1 (dtype) value per entry > 1 int32 per entry > 1 int32 per row > > The intermediate COO representation requires: > 1 (dtype) value per entry > 2 int32 per entry > > So as long as the transformation from COO to CSR is done over the whole > data, it will occupy roughly 5x the input size, which is exactly what you > are experienciong. > > The CategoricalEncoder currently available in the development version of > scikit-learn does not have this problem, but might be slower due to > handling non-integer categories. It will also possibly disappear and be > merged into OneHotEncoder soon (see PR #10523). > > > > On 6 February 2018 at 13:53, Sarah Wait Zaranek > wrote: > >> Yes, of course. What I mean is the I start out with 19 Gigs (initial >> matrix size) or so, it balloons to 100 Gigs *within the encoder function* >> and returns 28 Gigs (sparse one-hot matrix size). These numbers aren't >> exact, but you can see my point. >> >> Cheers, >> Sarah >> >> On Mon, Feb 5, 2018 at 9:50 PM, Joel Nothman >> wrote: >> >>> OneHotEncoder will not magically reduce the size of your input. It will >>> necessarily increase the memory of the input data as long as we are storing >>> the results in scipy.sparse matrices. The sparse representation will be >>> less expensive than the dense representation, but it won't be less >>> expensive than the input. 
>>> >>> On 6 February 2018 at 13:24, Sarah Wait Zaranek >> > wrote: >>> >>>> Hi Joel - >>>> >>>> I am also seeing a huge overhead in memory for calling the >>>> onehot-encoder. I have hacked it by running it splitting by matrix into >>>> 4-5 smaller matrices (by columns) and then concatenating the results. But, >>>> I am seeing upwards of 100 Gigs overhead. Should I file a bug report? Or >>>> is this to be expected. >>>> >>>> Cheers, >>>> Sarah >>>> >>>> On Mon, Feb 5, 2018 at 1:05 AM, Sarah Wait Zaranek < >>>> sarah.zaranek at gmail.com> wrote: >>>> >>>>> Great. Thank you for all your help. >>>>> >>>>> Cheers, >>>>> Sarah >>>>> >>>>> On Mon, Feb 5, 2018 at 12:56 AM, Joel Nothman >>>>> wrote: >>>>> >>>>>> If you specify n_values=[list_of_vals_for_column1, >>>>>> list_of_vals_for_column2], you should be able to engineer it to how you >>>>>> want. >>>>>> >>>>>> On 5 February 2018 at 16:31, Sarah Wait Zaranek < >>>>>> sarah.zaranek at gmail.com> wrote: >>>>>> >>>>>>> If I use the n+1 approach, then I get the correct matrix, except >>>>>>> with the columns of zeros: >>>>>>> >>>>>>> >>> test >>>>>>> array([[0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1.], >>>>>>> [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.], >>>>>>> [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0.], >>>>>>> [0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0.]]) >>>>>>> >>>>>>> >>>>>>> On Mon, Feb 5, 2018 at 12:25 AM, Sarah Wait Zaranek < >>>>>>> sarah.zaranek at gmail.com> wrote: >>>>>>> >>>>>>>> Hi Joel - >>>>>>>> >>>>>>>> Conceptually, that makes sense. But when I assign n_values, I >>>>>>>> can't make it match the result when you don't specify them. See below. I >>>>>>>> used the number of unique levels per column. >>>>>>>> >>>>>>>> >>> enc = OneHotEncoder(sparse=False) >>>>>>>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, >>>>>>>> 0, 2]]) >>>>>>>> >>> test >>>>>>>> array([[0., 0., 1., 1., 0., 0., 0., 0., 1.], >>>>>>>> [0., 1., 0., 0., 1., 1., 0., 0., 0.], >>>>>>>> [1., 0., 0., 0., 1., 0., 1., 0., 0.], >>>>>>>> [0., 1., 0., 1., 0., 0., 0., 1., 0.]]) >>>>>>>> >>> enc = OneHotEncoder(sparse=False,n_values=[3,2,4]) >>>>>>>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, >>>>>>>> 0, 2]]) >>>>>>>> >>> test >>>>>>>> array([[0., 0., 0., 1., 0., 0., 0., 1., 1.], >>>>>>>> [0., 1., 0., 0., 0., 2., 0., 0., 0.], >>>>>>>> [1., 0., 0., 0., 0., 1., 1., 0., 0.], >>>>>>>> [0., 1., 0., 1., 0., 0., 0., 1., 0.]]) >>>>>>>> >>>>>>>> Cheers, >>>>>>>> Sarah >>>>>>>> >>>>>>>> Cheers, >>>>>>>> Sarah >>>>>>>> >>>>>>>> On Mon, Feb 5, 2018 at 12:02 AM, Joel Nothman < >>>>>>>> joel.nothman at gmail.com> wrote: >>>>>>>> >>>>>>>>> If each input column is encoded as a value from 0 to the (number >>>>>>>>> of possible values for that column - 1) then n_values for that column >>>>>>>>> should be the highest value + 1, which is also the number of levels per >>>>>>>>> column. Does that make sense? >>>>>>>>> >>>>>>>>> Actually, I've realised there's a somewhat slow and unnecessary >>>>>>>>> bit of code in the one-hot encoder: where the COO matrix is converted to >>>>>>>>> CSR. I suspect this was done because most of our ML algorithms perform >>>>>>>>> better on CSR, or else to maintain backwards compatibility with an earlier >>>>>>>>> implementation. 
>>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> scikit-learn mailing list >>>>>>>>> scikit-learn at python.org >>>>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> scikit-learn mailing list >>>>>>> scikit-learn at python.org >>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>> >>>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> >>>>>> >>>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeff1evesque at yahoo.com Mon Feb 5 23:06:00 2018 From: jeff1evesque at yahoo.com (Jeffrey Levesque) Date: Mon, 5 Feb 2018 23:06:00 -0500 Subject: [scikit-learn] Jeff Levesque: custom json encoder Message-ID: <11817B9B-9BE4-443E-831F-ACC7B0014307@yahoo.com> Hi, Was wondering if anyone has written custom json encoder for sklearn classes, like SVC, or ensemble classes. Is it even possible to json encode, with a custom encoder for ensemble classes, like bagging techniques? Thank you, Jeff Levesque https://github.com/jeff1evesque From joel.nothman at gmail.com Mon Feb 5 23:49:11 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 6 Feb 2018 15:49:11 +1100 Subject: [scikit-learn] Jeff Levesque: custom json encoder In-Reply-To: <11817B9B-9BE4-443E-831F-ACC7B0014307@yahoo.com> References: <11817B9B-9BE4-443E-831F-ACC7B0014307@yahoo.com> Message-ID: I think you need to describe what use cases you intend once you've encoded the thing. JSON's pretty generic. You can convert any pickle into JSON, but it'll still have the security and versioning issues of a pickle. You can convert to PMML and convert the XML to JSON, but it'd still be limited by what PMML can represent, and not be deserializable into a sklearn estimator. Are you archiving the model? Are you hoping to predict from the model? Are you hoping to continue training the model? Does it need to be deserializable to a sklearn estimator? To a generic estimator not in the central sklearn library? ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeff1evesque at yahoo.com Tue Feb 6 01:06:15 2018 From: jeff1evesque at yahoo.com (Jeffrey Levesque) Date: Tue, 6 Feb 2018 01:06:15 -0500 Subject: [scikit-learn] Jeff Levesque: custom json encoder In-Reply-To: References: <11817B9B-9BE4-443E-831F-ACC7B0014307@yahoo.com> Message-ID: I'd like to be able to reinflate the model from json, or deserialize it, so that it can be used for prediction estimation. For now I'm dealing exclusively with sklearn. 
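Roughly the direction I have in mind is a json encoder that knows about numpy arrays, something like the sketch below (the class and helper names are just mine), plus an object hook to rebuild the arrays on load; my understanding is that ensembles are harder because the fitted trees are Cython objects that would not round-trip this way.

import json
import numpy as np

class NumpyJSONEncoder(json.JSONEncoder):
    # turn numpy arrays and numpy scalars into plain JSON-friendly types
    def default(self, obj):
        if isinstance(obj, np.ndarray):
            return {'__ndarray__': obj.tolist(), 'dtype': str(obj.dtype)}
        if isinstance(obj, np.generic):
            return obj.item()
        return json.JSONEncoder.default(self, obj)

def ndarray_hook(d):
    # inverse of the encoder above, for json.loads(..., object_hook=ndarray_hook)
    if '__ndarray__' in d:
        return np.asarray(d['__ndarray__'], dtype=d['dtype'])
    return d

# e.g. dump a fitted SVC's attributes and read them back:
# blob = json.dumps(fitted_svc.__dict__, cls=NumpyJSONEncoder)
# attrs = json.loads(blob, object_hook=ndarray_hook)
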
I've been successful with pickle serializing + deserializing models, so they can be used for s SVC related predictions. These I think would be the easiest of a custom json encoder would to be created? Has anyone created a custom json encoder + decoder for SVC like SVM or SVR. And to continue, is it possible to create a custom encoder for ensemble methods like bagging, since it's a little more randomized internally? Thank you, Jeff Levesque https://github.com/jeff1evesque > On Feb 5, 2018, at 11:49 PM, Joel Nothman wrote: > > I think you need to describe what use cases you intend once you've encoded the thing. > > JSON's pretty generic. You can convert any pickle into JSON, but it'll still have the security and versioning issues of a pickle. You can convert to PMML and convert the XML to JSON, but it'd still be limited by what PMML can represent, and not be deserializable into a sklearn estimator. > > Are you archiving the model? Are you hoping to predict from the model? Are you hoping to continue training the model? Does it need to be deserializable to a sklearn estimator? To a generic estimator not in the central sklearn library? > ? > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From manuel.castejon at gmail.com Wed Feb 7 10:49:29 2018 From: manuel.castejon at gmail.com (=?UTF-8?Q?Manuel_Castej=C3=B3n_Limas?=) Date: Wed, 7 Feb 2018 16:49:29 +0100 Subject: [scikit-learn] Pipegraph is on its way! Message-ID: Dear all, after some playing with the concept we have developed a module for implementing the functionality of Pipeline in more general contexts as first introduced in a former thread ( https://mail.python.org/ pipermail/scikit-learn/2018-January/002158.html ) In order to expand the possibilities of Pipeline for non linearly sequential workflows a graph like structure has been deployed while keeping as much as possible the already known syntax we all love and honor: X = pd.DataFrame(dict(X=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])) y = 2 * X sc = MinMaxScaler() lm = LinearRegression() steps = [('scaler', sc), ('linear_model', lm)] connections = {'scaler': dict(X='X'), 'linear_model': dict(X=('scaler', 'predict'), y='y')} pgraph = PipeGraph(steps=steps, connections=connections, use_for_fit='all', use_for_predict='all') As you can see the biggest difference for the final user is the dictionary describing the connections. Another major contribution for developers wanting to expand scikit learn is a collection of adapters for scikit learn models in order to provide them a common API irrespectively of whether they originally implemented predict, transform or fit_predict as an atomic operation without predict. These adapters accept as many positional or keyword parameters in their fit predict methods through *pargs and **kwargs. As general as PipeGraph is, it cannot work under the restrictions imposed by GridSearchCV on the input parameters, namely X and y since PipeGraph can accept as many input signals as needed. Thus, an adhoc GridSearchCv version is also needed and we will provide a basic initial version in a later version. We need to write the documentation and we will propose it as a contrib-project in a few days. Best wishes, Manuel Castej?n-Limas -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From sir.perales at gmail.com Wed Feb 7 11:17:32 2018 From: sir.perales at gmail.com (Carlos Perales) Date: Wed, 7 Feb 2018 17:17:32 +0100 Subject: [scikit-learn] Kernel Extreme Learning Machine in Python 3 Message-ID: Dear all, After working with Extreme Learning Machine classifiers for a while in MATLAB, I think it is a surprisingly good supervised classifier for multiclass problems. Although originally it was designed as a single hidden layer neural network with some randon weights, from 2012 it has a similar mathematical formulation than SVM (DOI: https://doi.org/10.1109/TSMCB.2011.2168604 , download link: http://www.neuromorphs.net/nm/raw-attachment/wiki/2015/scc15/ELM-Unified-Learning.pdf ). This allows to use ELM with kernels, just as SVM, but avoiding "one-by-one" and "one-against-rest" techniques. Pull request is 10602 ( https://github.com/scikit-learn/scikit-learn/pull/10602 ), and all contributions are welcome! Un saludo, Carlos Perales. -------------- next part -------------- An HTML attachment was scrubbed... URL: From manuel.castejon at gmail.com Wed Feb 7 12:01:37 2018 From: manuel.castejon at gmail.com (=?UTF-8?Q?Manuel_Castej=C3=B3n_Limas?=) Date: Wed, 7 Feb 2018 18:01:37 +0100 Subject: [scikit-learn] clustering on big dataset Message-ID: Hope this helps! Manuel @Article{Ciampi2008, author="Ciampi, Antonio and Lechevallier, Yves and Limas, Manuel Castej{\'o}n and Marcos, Ana Gonz{\'a}lez", title="Hierarchical clustering of subpopulations with a dissimilarity based on the likelihood ratio statistic: application to clustering massive data sets", journal="Pattern Analysis and Applications", year="2008", month="Jun", day="01", volume="11", number="2", pages="199--220", abstract="The problem of clustering subpopulations on the basis of samples is considered within a statistical framework: a distribution for the variables is assumed for each subpopulation and the dissimilarity between any two populations is defined as the likelihood ratio statistic which compares the hypothesis that the two subpopulations differ in the parameter of their distributions to the hypothesis that they do not. A general algorithm for the construction of a hierarchical classification is described which has the important property of not having inversions in the dendrogram. The essential elements of the algorithm are specified for the case of well-known distributions (normal, multinomial and Poisson) and an outline of the general parametric case is also discussed. Several applications are discussed, the main one being a novel approach to dealing with massive data in the context of a two-step approach. After clustering the data in a reasonable number of `bins' by a fast algorithm such as k-Means, we apply a version of our algorithm to the resulting bins. Multivariate normality for the means calculated on each bin is assumed: this is justified by the central limit theorem and the assumption that each bin contains a large number of units, an assumption generally justified when dealing with truly massive data such as currently found in modern data analysis. However, no assumption is made about the data generating distribution.", issn="1433-755X", doi="10.1007/s10044-007-0088-4", url="https://doi.org/10.1007/s10044-007-0088-4" } 2018-01-04 12:55 GMT+01:00 Joel Nothman : > Can you use nearest neighbors with a KD tree to build a distance matrix > that is sparse, in that distances to all but the nearest neighbors of a > point are (near-)infinite? 
Yes, this again has an additional parameter > (neighborhood size), just as BIRCH has its threshold. I suspect you will > not be able to improve on having another, approximating, parameter. You do > not need to set n_clusters to a fixed value for BIRCH. You only need to > provide another clusterer, which has its own parameters, although you > should be able to experiment with different "global clusterers". > > On 4 January 2018 at 11:04, Shiheng Duan wrote: > >> Yes, it is an efficient method, still, we need to specify the number of >> clusters or the threshold. Is there another way to run hierarchy clustering >> on the big dataset? The main problem is the distance matrix. >> Thanks. >> >> On Tue, Jan 2, 2018 at 6:02 AM, Olivier Grisel >> wrote: >> >>> Have you had a look at BIRCH? >>> >>> http://scikit-learn.org/stable/modules/clustering.html#birch >>> >>> -- >>> Olivier >>> ? >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Wed Feb 7 15:46:43 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 8 Feb 2018 07:46:43 +1100 Subject: [scikit-learn] Pipegraph is on its way! In-Reply-To: References: Message-ID: cool! We have been talking for a while about how to pass other things around grid search and other meta-analysis estimators. This injection approach looks pretty neat as a way to express it. Will need to mull on it. On 8 Feb 2018 2:51 am, "Manuel Castej?n Limas" wrote: > Dear all, > > after some playing with the concept we have developed a module for > implementing the functionality of Pipeline in more general contexts as > first introduced in a former thread ( https://mail.python.org/piperm > ail/scikit-learn/2018-January/002158.html ) > > In order to expand the possibilities of Pipeline for non linearly > sequential workflows a graph like structure has been deployed while keeping > as much as possible the already known syntax we all love and honor: > > X = pd.DataFrame(dict(X=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])) > y = 2 * X > sc = MinMaxScaler() > lm = LinearRegression() > steps = [('scaler', sc), > ('linear_model', lm)] > connections = {'scaler': dict(X='X'), > 'linear_model': dict(X=('scaler', 'predict'), > y='y')} > pgraph = PipeGraph(steps=steps, > connections=connections, > use_for_fit='all', > use_for_predict='all') > > As you can see the biggest difference for the final user is the dictionary > describing the connections. > > Another major contribution for developers wanting to expand scikit learn > is a collection of adapters for scikit learn models in order to provide > them a common API irrespectively of whether they originally implemented > predict, transform or fit_predict as an atomic operation without predict. > These adapters accept as many positional or keyword parameters in their fit > predict methods through *pargs and **kwargs. 
> > As general as PipeGraph is, it cannot work under the restrictions imposed > by GridSearchCV on the input parameters, namely X and y since PipeGraph can > accept as many input signals as needed. Thus, an adhoc GridSearchCv version > is also needed and we will provide a basic initial version in a later > version. > > We need to write the documentation and we will propose it as a > contrib-project in a few days. > > Best wishes, > Manuel Castej?n-Limas > > > > > > > > > > > > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Wed Feb 7 17:29:52 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 7 Feb 2018 17:29:52 -0500 Subject: [scikit-learn] Pipegraph is on its way! In-Reply-To: References: Message-ID: <9f0a744d-d572-15f2-fc9a-f50c7f05459f@gmail.com> Thanks Manuel, that looks pretty cool. Do you have a write-up about it? I don't entirely understand the connections setup. From ahowe42 at gmail.com Wed Feb 7 23:35:31 2018 From: ahowe42 at gmail.com (Andrew Howe) Date: Thu, 8 Feb 2018 07:35:31 +0300 Subject: [scikit-learn] Pipegraph is on its way! In-Reply-To: References: Message-ID: Very cool! Thanks for all the great work. Andrew <~~~~~~~~~~~~~~~~~~~~~~~~~~~> J. Andrew Howe, PhD www.andrewhowe.com http://orcid.org/0000-0002-3553-1990 http://www.linkedin.com/in/ahowe42 https://www.researchgate.net/profile/John_Howe12/ I live to learn, so I can learn to live. - me <~~~~~~~~~~~~~~~~~~~~~~~~~~~> On Wed, Feb 7, 2018 at 6:49 PM, Manuel Castej?n Limas < manuel.castejon at gmail.com> wrote: > Dear all, > > after some playing with the concept we have developed a module for > implementing the functionality of Pipeline in more general contexts as > first introduced in a former thread ( https://mail.python.org/piperm > ail/scikit-learn/2018-January/002158.html ) > > In order to expand the possibilities of Pipeline for non linearly > sequential workflows a graph like structure has been deployed while keeping > as much as possible the already known syntax we all love and honor: > > X = pd.DataFrame(dict(X=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])) > y = 2 * X > sc = MinMaxScaler() > lm = LinearRegression() > steps = [('scaler', sc), > ('linear_model', lm)] > connections = {'scaler': dict(X='X'), > 'linear_model': dict(X=('scaler', 'predict'), > y='y')} > pgraph = PipeGraph(steps=steps, > connections=connections, > use_for_fit='all', > use_for_predict='all') > > As you can see the biggest difference for the final user is the dictionary > describing the connections. > > Another major contribution for developers wanting to expand scikit learn > is a collection of adapters for scikit learn models in order to provide > them a common API irrespectively of whether they originally implemented > predict, transform or fit_predict as an atomic operation without predict. > These adapters accept as many positional or keyword parameters in their fit > predict methods through *pargs and **kwargs. > > As general as PipeGraph is, it cannot work under the restrictions imposed > by GridSearchCV on the input parameters, namely X and y since PipeGraph can > accept as many input signals as needed. Thus, an adhoc GridSearchCv version > is also needed and we will provide a basic initial version in a later > version. 
> > We need to write the documentation and we will propose it as a > contrib-project in a few days. > > Best wishes, > Manuel Castej?n-Limas > > > > > > > > > > > > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From manuel.castejon at gmail.com Fri Feb 9 02:57:09 2018 From: manuel.castejon at gmail.com (=?UTF-8?Q?Manuel_Castej=C3=B3n_Limas?=) Date: Fri, 9 Feb 2018 08:57:09 +0100 Subject: [scikit-learn] Pipegraph is on its way! In-Reply-To: <9f0a744d-d572-15f2-fc9a-f50c7f05459f@gmail.com> References: <9f0a744d-d572-15f2-fc9a-f50c7f05459f@gmail.com> Message-ID: Docs are coming soon. In the meantime , Imagine a first step containing a TrainTestSplit class with a similar behaviour to train_test_split but capable of producing results by using fit and predict (this is a goodie). The inputs will be X, y, z, ... , and the outputs the same names + _train and _test. A second step could be a MinMaxScaler taking only X_train. A third step a linear model using the output from MinMaxScaler as X. This would be written: connections['split'] = {'A': 'X', 'B': 'y'} Meaning that the 'split' step will use the X and y from the fit or predict call calling them A and B internally. If you use, for instance, my_pipegraph.fit(X=myX, y=myY) This step will produce A_train with a piece of myX You can use this later: connections['scaler'] = { 'X': ('split', 'A_train')} Expressing that the output A_train from the split step will be use as input X for the scaler. The output from this step is called 'predict' Finally, for the third step: connections['linear_model'] ={'X': ('scaler', 'predict'), 'y': ('split', 'B_train')} Notice, that if we are talking about an external input variable we don't use a tuple. So the syntax is something like connection[step_label] = {internal_variable: (input_step, variable_there)} Docs are coming anyway. Travis CI, Circle CI and Appveyor have been successfully activated at GitHub.com/mcasl/PipeGraph Sorry if you found mistypos, I use my smartphone for replying. Best Manuel El 7 feb. 2018 11:32 p. m., "Andreas Mueller" escribi?: > Thanks Manuel, that looks pretty cool. > Do you have a write-up about it? I don't entirely understand the > connections setup. > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From karl.erik.fransson at gmail.com Fri Feb 9 07:10:48 2018 From: karl.erik.fransson at gmail.com (Erik Fransson) Date: Fri, 9 Feb 2018 13:10:48 +0100 Subject: [scikit-learn] (no subject) Message-ID: Hi everyone, I have a simple questions in regards to how LassoCV works. In the documentation it states that the best model is selected via cross-validation, however I'm wondering how is this best model constructed. Is it simply running a normal lasso fit with all of the training data available using the optimal alpha? And are these the parameters stored in .coef_? Regards, Erik -------------- next part -------------- An HTML attachment was scrubbed... 
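The behaviour asked about above is easy to check empirically: LassoCV picks alpha_ by cross-validation and then refits on the full training data, so refitting a plain Lasso with that alpha should reproduce the stored coef_. A minimal sketch, using a made-up regression problem purely for illustration:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LassoCV

# Toy data standing in for a real training set.
X, y = make_regression(n_samples=200, n_features=20, noise=1.0, random_state=0)

# alpha_ is chosen by cross-validation; coef_ comes from a final fit on all of X, y.
cv_model = LassoCV(cv=5).fit(X, y)

# Refitting a plain Lasso with the selected alpha should give the same coefficients.
manual = Lasso(alpha=cv_model.alpha_).fit(X, y)
print(cv_model.alpha_)
print(np.allclose(cv_model.coef_, manual.coef_))  # expected to print True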
URL: From gael.varoquaux at normalesup.org Fri Feb 9 07:14:38 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Fri, 9 Feb 2018 13:14:38 +0100 Subject: [scikit-learn] (no subject) In-Reply-To: References: Message-ID: <20180209121438.GA1975304@phare.normalesup.org> On Fri, Feb 09, 2018 at 01:10:48PM +0100, Erik Fransson wrote: > Is it simply running a normal lasso fit with all of the training data available > using the optimal alpha? > And are these the parameters stored in .coef_? Yes. G From manuel.castejon at gmail.com Sat Feb 10 11:27:02 2018 From: manuel.castejon at gmail.com (=?UTF-8?Q?Manuel_Castej=C3=B3n_Limas?=) Date: Sat, 10 Feb 2018 17:27:02 +0100 Subject: [scikit-learn] PipeGraph examples: areas of interest Message-ID: Hi all! The good news is that we made GridSearchCv work on PipeGraph! In order to create diverse examples, we welcome some feedback on which other libraries you use in order to acquire/process data before applying scikit learn. For example: 'I work in computer vision and I usually get image feature using the python bindings provided by OpenCV.' This will help us provide interesting examples. Best Manuel -------------- next part -------------- An HTML attachment was scrubbed... URL: From evgeniya.korneva at kuleuven.be Mon Feb 12 06:40:03 2018 From: evgeniya.korneva at kuleuven.be (Evgeniya Korneva) Date: Mon, 12 Feb 2018 11:40:03 +0000 Subject: [scikit-learn] Multi-Output Decision Trees for mixed classification-regerssion problems In-Reply-To: <1518432075448.84750@kuleuven.be> References: <1518432075448.84750@kuleuven.be> Message-ID: <1518435609373.91570@kuleuven.be> Dear all, For my research, I'm working with multi-output decision trees. In the current sklearn implementation, a tree can predict either several numerical or several categorical targets simultaneously, but not a mixture of those. However, predicting various targets jointly is often beneficial both in terms of speed and accuracy. Because of that, I'm willing to add this functionality. It seems that the only thing to be done is to implement a new node splitting criteria that handles a mixture of nominal and numerical attributes, and then define a new class of models (such as DecisionTreeRegressor or DecisionTreeClassifier, but for mixed output). However, since I'm not an experienced sklearn contributor, I am looking for any hints on how to implement this in effective way, re-using as much functionality already available as possible. Your advice is very welcome. Best, Evgeniya -------------- next part -------------- An HTML attachment was scrubbed... URL: From dmitrii.ignatov at gmail.com Mon Feb 12 08:25:54 2018 From: dmitrii.ignatov at gmail.com (=?utf-8?B?0JTQvNC40YLRgNC40Lkg0JjQs9C90LDRgtC+0LI=?=) Date: Mon, 12 Feb 2018 16:25:54 +0300 Subject: [scikit-learn] Multi-Output Decision Trees for mixed classification-regerssion problems In-Reply-To: <1518435609373.91570@kuleuven.be> References: <1518432075448.84750@kuleuven.be> <1518435609373.91570@kuleuven.be> Message-ID: Just a comment: it would be a useful tool. -Dmitry ?????????? ? iPhone > 12 ????. 2018 ?., ? 14:40, Evgeniya Korneva ???????(?): > > > Dear all, > > For my research, I'm working with multi-output decision trees. In the current sklearn implementation, a tree can predict either several numerical or several categorical targets simultaneously, but not a mixture of those. However, predicting various targets jointly is often beneficial both in terms of speed and accuracy. Because of that, I'm willing to add this functionality. 
> > It seems that the only thing to be done is to implement a new node splitting criteria that handles a mixture of nominal and numerical attributes, and then define a new class of models (such as DecisionTreeRegressor or > > DecisionTreeClassifier, but for mixed output). However, since I'm not an experienced sklearn contributor, I am looking for any hints on how to implement this in effective way, re-using as much functionality already available as possible. > > > Your advice is very welcome. > > Best, > Evgeniya > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From princegosavi12 at gmail.com Mon Feb 12 13:10:01 2018 From: princegosavi12 at gmail.com (prince gosavi) Date: Mon, 12 Feb 2018 23:40:01 +0530 Subject: [scikit-learn] Applying clustering to cosine distance matrix Message-ID: I have generated a cosine distance matrix and would like to apply clustering algorithm to the given matrix. np.shape(distance_matrix)==(14000,14000) I would like to know which clustering suits better and is there any need to process the data further to get it in the form so that a model can be applied. Also any performance tip as the matrix takes around 3-4 hrs of processing. You can find my code here https://github.com/maxyodedara5/BE_Project/blob/master/main.ipynb Code for READ ONLY PURPOSE. -- Regards -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Mon Feb 12 14:14:12 2018 From: se.raschka at gmail.com (Sebastian Raschka) Date: Mon, 12 Feb 2018 14:14:12 -0500 Subject: [scikit-learn] Applying clustering to cosine distance matrix In-Reply-To: References: Message-ID: Hi, by default, the clustering classes from sklearn, (e.g., DBSCAN), take an [num_examples, num_features] array as input, but you can also provide the distance matrix directly, e.g., by instantiating it with metric='precomputed' my_dbscan = DBSCAN(..., metric='precomputed') my_dbscan.fit(my_distance_matrix) Not sure if it helps in that particular case (depending on how many zero elements you have), you can also use a sparse matrix in CSR format (https://docs.scipy.org/doc/scipy-1.0.0/reference/generated/scipy.sparse.csr_matrix.html). Also, you don't need to for-loop through the rows if you want to compute the pair-wise distances, you can simply do that on the complete array. E.g., from sklearn.metrics.pairwise import cosine_distances from scipy import sparse distance_matrix = cosine_distances(sparse.csr_matrix(X), dense_output=False) where X is your "[num_examples, num_features]" array. Best, Sebastian > On Feb 12, 2018, at 1:10 PM, prince gosavi wrote: > > I have generated a cosine distance matrix and would like to apply clustering algorithm to the given matrix. > np.shape(distance_matrix)==(14000,14000) > > I would like to know which clustering suits better and is there any need to process the data further to get it in the form so that a model can be applied. > Also any performance tip as the matrix takes around 3-4 hrs of processing. > You can find my code here https://github.com/maxyodedara5/BE_Project/blob/master/main.ipynb > Code for READ ONLY PURPOSE. 
> -- > Regards > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From joel.nothman at gmail.com Mon Feb 12 15:40:12 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 13 Feb 2018 07:40:12 +1100 Subject: [scikit-learn] Multi-Output Decision Trees for mixed classification-regerssion problems In-Reply-To: References: <1518432075448.84750@kuleuven.be> <1518435609373.91570@kuleuven.be> Message-ID: presuming there are clear applications for this, other models should be able to support mixed targets similarly, like MLP. since we don't really have an API design for this, it might take some time to find consensus on what it should look like. but a PR would be a good way to concretely consider it. -------------- next part -------------- An HTML attachment was scrubbed... URL: From princegosavi12 at gmail.com Mon Feb 12 16:29:21 2018 From: princegosavi12 at gmail.com (prince gosavi) Date: Tue, 13 Feb 2018 02:59:21 +0530 Subject: [scikit-learn] Applying clustering to cosine distance matrix In-Reply-To: <1518466871024.1542079269@boxbe> References: <1518466871024.1542079269@boxbe> Message-ID: Hi, Thanks for those tips Sebastian.That just saved my day. Regards, Rajkumar On Tue, Feb 13, 2018 at 12:44 AM, Sebastian Raschka wrote: > [image: Boxbe] This message is eligible > for Automatic Cleanup! (se.raschka at gmail.com) Add cleanup rule > > | More info > > > Hi, > > by default, the clustering classes from sklearn, (e.g., DBSCAN), take an > [num_examples, num_features] array as input, but you can also provide the > distance matrix directly, e.g., by instantiating it with > metric='precomputed' > > my_dbscan = DBSCAN(..., metric='precomputed') > my_dbscan.fit(my_distance_matrix) > > Not sure if it helps in that particular case (depending on how many zero > elements you have), you can also use a sparse matrix in CSR format ( > https://docs.scipy.org/doc/scipy-1.0.0/reference/ > generated/scipy.sparse.csr_matrix.html). > > Also, you don't need to for-loop through the rows if you want to compute > the pair-wise distances, you can simply do that on the complete array. E.g., > > from sklearn.metrics.pairwise import cosine_distances > from scipy import sparse > > distance_matrix = cosine_distances(sparse.csr_matrix(X), > dense_output=False) > > where X is your "[num_examples, num_features]" array. > > Best, > Sebastian > > > > On Feb 12, 2018, at 1:10 PM, prince gosavi > wrote: > > > > I have generated a cosine distance matrix and would like to apply > clustering algorithm to the given matrix. > > np.shape(distance_matrix)==(14000,14000) > > > > I would like to know which clustering suits better and is there any need > to process the data further to get it in the form so that a model can be > applied. > > Also any performance tip as the matrix takes around 3-4 hrs of > processing. > > You can find my code here https://github.com/ > maxyodedara5/BE_Project/blob/master/main.ipynb > > Code for READ ONLY PURPOSE. > > -- > > Regards > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Regards -------------- next part -------------- An HTML attachment was scrubbed... 
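Putting Sebastian's suggestions together, a minimal end-to-end sketch of the precomputed-distance route (the data below is random, and eps / min_samples are placeholders that would need tuning on the real matrix):

import numpy as np
from scipy import sparse
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import cosine_distances

# Random sparse matrix standing in for the real [num_examples, num_features] data.
rng = np.random.RandomState(0)
X = sparse.csr_matrix(rng.rand(1000, 50))

# All pairwise cosine distances in one vectorized call, no Python-level loop.
D = cosine_distances(X)

# Cluster directly on the precomputed distance matrix.
db = DBSCAN(eps=0.3, min_samples=5, metric='precomputed').fit(D)
print(np.unique(db.labels_))  # the label -1 marks noise points

Note that a dense n_samples x n_samples matrix quickly becomes a bottleneck on its own: 14000 x 14000 float64 values already take roughly 1.5 GB, so for larger collections computing only nearest-neighbour distances (for example with sklearn.neighbors.NearestNeighbors or kneighbors_graph) is usually the more practical route.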
URL: From princegosavi12 at gmail.com Mon Feb 12 16:31:04 2018 From: princegosavi12 at gmail.com (prince gosavi) Date: Tue, 13 Feb 2018 03:01:04 +0530 Subject: [scikit-learn] Multi-Output Decision Trees for mixedclassification-regerssion problems In-Reply-To: <1518468893228.1424429011@boxbe> References: <1518432075448.84750@kuleuven.be> <1518435609373.91570@kuleuven.be> <1518468893228.1424429011@boxbe> Message-ID: Thanks for the reply will definitely try to PR this issue. Regrads On Tue, Feb 13, 2018 at 2:10 AM, Joel Nothman wrote: > [image: Boxbe] This message is eligible > for Automatic Cleanup! (joel.nothman at gmail.com) Add cleanup rule > > | More info > > > presuming there are clear applications for this, other models should be > able to support mixed targets similarly, like MLP. since we don't really > have an API design for this, it might take some time to find consensus on > what it should look like. but a PR would be a good way to concretely > consider it. > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Regards -------------- next part -------------- An HTML attachment was scrubbed... URL: From vaggi.federico at gmail.com Mon Feb 12 16:49:46 2018 From: vaggi.federico at gmail.com (federico vaggi) Date: Mon, 12 Feb 2018 21:49:46 +0000 Subject: [scikit-learn] Applying clustering to cosine distance matrix In-Reply-To: References: <1518466871024.1542079269@boxbe> Message-ID: As a caveat, a lot of clustering algorithms assume that the distance matrix is a proper metric. If your distance is not a proper metric then the results might be meaningless (the narrative docs do a good job of discussing this). On Mon, 12 Feb 2018 at 13:30 prince gosavi wrote: > Hi, > Thanks for those tips Sebastian.That just saved my day. > > Regards, > Rajkumar > > On Tue, Feb 13, 2018 at 12:44 AM, Sebastian Raschka > wrote: > >> [image: Boxbe] This message is eligible >> for Automatic Cleanup! (se.raschka at gmail.com) Add cleanup rule >> >> | More info >> >> > >> Hi, >> >> by default, the clustering classes from sklearn, (e.g., DBSCAN), take an >> [num_examples, num_features] array as input, but you can also provide the >> distance matrix directly, e.g., by instantiating it with >> metric='precomputed' >> >> my_dbscan = DBSCAN(..., metric='precomputed') >> my_dbscan.fit(my_distance_matrix) >> >> Not sure if it helps in that particular case (depending on how many zero >> elements you have), you can also use a sparse matrix in CSR format ( >> https://docs.scipy.org/doc/scipy-1.0.0/reference/generated/scipy.sparse.csr_matrix.html >> ). >> >> Also, you don't need to for-loop through the rows if you want to compute >> the pair-wise distances, you can simply do that on the complete array. E.g., >> >> from sklearn.metrics.pairwise import cosine_distances >> from scipy import sparse >> >> distance_matrix = cosine_distances(sparse.csr_matrix(X), >> dense_output=False) >> >> where X is your "[num_examples, num_features]" array. >> >> Best, >> Sebastian >> >> >> > On Feb 12, 2018, at 1:10 PM, prince gosavi >> wrote: >> > >> > > I have generated a cosine distance matrix and would like to apply >> clustering algorithm to the given matrix. >> > np.shape(distance_matrix)==(14000,14000) >> > >> > I would like to know which clustering suits better and is there any >> need to process the data further to get it in the form so that a model can >> be applied. 
>> > Also any performance tip as the matrix takes around 3-4 hrs of >> processing. >> > You can find my code here >> https://github.com/maxyodedara5/BE_Project/blob/master/main.ipynb >> > Code for READ ONLY PURPOSE. >> > -- >> > Regards >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > Regards > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From princegosavi12 at gmail.com Mon Feb 12 16:58:22 2018 From: princegosavi12 at gmail.com (prince gosavi) Date: Tue, 13 Feb 2018 03:28:22 +0530 Subject: [scikit-learn] Applying clustering to cosine distance matrix In-Reply-To: <1518472380263.1542079269@boxbe> References: <1518466871024.1542079269@boxbe> <1518472380263.1542079269@boxbe> Message-ID: Will look into it.Although I have problem generating cluster as my data is 14000x14000 distance_matrix and it says "Memory Error". I have 6GB RAM. Any insight on this error is welcomed. Regards On Tue, Feb 13, 2018 at 3:19 AM, federico vaggi wrote: > [image: Boxbe] This message is eligible > for Automatic Cleanup! (vaggi.federico at gmail.com) Add cleanup rule > > | More info > > > As a caveat, a lot of clustering algorithms assume that the distance > matrix is a proper metric. If your distance is not a proper metric then > the results might be meaningless (the narrative docs do a good job of > discussing this). > > On Mon, 12 Feb 2018 at 13:30 prince gosavi > wrote: > >> Hi, >> Thanks for those tips Sebastian.That just saved my day. >> >> Regards, >> Rajkumar >> >> On Tue, Feb 13, 2018 at 12:44 AM, Sebastian Raschka > > wrote: >> >>> [image: Boxbe] This message is >>> eligible for Automatic Cleanup! (se.raschka at gmail.com) Add cleanup rule >>> >>> | More info >>> >>> >> >>> Hi, >>> >>> by default, the clustering classes from sklearn, (e.g., DBSCAN), take an >>> [num_examples, num_features] array as input, but you can also provide the >>> distance matrix directly, e.g., by instantiating it with >>> metric='precomputed' >>> >>> my_dbscan = DBSCAN(..., metric='precomputed') >>> my_dbscan.fit(my_distance_matrix) >>> >>> Not sure if it helps in that particular case (depending on how many zero >>> elements you have), you can also use a sparse matrix in CSR format ( >>> https://docs.scipy.org/doc/scipy-1.0.0/reference/ >>> generated/scipy.sparse.csr_matrix.html). >>> >>> Also, you don't need to for-loop through the rows if you want to compute >>> the pair-wise distances, you can simply do that on the complete array. E.g., >>> >>> from sklearn.metrics.pairwise import cosine_distances >>> from scipy import sparse >>> >>> distance_matrix = cosine_distances(sparse.csr_matrix(X), >>> dense_output=False) >>> >>> where X is your "[num_examples, num_features]" array. >>> >>> Best, >>> Sebastian >>> >>> >>> > On Feb 12, 2018, at 1:10 PM, prince gosavi >>> wrote: >>> > >>> >> > I have generated a cosine distance matrix and would like to apply >>> clustering algorithm to the given matrix. 
>>> > np.shape(distance_matrix)==(14000,14000) >>> > >>> > I would like to know which clustering suits better and is there any >>> need to process the data further to get it in the form so that a model can >>> be applied. >>> > Also any performance tip as the matrix takes around 3-4 hrs of >>> processing. >>> > You can find my code here https://github.com/ >>> maxyodedara5/BE_Project/blob/master/main.ipynb >>> > Code for READ ONLY PURPOSE. >>> > -- >>> > Regards >>> > _______________________________________________ >>> > scikit-learn mailing list >>> > scikit-learn at python.org >>> > https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> >> -- >> Regards >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Regards -------------- next part -------------- An HTML attachment was scrubbed... URL: From manuel.castejon at gmail.com Mon Feb 12 19:52:11 2018 From: manuel.castejon at gmail.com (=?UTF-8?Q?Manuel_Castej=C3=B3n_Limas?=) Date: Tue, 13 Feb 2018 01:52:11 +0100 Subject: [scikit-learn] Pipegraph is on its way! In-Reply-To: <9f0a744d-d572-15f2-fc9a-f50c7f05459f@gmail.com> References: <9f0a744d-d572-15f2-fc9a-f50c7f05459f@gmail.com> Message-ID: While we keep working on the docs and figures, here is a little example you all can already run: import numpy as np import pandas as pd from sklearn.preprocessing import MinMaxScaler from sklearn.preprocessing import PolynomialFeatures from sklearn.linear_model import LinearRegression from sklearn.model_selection import GridSearchCV from pipegraph.pipeGraph import PipeGraphClassifier, Concatenator import matplotlib.pyplot as plt from sklearn.datasets import load_iris from sklearn.naive_bayes import GaussianNB from sklearn.svm import SVC from sklearn.neural_network import MLPClassifier iris = load_iris() X = iris.data y = iris.target scaler = MinMaxScaler() gaussian_nb = GaussianNB() svc = SVC() mlp = MLPClassifier() concatenator = Concatenator() steps = [('scaler', scaler), ('gaussian_nb', gaussian_nb), ('svc', svc), ('concat', concatenator), ('mlp', mlp)] connections = { 'scaler': {'X': 'X'}, 'gaussian_nb': {'X': ('scaler', 'predict'), 'y': 'y'}, 'svc': {'X': ('scaler', 'predict'), 'y': 'y'}, 'concat': {'X1': ('scaler', 'predict'), 'X2': ('gaussian_nb', 'predict'), 'X3': ('svc', 'predict')}, 'mlp': {'X': ('concat', 'predict'), 'y': 'y'} } param_grid = {'svc__C': [0.1, 0.5, 1.0], 'mlp__hidden_layer_sizes': [(3,), (6,), (9,),], 'mlp__max_iter': [5000, 10000]} pgraph = PipeGraphClassifier(steps=steps, connections=connections) grid_search_classifier = GridSearchCV(estimator=pgraph, param_grid=param_grid, refit=True) grid_search_classifier.fit(X, y) y_pred = grid_search_classifier.predict(X) grid_search_regressor.best_estimator_.get_params() --- 'predict' is the default output name. One of these days we will simplify the notation to simply the name of the node in case of default output names. Best wishes Manuel 2018-02-07 23:29 GMT+01:00 Andreas Mueller : > Thanks Manuel, that looks pretty cool. > Do you have a write-up about it? I don't entirely understand the > connections setup. 
> _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From princegosavi12 at gmail.com Tue Feb 13 09:53:52 2018 From: princegosavi12 at gmail.com (prince gosavi) Date: Tue, 13 Feb 2018 20:23:52 +0530 Subject: [scikit-learn] Building models for recommendation system Message-ID: Hi, I have a 1000x1000 euclidean distance matrix. The distance is calculated pairwise between each item, i.e. the distance between an item and the remaining items. I would like to know the next step after calculating the distance matrix. Further, please link some resources so that I can get a deeper understanding, because as far as I have researched, most of the websites provide examples with toy datasets which are pretty straightforward. https://github.com/maxyodedara5/BE_Project/blob/master/test.ipynb is the link to the code. CODE FOR READ ONLY PURPOSE. -- Regards -------------- next part -------------- An HTML attachment was scrubbed... URL: From samuel.weber at univ-grenoble-alpes.fr Tue Feb 13 11:42:54 2018 From: samuel.weber at univ-grenoble-alpes.fr (=?UTF-8?Q?Samu=c3=abl_Weber?=) Date: Tue, 13 Feb 2018 17:42:54 +0100 Subject: [scikit-learn] Handle uncertainties in NMF Message-ID: <4853dd3b-15f6-d6fc-393c-68210bf0e7f4@univ-grenoble-alpes.fr> Dear all, First of all, thanks for scikit-learn! I was wondering if handling uncertainties in NMF would be possible. Indeed, in NMF we minimize a Frobenius norm ||X - WH||², so we may quite easily minimize ||(X - WH) / U||², with U the matrix of uncertainty. This would be really helpful in many fields (atmospheric chemistry [1], biology [2], etc.). Is such a feature planned to be implemented in scikit-learn? Best regards, Samuël Weber [1] : https://www.epa.gov/air-research/positive-matrix-factorization-model-environmental-data-analyses [2] : https://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/1471-2105-7-175?site=bmcbioinformatics.biomedcentral.com From princegosavi12 at gmail.com Tue Feb 13 13:31:37 2018 From: princegosavi12 at gmail.com (prince gosavi) Date: Wed, 14 Feb 2018 00:01:37 +0530 Subject: [scikit-learn] Building models for recommendation system In-Reply-To: References: Message-ID: https://github.com/maxyodedara5/BE_Project/blob/master/final/test.ipynb new link for code On Tue, Feb 13, 2018 at 8:23 PM, prince gosavi wrote: > Hi, > I have a 1000x1000 euclidean distance matrix. > The distance is calculated pairwise between each item, i.e. the distance > between an item and the remaining items. > I would like to know the next step after calculating the distance > matrix. > Further, please link some resources so that I can get a deeper > understanding, because > > as far as I have researched, most of the websites provide examples with > toy datasets which are pretty straightforward. > https://github.com/maxyodedara5/BE_Project/blob/master/test.ipynb is the > link to the code. > > CODE FOR READ ONLY PURPOSE. > -- > Regards > -- Regards -------------- next part -------------- An HTML attachment was scrubbed... 
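As a concrete illustration of one possible next step for an item-item distance matrix like the one above: with metric='precomputed', NearestNeighbors can return the closest items to any given item, which is the core of a simple item-based recommender. This is only a sketch; the matrix below is random and merely stands in for the real 1000x1000 one:

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Random symmetric matrix standing in for the real item-item euclidean distances.
rng = np.random.RandomState(0)
A = rng.rand(1000, 1000)
D = (A + A.T) / 2.0
np.fill_diagonal(D, 0.0)

# With metric='precomputed', the distance matrix itself is the "training data".
nn = NearestNeighbors(n_neighbors=6, metric='precomputed').fit(D)
distances, indices = nn.kneighbors(D[:1])  # neighbours of item 0
print(indices[0][1:])                      # the 5 items most similar to item 0

Recommending then amounts to returning, for the items a user already liked, their nearest neighbours that the user has not interacted with yet.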
URL: From gael.varoquaux at normalesup.org Tue Feb 13 16:37:11 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Tue, 13 Feb 2018 22:37:11 +0100 Subject: [scikit-learn] Handle uncertainties in NMF In-Reply-To: <4853dd3b-15f6-d6fc-393c-68210bf0e7f4@univ-grenoble-alpes.fr> References: <4853dd3b-15f6-d6fc-393c-68210bf0e7f4@univ-grenoble-alpes.fr> Message-ID: <20180213213711.GZ1181231@phare.normalesup.org> Hi Samu?l, On Tue, Feb 13, 2018 at 05:42:54PM +0100, Samu?l Weber wrote: > I was wondering if handling uncertainties in NMF would be possible. > Indeed, in NMF we minimize a Frobenius norm ||X - WH||?, so we may > quite easily minimize ||(X - WH) / U||?, with U the matrix of > uncertainty. You can divide your data X by U, run the standard matrix factorization solver, and multiply the resulting matrix H by U and you'll get the result that you want. Best, Ga?l From manjunathgoudreddy at gmail.com Wed Feb 14 06:10:01 2018 From: manjunathgoudreddy at gmail.com (Manjunath Goudreddy) Date: Wed, 14 Feb 2018 11:10:01 +0000 Subject: [scikit-learn] Building models for recommendation system In-Reply-To: References: Message-ID: Hello, If you after video/music recommendation set, I recommend you to check websites like kaggle and Analytics Vidya. However recently there was a competition organised by kaggle which is to do with Music recommendation and here is the link to the dataset. https://www.kaggle.com/c/kkbox-music-recommendation-challenge I think we are ought to keep the mailing list specific to scikit-learn. regards Manjunath On Tue, Feb 13, 2018 at 6:31 PM, prince gosavi wrote: > https://github.com/maxyodedara5/BE_Project/blob/master/final/test.ipynb > new link for code > > On Tue, Feb 13, 2018 at 8:23 PM, prince gosavi > wrote: > >> Hi, >> I have a *1000x1000* euclidean* distance matrix.* >> The distance is calculated pairwise between each item i.e *distance >> between an ITEM with remaining ITEMS.* >> I would* like to know* the *next step after calculating the distance >> matrix.* >> Further, please* link some resources* so that I can get deep >> understanding because >> >> *as far as I have researched most of the websites provide examples with >> toy dataset which are pretty straight forward.* >> https://github.com/maxyodedara5/BE_Project/blob/master/test.ipynb is the >> link to the code >> >> CODE FOR READ ONLY PURPOSE. >> -- >> Regards >> > > > > -- > Regards > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From princegosavi12 at gmail.com Wed Feb 14 07:26:13 2018 From: princegosavi12 at gmail.com (prince gosavi) Date: Wed, 14 Feb 2018 17:56:13 +0530 Subject: [scikit-learn] Building models for recommendation system In-Reply-To: References: Message-ID: Hi, Thanks for the response and Sorry for the trouble I will keep that in mind. Regards On Feb 14, 2018 16:40, "Manjunath Goudreddy" wrote: > Hello, > > If you after video/music recommendation set, I recommend you to check > websites like kaggle and Analytics Vidya. > However recently there was a competition organised by kaggle which is to > do with Music recommendation and here is the link to the dataset. > > https://www.kaggle.com/c/kkbox-music-recommendation-challenge > > > I think we are ought to keep the mailing list specific to scikit-learn. 
> > regards > Manjunath > > On Tue, Feb 13, 2018 at 6:31 PM, prince gosavi > wrote: > >> https://github.com/maxyodedara5/BE_Project/blob/master/final/test.ipynb >> new link for code >> >> On Tue, Feb 13, 2018 at 8:23 PM, prince gosavi >> wrote: >> >>> Hi, >>> I have a *1000x1000* euclidean* distance matrix.* >>> The distance is calculated pairwise between each item i.e *distance >>> between an ITEM with remaining ITEMS.* >>> I would* like to know* the *next step after calculating the distance >>> matrix.* >>> Further, please* link some resources* so that I can get deep >>> understanding because >>> >>> *as far as I have researched most of the websites provide examples with >>> toy dataset which are pretty straight forward.* >>> https://github.com/maxyodedara5/BE_Project/blob/master/test.ipynb is >>> the link to the code >>> >>> CODE FOR READ ONLY PURPOSE. >>> -- >>> Regards >>> >> >> >> >> -- >> Regards >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shiduan at ucdavis.edu Wed Feb 14 18:09:43 2018 From: shiduan at ucdavis.edu (Shiheng Duan) Date: Wed, 14 Feb 2018 15:09:43 -0800 Subject: [scikit-learn] KMeans cluster Message-ID: Hello all, In KMeans cluster, there is a parameter n_init. It shows that the algorithm will run n_init times and output the best. I wonder how to compare the output of each run. Can we get the score for each run? Thanks. -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Wed Feb 14 22:46:24 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 15 Feb 2018 14:46:24 +1100 Subject: [scikit-learn] KMeans cluster In-Reply-To: References: Message-ID: you can repeatedly use n_init=1? -------------- next part -------------- An HTML attachment was scrubbed... URL: From tevang3 at gmail.com Thu Feb 15 12:37:31 2018 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Thu, 15 Feb 2018 18:37:31 +0100 Subject: [scikit-learn] custom loss function in RandomForestRegressor Message-ID: Greetings, The feature importance calculated by the RandomForest implementation is a very useful feature. I personally use it to select the best features because it is simple and fast, and then I train MLPRegressors. The limitation of this approach is that although I can control the loss function of the MLPRegressor (I have modified scikit-learn's implementation to accept an arbitrary loss function), I cannot do the same with RandomForestRegressor, and hence I have to rely on 'mse' which is not in accordance with the loss functions I use in MLPs. Today I was looking at the _criterion.pyx file: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pyx However, the code is in Cython and I find it hard to follow. I know that for Regression the relevant class are Criterion(), RegressionCriterion(Criterion), and MSE(RegressionCriterion). My question is: is it possible to write a class that takes an arbitrary function "loss(predictions, targets)" to calculate the loss and impurity of the nodes? 
thanks, Thomas -- ====================================================================== Dr Thomas Evangelidis Post-doctoral Researcher CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/2S049, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Thu Feb 15 12:49:46 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Thu, 15 Feb 2018 12:49:46 -0500 Subject: [scikit-learn] custom loss function in RandomForestRegressor In-Reply-To: References: Message-ID: Yes, but if you write it in Python, not Cython, it will be unbearably slow. On 02/15/2018 12:37 PM, Thomas Evangelidis wrote: > Greetings, > > The feature importance calculated by the RandomForest implementation > is a very useful feature. I personally use it to select the best > features because it is simple and fast, and then I train > MLPRegressors. The limitation of this approach is that although I can > control the loss function of the MLPRegressor (I have modified > scikit-learn's implementation to accept an arbitrary loss function), I > cannot do the same with RandomForestRegressor, and hence I have to > rely on 'mse' which is not in accordance with the loss functions I use > in MLPs. Today I was looking at the _criterion.pyx file: > > https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pyx > > However, the code is in Cython and I find it hard to follow. I know > that for Regression the relevant class are Criterion(), > RegressionCriterion(Criterion), and MSE(RegressionCriterion). My > question is: is it possible to write a class that takes an arbitrary > function "loss(predictions, targets)" to calculate the loss and > impurity of the nodes? > > thanks, > Thomas > > > -- > > ====================================================================== > > Dr Thomas Evangelidis > > Post-doctoral Researcher > > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/2S049, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Thu Feb 15 12:50:38 2018 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Thu, 15 Feb 2018 18:50:38 +0100 Subject: [scikit-learn] custom loss function in RandomForestRegressor In-Reply-To: References: Message-ID: The ClassificationCriterion and RegressionCriterion are now exposed in the _criterion.pxd. It will allow you to create your own criterion. So you can write your own Criterion with a given loss by implementing the methods which are required in the trees. Then you can pass an instance of this criterion to the tree and it should work. On 15 February 2018 at 18:37, Thomas Evangelidis wrote: > Greetings, > > The feature importance calculated by the RandomForest implementation is a > very useful feature. I personally use it to select the best features > because it is simple and fast, and then I train MLPRegressors. 
The > limitation of this approach is that although I can control the loss > function of the MLPRegressor (I have modified scikit-learn's implementation > to accept an arbitrary loss function), I cannot do the same with > RandomForestRegressor, and hence I have to rely on 'mse' which is not in > accordance with the loss functions I use in MLPs. Today I was looking at > the _criterion.pyx file: > > https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_ > criterion.pyx > > However, the code is in Cython and I find it hard to follow. I know that > for Regression the relevant class are Criterion(), > RegressionCriterion(Criterion), and MSE(RegressionCriterion). My question > is: is it possible to write a class that takes an arbitrary function > "loss(predictions, targets)" to calculate the loss and impurity of the > nodes? > > thanks, > Thomas > > > -- > > ====================================================================== > > Dr Thomas Evangelidis > > Post-doctoral Researcher > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/2S049, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Guillaume Lemaitre INRIA Saclay - Parietal team Center for Data Science Paris-Saclay https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Thu Feb 15 12:59:52 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Thu, 15 Feb 2018 12:59:52 -0500 Subject: [scikit-learn] custom loss function in RandomForestRegressor In-Reply-To: References: Message-ID: <80cf87e6-9fee-2d71-e559-906259f10cde@gmail.com> I wonder whether this (together with the caveat about it being slow if doing python) should go into the FAQ. On 02/15/2018 12:50 PM, Guillaume Lema?tre wrote: > The ClassificationCriterion and RegressionCriterion are now exposed in > the _criterion.pxd. It will allow you to create your own criterion. > So you can write your own Criterion with a given loss by implementing > the methods which are required in the trees. > Then you can pass an instance of this criterion to the tree and it > should work. > > On 15 February 2018 at 18:37, Thomas Evangelidis > wrote: > > Greetings, > > The feature importance calculated by the RandomForest > implementation is a very useful feature. I personally use it to > select the best features because it is simple and fast, and then I > train MLPRegressors. The limitation of this approach is that > although I can control the loss function of the MLPRegressor (I > have modified scikit-learn's implementation to accept an arbitrary > loss function), I cannot do the same with RandomForestRegressor, > and hence I have to rely on 'mse' which is not in accordance with > the loss functions I use in MLPs. Today I was looking at the > _criterion.pyx file: > > https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pyx > > > However, the code is in Cython and I find it hard to follow. I > know that for Regression the relevant class are Criterion(), > RegressionCriterion(Criterion), and MSE(RegressionCriterion). 
My > question is: is it possible to write a class that takes an > arbitrary function "loss(predictions, targets)" to calculate the > loss and impurity of the nodes? > > thanks, > Thomas > > > -- > > ====================================================================== > > Dr Thomas Evangelidis > > Post-doctoral Researcher > > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/2S049, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > -- > Guillaume Lemaitre > INRIA Saclay - Parietal team > Center for Data Science Paris-Saclay > https://glemaitre.github.io/ > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From tevang3 at gmail.com Thu Feb 15 13:13:47 2018 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Thu, 15 Feb 2018 19:13:47 +0100 Subject: [scikit-learn] custom loss function in RandomForestRegressor In-Reply-To: References: Message-ID: Sorry I don't know Cython at all. _criterion.pxd is like the header file in C++? I see that it contains class, function and variable definitions and their description in comments. class Criterion is an Interface, doesn't have function definitions. By "writing your own criterion with a given loss" you mean writing a class like MSE(RegressionCriterion)? On 15 February 2018 at 18:50, Guillaume Lema?tre wrote: > The ClassificationCriterion and RegressionCriterion are now exposed in the > _criterion.pxd. It will allow you to create your own criterion. > So you can write your own Criterion with a given loss by implementing the > methods which are required in the trees. > Then you can pass an instance of this criterion to the tree and it should > work. > > > > > > -- > Guillaume Lemaitre > INRIA Saclay - Parietal team > Center for Data Science Paris-Saclay > https://glemaitre.github.io/ > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- ====================================================================== Dr Thomas Evangelidis Post-doctoral Researcher CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/2S049, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Thu Feb 15 13:28:49 2018 From: g.lemaitre58 at gmail.com (Guillaume Lemaitre) Date: Thu, 15 Feb 2018 19:28:49 +0100 Subject: [scikit-learn] custom loss function in RandomForestRegressor In-Reply-To: References: Message-ID: <20180215182849.5115986.63202.48707@gmail.com> An HTML attachment was scrubbed... 
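To clarify what such a class has to provide, here is the arithmetic performed by the built-in 'mse' criterion at each candidate split, written in plain NumPy purely as an illustration; the real implementation has to expose these quantities through the Cython Criterion interface rather than through Python functions like these:

import numpy as np

def mse_impurity(y):
    # Node impurity for 'mse': variance of the targets in the node,
    # i.e. the mean squared error of predicting the node mean.
    return np.var(y)

def split_improvement(y_parent, y_left, y_right, impurity=mse_impurity):
    # Decrease in weighted impurity obtained by a candidate split.
    # Conceptually, any node-level loss could be plugged in here.
    n = float(len(y_parent))
    return (impurity(y_parent)
            - len(y_left) / n * impurity(y_left)
            - len(y_right) / n * impurity(y_right))

y = np.array([1.0, 1.2, 0.9, 5.0, 5.3, 4.8])
print(split_improvement(y, y[:3], y[3:]))  # large gain: the split separates the two groups

As an aside, if the loss needed happens to be the mean absolute error, recent releases already accept criterion='mae' in DecisionTreeRegressor and RandomForestRegressor, which avoids writing any Cython.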
URL: From tevang3 at gmail.com Thu Feb 15 14:46:31 2018 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Thu, 15 Feb 2018 20:46:31 +0100 Subject: [scikit-learn] custom loss function in RandomForestRegressor In-Reply-To: <20180215182849.5115986.63202.48707@gmail.com> References: <20180215182849.5115986.63202.48707@gmail.com> Message-ID: Is it possible to compile just _criterion.pyx and _criterion.pxd files by using "importpyx" or any alternative way instead of compiling the whole sklearn library every time I introduce a change? Dne 15. 2. 2018 19:29 napsal u?ivatel "Guillaume Lemaitre" < g.lemaitre58 at gmail.com>: Yes you are right pxd are the header and pyx the definition. You need to write a class as MSE. Criterion is an abstract class or base class (I don't have it under the eye) @Andy: if I recall the PR, we made the classes public to enable such custom criterion. However, ?it is not documented since we were not officially supporting it. So this is an hidden feature. We could always discuss to make this feature more visible and document it. Guillaume Lemaitre INRIA Saclay Ile-de-France / Equipe PARIETAL guillaume.lemaitre at inria.fr - https://glemaitre.github.io/ *From: *Thomas Evangelidis *Sent: *Thursday, 15 February 2018 19:15 *To: *Scikit-learn mailing list *Reply To: *Scikit-learn mailing list *Subject: *Re: [scikit-learn] custom loss function in RandomForestRegressor Sorry I don't know Cython at all. _criterion.pxd is like the header file in C++? I see that it contains class, function and variable definitions and their description in comments. class Criterion is an Interface, doesn't have function definitions. By "writing your own criterion with a given loss" you mean writing a class like MSE(RegressionCriterion)? On 15 February 2018 at 18:50, Guillaume Lema?tre wrote: > The ClassificationCriterion and RegressionCriterion are now exposed in the > _criterion.pxd. It will allow you to create your own criterion. > So you can write your own Criterion with a given loss by implementing the > methods which are required in the trees. > Then you can pass an instance of this criterion to the tree and it should > work. > > > > > > -- > Guillaume Lemaitre > INRIA Saclay - Parietal team > Center for Data Science Paris-Saclay > https://glemaitre.github.io/ > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- ====================================================================== Dr Thomas Evangelidis Post-doctoral Researcher CEITEC - Central European Institute of Technology Masaryk University Kamenice 5/A35/2S049, 62500 Brno, Czech Republic email: tevang at pharm.uoa.gr tevang3 at gmail.com website: https://sites.google.com/site/thomasevangelidishomepage/ _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Thu Feb 15 15:27:04 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Thu, 15 Feb 2018 15:27:04 -0500 Subject: [scikit-learn] custom loss function in RandomForestRegressor In-Reply-To: <20180215182849.5115986.63202.48707@gmail.com> References: <20180215182849.5115986.63202.48707@gmail.com> Message-ID: On 02/15/2018 01:28 PM, Guillaume Lemaitre wrote: > Yes you are right pxd are the header and pyx the definition. You need > to write a class as MSE. 
Criterion is an abstract class or base class > (I don't have it under the eye) > > @Andy: if I recall the PR, we made the classes public to enable such > custom criterion. However, ?it is not documented since we were not > officially supporting it. So this is an hidden feature. We could > always discuss to make this feature more visible and document it. Well maybe not go as far as giving examples, but this question has been on the list >10 times. -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Thu Feb 15 16:05:35 2018 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Thu, 15 Feb 2018 22:05:35 +0100 Subject: [scikit-learn] custom loss function in RandomForestRegressor In-Reply-To: References: <20180215182849.5115986.63202.48707@gmail.com> Message-ID: Calling `python setup.py build_ext --inplace` (also `make in`) will only recompile the files which change without recompiling everything. However, it is true that it can lead to some error which require a clean and recompile everything. On 15 February 2018 at 20:46, Thomas Evangelidis wrote: > Is it possible to compile just _criterion.pyx and _criterion.pxd files by > using "importpyx" or any alternative way instead of compiling the whole > sklearn library every time I introduce a change? > > Dne 15. 2. 2018 19:29 napsal u?ivatel "Guillaume Lemaitre" < > g.lemaitre58 at gmail.com>: > > Yes you are right pxd are the header and pyx the definition. You need to > write a class as MSE. Criterion is an abstract class or base class (I don't > have it under the eye) > > @Andy: if I recall the PR, we made the classes public to enable such > custom criterion. However, ?it is not documented since we were not > officially supporting it. So this is an hidden feature. We could always > discuss to make this feature more visible and document it. > > Guillaume Lemaitre > INRIA Saclay Ile-de-France / Equipe PARIETAL > guillaume.lemaitre at inria.fr - https://glemaitre.github.io/ > *From: *Thomas Evangelidis > *Sent: *Thursday, 15 February 2018 19:15 > *To: *Scikit-learn mailing list > *Reply To: *Scikit-learn mailing list > *Subject: *Re: [scikit-learn] custom loss function in > RandomForestRegressor > > Sorry I don't know Cython at all. _criterion.pxd is like the header file > in C++? I see that it contains class, function and variable definitions and > their description in comments. > > class Criterion is an Interface, doesn't have function definitions. By > "writing your own criterion with a given loss" you mean writing a class > like MSE(RegressionCriterion)? > > > On 15 February 2018 at 18:50, Guillaume Lema?tre > wrote: > >> The ClassificationCriterion and RegressionCriterion are now exposed in >> the _criterion.pxd. It will allow you to create your own criterion. >> So you can write your own Criterion with a given loss by implementing the >> methods which are required in the trees. >> Then you can pass an instance of this criterion to the tree and it should >> work. 
>> >> >> >> >> >> -- >> Guillaume Lemaitre >> INRIA Saclay - Parietal team >> Center for Data Science Paris-Saclay >> https://glemaitre.github.io/ >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > > ====================================================================== > > Dr Thomas Evangelidis > > Post-doctoral Researcher > CEITEC - Central European Institute of Technology > Masaryk University > Kamenice 5/A35/2S049, > 62500 Brno, Czech Republic > > email: tevang at pharm.uoa.gr > > tevang3 at gmail.com > > > website: https://sites.google.com/site/thomasevangelidishomepage/ > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Guillaume Lemaitre INRIA Saclay - Parietal team Center for Data Science Paris-Saclay https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Thu Feb 15 16:06:08 2018 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Thu, 15 Feb 2018 22:06:08 +0100 Subject: [scikit-learn] custom loss function in RandomForestRegressor In-Reply-To: References: <20180215182849.5115986.63202.48707@gmail.com> Message-ID: > 10 times means that we could write something in the doc :) On 15 February 2018 at 21:27, Andreas Mueller wrote: > > > On 02/15/2018 01:28 PM, Guillaume Lemaitre wrote: > > Yes you are right pxd are the header and pyx the definition. You need to > write a class as MSE. Criterion is an abstract class or base class (I don't > have it under the eye) > > @Andy: if I recall the PR, we made the classes public to enable such > custom criterion. However, ?it is not documented since we were not > officially supporting it. So this is an hidden feature. We could always > discuss to make this feature more visible and document it. > > Well maybe not go as far as giving examples, but this question has been on > the list >10 times. > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Guillaume Lemaitre INRIA Saclay - Parietal team Center for Data Science Paris-Saclay https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From sergio.peignier.zapata at gmail.com Fri Feb 16 04:51:58 2018 From: sergio.peignier.zapata at gmail.com (peignier sergio) Date: Fri, 16 Feb 2018 10:51:58 +0100 Subject: [scikit-learn] transfer-learning for random forests Message-ID: Hello, I recently begun a research project on Transfer Learning with some colleagues. We would like to contribute to scikit-learn incorporating Transfer Learning functions for Random Forests as described in this recent paper: ** *https://arxiv.org/abs/1511.01258* * *Before starting we would like to ensure that no existing project is ongoing. Thanks! BR, Sergio Peignier -------------- next part -------------- An HTML attachment was scrubbed... 
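While that contribution takes shape, the closest thing the current API offers is a crude baseline rather than real transfer: grow most of the forest on the source domain and then add trees fitted on the target domain via warm_start. To be clear, this is not the algorithm of the paper cited above, only a sketch of what is already possible today, on purely synthetic stand-in data:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Stand-ins for a large source domain and a small, related target domain.
X_src, y_src = make_regression(n_samples=2000, n_features=10, random_state=0)
X_tgt, y_tgt = make_regression(n_samples=200, n_features=10, random_state=1)

rf = RandomForestRegressor(n_estimators=100, warm_start=True, random_state=0)
rf.fit(X_src, y_src)        # 100 trees grown on the source data
rf.n_estimators = 130
rf.fit(X_tgt, y_tgt)        # warm_start adds 30 trees grown on the target data
print(len(rf.estimators_))  # 130; predictions average over both groups of trees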
URL: From manuel.castejon at gmail.com Fri Feb 16 14:53:27 2018 From: manuel.castejon at gmail.com (=?UTF-8?Q?Manuel_Castej=C3=B3n_Limas?=) Date: Fri, 16 Feb 2018 20:53:27 +0100 Subject: [scikit-learn] PieGraph: First examples and documentation Message-ID: Dear all, We have produced some documentation for the PipeGraph module. Essentially it consists of the API for the two main interfaces: PipeGraphRegressor and PipeGraphClassifier. I guess that at this point the best experience comes from reading the examples and watching the diagrams. These examples are more suggestive than exhaustive though. Our purpose is to present the project in this initial form in order to hear all your comments for making it as useful for you all as possible. These are the links: - The documentation: https://mcasl.github.io/PipeGraph/auto_examples/index.html - The module sources: https://mcasl.github.io/PipeGraph/ Best Manuel -------------- next part -------------- An HTML attachment was scrubbed... URL: From theodore.danka at gmail.com Mon Feb 19 02:45:25 2018 From: theodore.danka at gmail.com (Tivadar Danka) Date: Mon, 19 Feb 2018 08:45:25 +0100 Subject: [scikit-learn] Announcing modAL: a modular active learning framework Message-ID: Dear scikit-learn community! It is my pleasure to announce modAL, a modular active learning framework for Python3, built on top of scikit-learn. Designed with modularity, flexibility and extensibility in mind, it allows the rapid development of active learning workflows with nearly complete freedom. It is aimed for researchers and practitioners, where fast prototyping is essential for testing and developing active learning pipelines. modAL is quite young and under constant improvement. Any feedback, feature request or contribution are very welcome! The package can be installed via pip: pip3 install modAL The repository, tutorials and documentation are available at - GitHub: https://github.com/cosmic-cortex/modAL - Webpage: https://cosmic-cortex.github.io/modAL Cheers, Tivadar -------------------------------------- Tivadar Danka postdoctoral researcher BIOMAG group, MTA-BRC http://www.tivadardanka.com twitter: @TivadarDanka -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Mon Feb 19 08:07:57 2018 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Mon, 19 Feb 2018 14:07:57 +0100 Subject: [scikit-learn] Announcing modAL: a modular active learning framework In-Reply-To: References: Message-ID: It looks nice, thanks for sharing. Do you plan to couple the active learner with a UX-optimized labeling interface (for instance with a react.js or similar frontend and a flask or similar backend)? -- Olivier ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbbrown at kuhp.kyoto-u.ac.jp Mon Feb 19 09:58:27 2018 From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.) Date: Mon, 19 Feb 2018 23:58:27 +0900 Subject: [scikit-learn] Announcing modAL: a modular active learning framework In-Reply-To: References: Message-ID: Dear Dr. Danka, This is a very nice generalization you have built. My group and I have published multiple papers on using active learning for drug discovery model creation, built on top of scikit-learn. 
(2017) Future Med Chem : https://dx.doi.org/10.4155/fmc-2016-0197 (*Most downloaded paper of the year) (Open Access) (2017) J Comput-Aided Chem : https://dx.doi.org/10.2751/jcac.18.124 (Open Access) (2018) ChemMedChem : https://dx.doi.org/10.1002/cmdc.201700677 In our work, we built a similar framework to modAL, though in our framework the iterative model building is done on a fully labeled (Y) set of examples, and we are more interested in knowing: (1) How fast learning converges within some convergence criteria (e.g., how many drugs must be in a model, given an evaluation metric), (2) Which examples are picked across repeated executions of AL (e.g., which drugs appear to be the most informative for model construction), (3) How much diversity is there in the examples picked (e.g., how different are the drugs selected by AL - visualized in the 2017 FutureMedChem paper), and (4) How dependent are actively learned models on descriptors (e.g., do different representations affect the speed of performance convergence?). I think some, if not all, of these questions are also answerable in your framework. Also, with regards to point (1) and evaluation metrics, I recently came up with an idea to generically analyze the nature of 2-class prediction performance metrics independent of the model methodology used: (2018) Molecular Informatics : https://dx.doi.org/10.1002/minf.201700127 (Open Access) You can find the philosophy of this article embedded in the active learning experiments performed in the 2018 ChemMedChem article. If you or anyone else on this list is interested in active learning and chemistry, please drop me a line. Again - very nice job, and best wishes for continued development. Sincerely, J.B. Brown Kyoto University Graduate School of Medicine 2018-02-19 16:45 GMT+09:00 Tivadar Danka : > Dear scikit-learn community! > > It is my pleasure to announce modAL, a modular active learning framework > for Python3, built on top of scikit-learn. Designed with modularity, > flexibility and extensibility in mind, it allows the rapid development of > active learning workflows with nearly complete freedom. It is aimed for > researchers and practitioners, where fast prototyping is essential for > testing and developing active learning pipelines. > > modAL is quite young and under constant improvement. Any feedback, feature > request or contribution are very welcome! > > The package can be installed via pip: > pip3 install modAL > > The repository, tutorials and documentation are available at > - GitHub: https://github.com/cosmic-cortex/modAL > - Webpage: https://cosmic-cortex.github.io/modAL > > Cheers, > Tivadar > > -------------------------------------- > Tivadar Danka > postdoctoral researcher > BIOMAG group, MTA-BRC > http://www.tivadardanka.com > twitter: @TivadarDanka > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From theodore.danka at gmail.com Mon Feb 19 15:02:18 2018 From: theodore.danka at gmail.com (Tivadar Danka) Date: Mon, 19 Feb 2018 21:02:18 +0100 Subject: [scikit-learn] Announcing modAL: a modular active learning framework In-Reply-To: References: Message-ID: Yes, eventually there will be a labelling interface. modAL will be used in an interactive cell explorer tool for molecular and cell biologists, for which a labelling interface will be developed, but this is further down the line. 
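For readers unfamiliar with what such an active learning loop involves, here is a minimal pool-based uncertainty-sampling sketch written directly against scikit-learn; this is purely illustrative of the general idea, not modAL's API, and all names and data below are made up:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_pool, y_pool = make_classification(n_samples=500, n_features=20, random_state=0)
rng = np.random.RandomState(0)
labeled = list(rng.choice(len(X_pool), size=10, replace=False))  # initial labeled seed

clf = RandomForestClassifier(n_estimators=50, random_state=0)
for _ in range(20):  # 20 query rounds
    clf.fit(X_pool[labeled], y_pool[labeled])
    proba = clf.predict_proba(X_pool)
    uncertainty = 1.0 - proba.max(axis=1)   # least-confident query strategy
    uncertainty[labeled] = -np.inf          # never re-query an already labeled point
    query = int(np.argmax(uncertainty))
    labeled.append(query)                   # in practice an oracle/human labels this point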
-------------------------------------- Tivadar Danka postdoctoral researcher BIOMAG group, MTA-BRC http://www.tivadardanka.com twitter: @TivadarDanka On 19 February 2018 at 14:07, Olivier Grisel wrote: > It looks nice, thanks for sharing. > > Do you plan to couple the active learner with a UX-optimized labeling > interface (for instance with a react.js or similar frontend and a flask or > similar backend)? > > -- > Olivier > ? > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From djacques at uwalumni.com Tue Feb 20 13:06:06 2018 From: djacques at uwalumni.com (Dale Jacques) Date: Tue, 20 Feb 2018 12:06:06 -0600 Subject: [scikit-learn] New Transformer to Support Multiple Column Pipelines & One Hot Encoding Message-ID: Hello all, Long time lurker, first time emailer. I have two small contributions I would like to propose to the email list. I was working on a project this weekend that was using both categorical and numerical columns to predict a final output. I needed to save my transformations to make future predictions and grid search over multiple models and parameters, so sklearn pipelines were the obvious answer. I setup a pipeline, grid searched, then pickled the best model to use for future predictions. This worked well, but I ran into two issues. *1). I needed a transformer to select individual columns in my pipeline. *I needed to apply unique transformations to each column in my data, then recombine with a FeatureUnion. I realized there is not a supported transformer to extract a specific column within pipelines. See this issue here as an example . I created a transformation that explicitly extracts columns of interest for use in a pipeline with FeatureUnion. A FunctionTransformer will solve this issue, but I feel as if sklearn should directly and explicitly support this functionality. I believe this will make pipelines significantly more intuitive and accessible for most users. *2). One hot encoding requires arrays that are already integers.* You can find a similar issue here . This can be accomplished using Pandas.get_dummies() (where the transformation cannot be saved to apply to future predictions) or by using a scikit-learn LabelBinarizer transformation. LabelBinarizer is designed to transform y and does not have a method to pass x and y in a pipeline. This breaks scikit-learn pipelines. I built a LabelBinarizer transformation that can be used with FeatureUnion in pipelines. This issue may be moot with the new CategoricalEncoder that is about to be released. Does the community believe I should pursue contributing either of these? -- Cheers, DJ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jorisvandenbossche at gmail.com Tue Feb 20 14:53:20 2018 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Tue, 20 Feb 2018 20:53:20 +0100 Subject: [scikit-learn] New Transformer to Support Multiple Column Pipelines & One Hot Encoding In-Reply-To: References: Message-ID: Hi Dale, Those two issues you mention are indeed current bottlenecks of sklearn's API, but we are currently working on trying to solve them: 1) ColumnTransformer to be able to apply different transformers to different columns: https://github.com/scikit-learn/scikit-learn/pull/9012/ 2) As you mention, there is the CategoricalEncoder, which is exactly meant to solve that problem. So indeed, with the upcoming release of sklearn, this will be solved. Both are already in the works (or have been merged), but further contributions would certainly be welcome! In the first place, testing out this functionality, seeing how it fits into your workflow and pipelines, and provide feedback on this is very valuable. It's not yet released or merged, so we can still make changes if necessary (for CategoricalEncoder you can use sklearn master to test, for the ColumnTransformer you will need to checkout the PR I mentioned above). Secondly, there are still some open issues to further improve the CategoricalEncoder, and help for those is also certainly welcome (see eg https://github.com/scikit-learn/scikit-learn/issues/10181, https://github.com/scikit-learn/scikit-learn/issues/10465, some kind of 'drop_first' parameter, ..) Best, Joris 2018-02-20 19:06 GMT+01:00 Dale Jacques : > Hello all, > > Long time lurker, first time emailer. > > I have two small contributions I would like to propose to the email list. > > I was working on a project this weekend that was using both categorical > and numerical columns to predict a final output. I needed to save my > transformations to make future predictions and grid search over multiple > models and parameters, so sklearn pipelines were the obvious answer. I > setup a pipeline, grid searched, then pickled the best model to use for > future predictions. > > This worked well, but I ran into two issues. > *1). I needed a transformer to select individual columns in my pipeline. > *I needed to apply unique transformations to each column in my data, > then recombine with a FeatureUnion. I realized there is not a supported > transformer to extract a specific column within pipelines. See this > issue here as an example > . > I created a transformation that explicitly extracts columns of interest for > use in a pipeline with FeatureUnion. A FunctionTransformer will solve this > issue, but I feel as if sklearn should directly and explicitly support this > functionality. I believe this will make pipelines significantly more > intuitive and accessible for most users. > > *2). One hot encoding requires arrays that are already integers.* You > can find a similar issue here > . > This can be accomplished using Pandas.get_dummies() (where the > transformation cannot be saved to apply to future predictions) or by using > a scikit-learn LabelBinarizer > > transformation. LabelBinarizer is designed to transform y and does not > have a method to pass x and y in a pipeline. This breaks scikit-learn > pipelines. I built a LabelBinarizer transformation that can be used with > FeatureUnion in pipelines. This issue may be moot with the new > CategoricalEncoder > > that is about to be released. > > Does the community believe I should pursue contributing either of these? 
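As a rough sketch of how the ColumnTransformer and CategoricalEncoder mentioned above are intended to fit together: note that CategoricalEncoder only exists in the development version at this point and ColumnTransformer is still an open pull request, so the import locations and argument names below are assumptions based on master and PR #9012 and may change before release.

import pandas as pd
from sklearn.compose import ColumnTransformer         # from the open PR, not yet released
from sklearn.preprocessing import CategoricalEncoder  # development version (master) only
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X = pd.DataFrame({"city": ["London", "Paris", "Paris", "Sallanches"],
                  "age":  [21.0, 35.0, 58.0, 42.0]})
y = [0, 1, 1, 0]

preprocess = ColumnTransformer([
    ("cat", CategoricalEncoder(encoding="onehot"), ["city"]),  # one-hot encode the string column
    ("num", StandardScaler(), ["age"]),                        # scale the numeric column
])

model = Pipeline([("preprocess", preprocess), ("classify", LogisticRegression())])
model.fit(X, y)

Both objects are ordinary estimators, so the whole pipeline can be grid-searched and pickled like any other.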
> > -- > Cheers, > > DJ > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shiduan at ucdavis.edu Tue Feb 20 17:14:49 2018 From: shiduan at ucdavis.edu (Shiheng Duan) Date: Tue, 20 Feb 2018 14:14:49 -0800 Subject: [scikit-learn] KMeans cluster In-Reply-To: References: Message-ID: Yes, but what is used to decide the optimal output? I saw on the document, it is the best output in terms of inertia. What does that mean? Thanks. On Wed, Feb 14, 2018 at 7:46 PM, Joel Nothman wrote: > you can repeatedly use n_init=1? > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Tue Feb 20 17:48:21 2018 From: se.raschka at gmail.com (Sebastian Raschka) Date: Tue, 20 Feb 2018 17:48:21 -0500 Subject: [scikit-learn] KMeans cluster In-Reply-To: References: Message-ID: <24AE67B0-A3B9-461B-BD34-CAFFCEBD9FC0@gmail.com> Inertia simply means the sum of the squared distances from sample points to their cluster centroid. The smaller the inertia, the closer the cluster members are to their cluster centroid (that's also what KMeans optimizes when choosing centroids). In this context, the elbow method may be helpful (https://bl.ocks.org/rpgove/raw/0060ff3b656618e9136b/9aee23cc799d154520572b30443284525dbfcac5/) Maybe also take a look at the silhouette metric for choosing K: http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html Best, Sebastian > On Feb 20, 2018, at 5:14 PM, Shiheng Duan wrote: > > Yes, but what is used to decide the optimal output? I saw on the document, it is the best output in terms of inertia. What does that mean? > Thanks. > > On Wed, Feb 14, 2018 at 7:46 PM, Joel Nothman wrote: > you can repeatedly use n_init=1? > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From l.lomasto at innovationengineering.eu Wed Feb 21 03:13:30 2018 From: l.lomasto at innovationengineering.eu (Luigi Lomasto) Date: Wed, 21 Feb 2018 09:13:30 +0100 Subject: [scikit-learn] Clustering with sparse matrix Message-ID: Hi all, I have a sparse matrix where each row (item) has 160 features. For each of them only three or four features are different by 0. Can I do clustering with this data? I?m thinking to use PCA to reduce dimensionality. Thanks for any answer. Luigi From princegosavi12 at gmail.com Wed Feb 21 05:14:03 2018 From: princegosavi12 at gmail.com (prince gosavi) Date: Wed, 21 Feb 2018 15:44:03 +0530 Subject: [scikit-learn] Getting the indexes of the data points after clustering using Kmeans Message-ID: Hi, I have applied Kmeans clustering using the scikit library from kmeans=KMeans(max_iter=4,n_clusters=10,n_init=10).fit(euclidean_dist) After applying the algorithm.I would like to get the data points in the clusters so as to further use them to apply a model. Example: kmeans.cluster_centers_[1] gives me distance array of all the data points. 
Is there any way around this available in scikit so as to get the data points id/index. Regards -------------- next part -------------- An HTML attachment was scrubbed... URL: From christian.braune79 at gmail.com Wed Feb 21 05:58:19 2018 From: christian.braune79 at gmail.com (Christian Braune) Date: Wed, 21 Feb 2018 10:58:19 +0000 Subject: [scikit-learn] Getting the indexes of the data points after clustering using Kmeans In-Reply-To: References: Message-ID: Hi, if you have your original points stored in a numpy array, you can get all points from a cluster i by doing the following: cluster_points = points[kmeans.labels_ == i] "kmeans.labels_" contains a list labels for each point. "kmeans.labels_ == i" creates a mask that selects only those points that belong to cluster i and the whole line then gives you the points, finally. BTW: the fit method has the raw points as input parameter, not the distance matrix. Regards, Christian prince gosavi schrieb am Mi., 21. Feb. 2018 um 11:16 Uhr: > Hi, > I have applied Kmeans clustering using the scikit library from > > kmeans=KMeans(max_iter=4,n_clusters=10,n_init=10).fit(euclidean_dist) > > After applying the algorithm.I would like to get the data points in the > clusters so as to further use them to apply a model. > > Example: > kmeans.cluster_centers_[1] > > gives me distance array of all the data points. > > Is there any way around this available in scikit so as to get the data > points id/index. > > Regards > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From princegosavi12 at gmail.com Wed Feb 21 09:22:39 2018 From: princegosavi12 at gmail.com (prince gosavi) Date: Wed, 21 Feb 2018 19:52:39 +0530 Subject: [scikit-learn] Getting the indexes of the data points after clustering using Kmeans In-Reply-To: References: Message-ID: Hi, Thanks for your hint It just saved my day. Regards, Rajkumar On Wed, Feb 21, 2018 at 4:28 PM, Christian Braune < christian.braune79 at gmail.com> wrote: > Hi, > > if you have your original points stored in a numpy array, you can get all > points from a cluster i by doing the following: > > cluster_points = points[kmeans.labels_ == i] > > "kmeans.labels_" contains a list labels for each point. > "kmeans.labels_ == i" creates a mask that selects only those points that > belong to cluster i > and the whole line then gives you the points, finally. > > BTW: the fit method has the raw points as input parameter, not the > distance matrix. > > Regards, > Christian > > prince gosavi schrieb am Mi., 21. Feb. 2018 um > 11:16 Uhr: > >> Hi, >> I have applied Kmeans clustering using the scikit library from >> >> kmeans=KMeans(max_iter=4,n_clusters=10,n_init=10).fit(euclidean_dist) >> >> After applying the algorithm.I would like to get the data points in the >> clusters so as to further use them to apply a model. >> >> Example: >> kmeans.cluster_centers_[1] >> >> gives me distance array of all the data points. >> >> Is there any way around this available in scikit so as to get the data >> points id/index. 
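A small sketch extending the masking approach above to recover the original row indices as well as the points themselves (the array names are illustrative):

import numpy as np
from sklearn.cluster import KMeans

points = np.random.rand(500, 8)                      # the raw feature matrix passed to fit
kmeans = KMeans(n_clusters=10, n_init=10).fit(points)

for i in range(kmeans.n_clusters):
    idx = np.where(kmeans.labels_ == i)[0]           # row indices of the members of cluster i
    cluster_points = points[idx]                     # the members themselves
    print(i, idx[:5], cluster_points.shape)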
>> >> Regards >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Regards -------------- next part -------------- An HTML attachment was scrubbed... URL: From krish.pvkrishna at gmail.com Thu Feb 22 12:08:51 2018 From: krish.pvkrishna at gmail.com (Krishna Pillutla) Date: Thu, 22 Feb 2018 09:08:51 -0800 Subject: [scikit-learn] Implement Catalyst-SVRG optimizer for linear models? Message-ID: Hello all, *TL;DR*: I'd like to implement Catalyst-SVRG , an accelerated optimization algorithm for sklearn (or scikit-learn-contrib/lightning, if it is more appropriate). Any feedback? *Long version*: I've been playing around with Catalyst-SVRG , an accelerated stochastic variance reduced optimization algorithm for my research. I've found in my experience and in the experiments section of the attached paper that this algorithm does lead to faster optimization than vanilla (un-accerelated) SVRG , which itself is much faster than SGD and on roughly the same footing as SAG/SAGA. Moreover, the per-iteration computational complexity of this algorithm practically matches that of SVRG. I was wondering whether it would be beneficial to the community if I implemented this algorithm in sklearn/linear_model or perhaps in scikit-learn-contrib/lightning. I would love to hear your thoughts on this. Cheers, Krishna -------------- next part -------------- An HTML attachment was scrubbed... URL: From sbairedd at ucsd.edu Thu Feb 22 20:24:36 2018 From: sbairedd at ucsd.edu (Sirin Baireddy) Date: Thu, 22 Feb 2018 17:24:36 -0800 Subject: [scikit-learn] Google Summer of Code Project- Implementing R Package of SAGA algorithm Message-ID: Hi, I'm attempting to work on a project for google summer of code to create an R package for the SAGA algorithm this summer. I was hoping I could get one of the developers of the algorithm to mentor me this summer while I work on it. Thanks, Sirin Baireddy From g.lemaitre58 at gmail.com Fri Feb 23 08:33:55 2018 From: g.lemaitre58 at gmail.com (Guillaume Lemaitre) Date: Fri, 23 Feb 2018 14:33:55 +0100 Subject: [scikit-learn] Google Summer of Code Project- Implementing R Package of SAGA algorithm In-Reply-To: References: Message-ID: <20180223133355.5115986.45704.49531@gmail.com> Hi Sirin, ? You should probably contact some machine learning in R language regarding the supervision. The core dev in scikit learn are more dedicated to python. Also I think that the decision was taken that scikit learn will not take part to the GSoC.? Cheers Guillaume?Lemaitre? INRIA?Saclay?Ile-de-France?/?Equipe?PARIETAL guillaume.lemaitre at inria.fr?-?https://glemaitre.github.io/ ? Original Message ? From: Sirin Baireddy Sent: Friday, 23 February 2018 02:26 To: scikit-learn at python.org Reply To: Scikit-learn mailing list Subject: [scikit-learn] Google Summer of Code Project- Implementing R Package of SAGA algorithm Hi, I'm attempting to work on a project for google summer of code to create an R package for the SAGA algorithm this summer. I was hoping I could get one of the developers of the algorithm to mentor me this summer while I work on it. 
Thanks, Sirin Baireddy _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn From sbairedd at ucsd.edu Fri Feb 23 17:32:19 2018 From: sbairedd at ucsd.edu (Sirin Baireddy) Date: Fri, 23 Feb 2018 14:32:19 -0800 Subject: [scikit-learn] Google Summer of Code Project- Implementing R Package of SAGA algorithm In-Reply-To: <20180223133355.5115986.45704.49531@gmail.com> References: <20180223133355.5115986.45704.49531@gmail.com> Message-ID: Alright, thanks for the advice. On Fri, Feb 23, 2018 at 5:33 AM, Guillaume Lemaitre wrote: > Hi Sirin, > > You should probably contact some machine learning in R language regarding the supervision. The core dev in scikit learn are more dedicated to python. Also I think that the decision was taken that scikit learn will not take part to the GSoC. > > Cheers > > Guillaume Lemaitre > INRIA Saclay Ile-de-France / Equipe PARIETAL > guillaume.lemaitre at inria.fr - https://glemaitre.github.io/ > Original Message > From: Sirin Baireddy > Sent: Friday, 23 February 2018 02:26 > To: scikit-learn at python.org > Reply To: Scikit-learn mailing list > Subject: [scikit-learn] Google Summer of Code Project- Implementing R Package of SAGA algorithm > > Hi, > I'm attempting to work on a project for google summer of code to > create an R package for the SAGA algorithm this summer. I was hoping I > could get one of the developers of the algorithm to mentor me this > summer while I work on it. > Thanks, > Sirin Baireddy > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From david.mo.burns at gmail.com Tue Feb 27 12:02:27 2018 From: david.mo.burns at gmail.com (David Burns) Date: Tue, 27 Feb 2018 12:02:27 -0500 Subject: [scikit-learn] New Transformer Message-ID: <726f2e70-63eb-783f-b470-5ea45af930e5@gmail.com> First post on this mailing list. I have been working with time series data for a project, and thought I could contribute a new transformer to segment time series data using a sliding window, with variable overlap. I have attached demonstration of how this would fit in the existing framework. The only challenge for me here is that the transformer needs to transform both the X and y variable in order to perform the segmentation. I am not sure from the documentation how to implement this in the framework. Overlapping segments is a great way to boost performance for time series classifiers, so this may be a worthwhile contribution for some in this area of ML. Ultimately, model_selection.TimeSeries.Split would need to be modified to support overlapping segments, or a new class created to enable validation for this. Please let me know if this would be a worthwhile contribution, and if so how to go about transforming the target vector y in the framework / pipeline? Thanks! David Burns -------------- next part -------------- A non-text attachment was scrubbed... 
Name: TimeSeriesSegment.py Type: text/x-python Size: 3336 bytes Desc: not available URL: From g.lemaitre58 at gmail.com Tue Feb 27 13:42:52 2018 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Tue, 27 Feb 2018 19:42:52 +0100 Subject: [scikit-learn] New Transformer In-Reply-To: <726f2e70-63eb-783f-b470-5ea45af930e5@gmail.com> References: <726f2e70-63eb-783f-b470-5ea45af930e5@gmail.com> Message-ID: Transforming y is a big deal :) You can refer to https://github.com/scikit-learn/enhancement_proposals/pull/2 and the associated issues/PR to see what is going on. This is probably an additional use case to think about when designing estimator which will be modifying y. Regarding the pipeline, I assume that your strategy would be to resample at fit and do nothing at predict, isn't it? NB: you could actually implement this sampling in a FunctionSampler of imblearn: http://contrib.scikit-learn.org/imbalanced-learn/dev/generated/imblearn.FunctionSampler.html#imblearn.FunctionSampler and then use the imblearn pipeline which would apply the transform at fit time but not at predict. On 27 February 2018 at 18:02, David Burns wrote: > First post on this mailing list. > > I have been working with time series data for a project, and thought I > could contribute a new transformer to segment time series data using a > sliding window, with variable overlap. I have attached demonstration of how > this would fit in the existing framework. The only challenge for me here is > that the transformer needs to transform both the X and y variable in order > to perform the segmentation. I am not sure from the documentation how to > implement this in the framework. > > Overlapping segments is a great way to boost performance for time series > classifiers, so this may be a worthwhile contribution for some in this area > of ML. Ultimately, model_selection.TimeSeries.Split would need to be > modified to support overlapping segments, or a new class created to enable > validation for this. > > Please let me know if this would be a worthwhile contribution, and if so > how to go about transforming the target vector y in the framework / > pipeline? > > Thanks! > > David Burns > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Guillaume Lemaitre INRIA Saclay - Parietal team Center for Data Science Paris-Saclay https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From manuel.castejon at gmail.com Wed Feb 28 08:28:20 2018 From: manuel.castejon at gmail.com (=?UTF-8?Q?Manuel_Castej=C3=B3n_Limas?=) Date: Wed, 28 Feb 2018 14:28:20 +0100 Subject: [scikit-learn] New Transformer In-Reply-To: References: <726f2e70-63eb-783f-b470-5ea45af930e5@gmail.com> Message-ID: Dear David, We recently submitted PipeGraph as a sklearn contrib project. Even though it is an ongoing project and we are right now modifying the interface in order to make it more suitable and useful for the sklearn community, I believe that the problems that you explain can be addressed by PipeGraph. If you need the possibility of defining different/equal transformations for X and y you can do it by simply defining different steps for each path; if you need different paths for fit and predict it is also possible to define them in PipeGraph. 
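To make the FunctionSampler suggestion above concrete, here is a rough sketch along those lines; the segment function, window length, overlap, and data are all illustrative, and each window is flattened so the result stays a 2-D array:

import numpy as np
from imblearn import FunctionSampler
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

def segment(X, y, width=100, overlap=0.5):
    # cut a time-ordered (n_samples, n_features) array into overlapping windows;
    # each window takes the label of its last sample
    step = int(width * (1 - overlap))
    starts = range(0, X.shape[0] - width + 1, step)
    X_seg = np.stack([X[s:s + width].ravel() for s in starts])
    y_seg = np.asarray([y[s + width - 1] for s in starts])
    return X_seg, y_seg

X_train = np.random.randn(1000, 3)                   # one multivariate time series
y_train = (np.random.rand(1000) > 0.5).astype(int)   # per-sample labels

pipe = Pipeline([
    ("segment", FunctionSampler(func=segment, kw_args={"width": 100, "overlap": 0.5})),
    ("classify", RandomForestClassifier(n_estimators=50)),
])
pipe.fit(X_train, y_train)
# the sampler only runs during fit; at predict time it is skipped, so the input
# to predict must already be segmented and flattened in the same way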
Please have a look at the general examples and judge by yourself if it fits your needs: https://mcasl.github.io/PipeGraph/auto_examples/plot_4_example_combination_of_classifiers.html#sphx-glr-auto-examples-plot-4-example-combination-of-classifiers-py You can play with it using pip, for example: pip install pipegraph The API can be considered far from stable and we are following the advice of the sklearn community to turn it into something as useful as possible, but it is my humble opinion that in situations like this PipeGraph can provide a suitable solution. Best Manolo Best regards 2018-02-27 19:42 GMT+01:00 Guillaume Lema?tre : > Transforming y is a big deal :) > You can refer to https://github.com/scikit-learn/enhancement_proposals/ > pull/2 > and the associated issues/PR to see what is going on. This is probably an > additional use case to think about when designing estimator which will be > modifying y. > > Regarding the pipeline, I assume that your strategy would be to resample > at fit > and do nothing at predict, isn't it? > > NB: you could actually implement this sampling in a FunctionSampler of > imblearn: > http://contrib.scikit-learn.org/imbalanced-learn/dev/generated/imblearn. > FunctionSampler.html#imblearn.FunctionSampler > and then use the imblearn pipeline which would apply the transform at fit > time but not > at predict. > > On 27 February 2018 at 18:02, David Burns > wrote: > >> First post on this mailing list. >> >> I have been working with time series data for a project, and thought I >> could contribute a new transformer to segment time series data using a >> sliding window, with variable overlap. I have attached demonstration of how >> this would fit in the existing framework. The only challenge for me here is >> that the transformer needs to transform both the X and y variable in order >> to perform the segmentation. I am not sure from the documentation how to >> implement this in the framework. >> >> Overlapping segments is a great way to boost performance for time series >> classifiers, so this may be a worthwhile contribution for some in this area >> of ML. Ultimately, model_selection.TimeSeries.Split would need to be >> modified to support overlapping segments, or a new class created to enable >> validation for this. >> >> Please let me know if this would be a worthwhile contribution, and if so >> how to go about transforming the target vector y in the framework / >> pipeline? >> >> Thanks! >> >> David Burns >> >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > Guillaume Lemaitre > INRIA Saclay - Parietal team > Center for Data Science Paris-Saclay > https://glemaitre.github.io/ > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From david.mo.burns at gmail.com Wed Feb 28 11:46:43 2018 From: david.mo.burns at gmail.com (David Burns) Date: Wed, 28 Feb 2018 11:46:43 -0500 Subject: [scikit-learn] New Transformer (Guillaume Lema?tre) In-Reply-To: References: Message-ID: <461b446d-75a0-ef9b-75ce-967aa584d731@gmail.com> Thanks everyone for your suggested. I will have a look at PipeGraph - which might be a suitable option for us as Guillaume suggested. 
If it works out, I will share it.

Thanks,
David