From tevang3 at gmail.com Thu Dec 1 08:01:36 2016
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Thu, 1 Dec 2016 14:01:36 +0100
Subject: [scikit-learn] random forests using grouped data
Message-ID:

Greetings

I have grouped data which are divided into actives and inactives. The
features are two different types of normalized scores (0-1), where the
higher the score, the more probable it is that an observation is an
"active". My data look like this:

Group1:
score1 = [0.56, 0.34, 0.42, 0.12, 0.08, 0.21, ...]
score2 = [
y=[1,1,1,0,0,0, ...]

Group2:
score1 = [0
score2 = [
y=[1,1,1,1,1]

......
Group24:
score1 = [0
score2 = [
y=[1,1,1,1,1]

I searched the documentation for the treatment of grouped data, but the
only thing I found was how to do cross-validation. My question is whether
there is any special algorithm that creates random forests from this type
of grouped data.

thanks in advance
Thomas

-- 
======================================================================
Thomas Evangelidis
Research Specialist
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/1S081,
62500 Brno, Czech Republic

email: tevang at pharm.uoa.gr
       tevang3 at gmail.com

website: https://sites.google.com/site/thomasevangelidishomepage/

From tevang3 at gmail.com Thu Dec 1 08:05:45 2016
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Thu, 1 Dec 2016 14:05:45 +0100
Subject: [scikit-learn] random forests using grouped data
In-Reply-To:
References:
Message-ID:

Sorry, the previous email was incomplete. Below is what the grouped data
look like:

Group1:
score1 = [0.56, 0.34, 0.42, 0.12, 0.08, 0.21, ...]
score2 = [0.34, 0.27, 0.24, 0.05, 0.13, 0.14, ...]
y=[1,1,1,0,0,0, ...]  # 1 indicates "active" and 0 "inactive"

Group2:
score1 = [0.34, 0.38, 0.48, 0.18, 0.12, 0.19, ...]
score2 = [0.28, 0.41, 0.34, 0.13, 0.09, 0.1, ...]
y=[1,1,1,0,0,0, ...]  # 1 indicates "active" and 0 "inactive"

......
Group24:
score1 = [0.67, 0.54, 0.59, 0.23, 0.24, 0.08, ...]
score2 = [0.41, 0.31, 0.28, 0.23, 0.18, 0.22, ...]
y=[1,1,1,0,0,0, ...]  # 1 indicates "active" and 0 "inactive"

From jbbrown at kuhp.kyoto-u.ac.jp Thu Dec 1 08:16:54 2016
From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.)
Date: Thu, 1 Dec 2016 22:16:54 +0900
Subject: [scikit-learn] random forests using grouped data
In-Reply-To:
References:
Message-ID:

Hello Thomas,

I don't personally know of any algorithm that works on collections of
groupings, but why not first test a simple control model? That is, can
you achieve a satisfactory model by simply concatenating all 48 scores
per sample and building a forest the standard way?
If not, what context or reasons dictate that the groupings need to be
retained as you have presented them?

Hope this helps,
J.B.

From zephyr14 at gmail.com Thu Dec 1 09:04:55 2016
From: zephyr14 at gmail.com (Vlad Niculae)
Date: Thu, 1 Dec 2016 09:04:55 -0500
Subject: [scikit-learn] random forests using grouped data
In-Reply-To:
References:
Message-ID:

I don't think there are any such estimators in scikit-learn directly,
but the model selection machinery is there to help. Check out
GroupKFold [1] so you can do cross-validation after concatenating all
the samples, while ensuring that training and validation groups are
separate.

The setup of this problem looks a lot like query results reranking in
information retrieval, where you need to find relevant and
non-relevant results among the set of retrieved docs for each search
query.
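
A minimal sketch of the GroupKFold suggestion above, under stated
assumptions: all groups are concatenated into one feature matrix with a
parallel array of group labels, and the data below are synthetic
placeholders rather than the scores from this thread. The names X, y and
groups, the group sizes, and the forest settings are illustrative only.
It shows a standard random forest being cross-validated so that no group
contributes samples to both the training and the validation side of a
split.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.RandomState(0)
n_groups, n_per_group = 24, 20  # assumed sizes, for illustration only

# Synthetic stand-in for the real data: per group, half actives (1) and
# half inactives (0), with two score features in [0, 1] that tend to be
# higher for actives.
y = np.tile(np.repeat([1, 0], n_per_group // 2), n_groups)
groups = np.repeat(np.arange(n_groups), n_per_group)
X = np.clip(0.3 * y[:, None] + rng.uniform(0.0, 0.6, size=(y.size, 2)), 0.0, 1.0)

# GroupKFold keeps every sample of a given group on the same side of a split.
cv = GroupKFold(n_splits=4)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, groups=groups, cv=cv, scoring="roc_auc")

print("ROC AUC per fold:", scores)
print("mean ROC AUC:", scores.mean())
```

With real data, X would hold the two score columns for all samples and
groups the group index of each sample; the rest stays the same.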
A simple approach you can build using scikit-learn tools is RankSVM,
where you take, within each group, all possible pairs between a
positive and a negative sample, and take the difference of their
features as your input. This is the same as optimizing within-group
AUC. Unfortunately the trick doesn't work in the same way for
nonlinear models, but it's another baseline you could try. Fabian had
an example of this, with some VERY enlightening illustrations, here
[2].

HTH,
Vlad

[1] http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupKFold.html
[2] https://github.com/fabianp/minirank/blob/master/notebooks/pairwise_transform.ipynb

From deliprao at gmail.com Thu Dec 1 22:33:33 2016
From: deliprao at gmail.com (Delip Rao)
Date: Fri, 02 Dec 2016 03:33:33 +0000
Subject: [scikit-learn] [semi-supervised learning] Using a pre-existing graph with LabelSpreading API
Message-ID:

Hello,

I have an existing graph dataset in the edge format:

node_i node_j weight

The number of nodes is around 3.6M, and the number of edges is around
72M. I also have some labeled data (around a dozen per class with 16
classes in total), so overall, a perfect setting for label propagation
or its variants. In particular, I want to try the LabelSpreading
implementation for the regularization. I looked at the documentation
and can't find a way to plug in a pre-computed graph (or adjacency
matrix). So two questions:

1. What are any scaling issues I should be aware of for a dataset of
this size? I can try sparsifying the graph, but would love to learn
about any knobs I should be aware of.
2. How do I plug in an existing weighted graph with the current API?
Happy to use any undocumented features.

Thanks in advance!
Delip

From t3kcit at gmail.com Fri Dec 2 19:52:09 2016
From: t3kcit at gmail.com (Andy)
Date: Fri, 2 Dec 2016 19:52:09 -0500
Subject: [scikit-learn] Github project management tools
In-Reply-To:
References: <41f0eb1c-c877-0c4d-0f56-9485f57c0eae@gmail.com>
Message-ID:

So did we ever decide on how to prioritize reviews?
(I was still mentally / notification catching up after 0.18.1)

There are some really important issues to tackle, often with proposed
solutions, but no reviews!
It's hard for everybody to keep the big picture in mind with such a full
issue tracker.
I think it might be helpful if Joel and I prioritize issues.
Obviously that will only make sense if the other team members check up
on it when deciding what to review / work on.

Do we want to try to seriously use the project feature?
https://github.com/scikit-learn/scikit-learn/projects/5

On my monitor I can fit four columns and the "add cards" tab.
I tried using five columns (separating in-progress and stalled PRs) but
then I couldn't access the right-most column when the "add cards" tab
was open.
The whole interface is a bit awkward but maybe the best we have (for
example, moving something from the bottom to the top is easiest by
moving it to a different column, then scrolling up, then moving it
back).

wdyt?
Andy

On 09/29/2016 11:05 PM, Joel Nothman wrote:
> The spreadsheet seems to have some duplications and presumably some
> missing rows, with apologies. I assume some is due to the github
> pagination, and some may be my error. Not a big enough error to fix up.
>
> On 30 September 2016 at 05:15, Raphael C wrote:
>
> My apologies I see it is in the spreadsheet. It would be great to see
> this work finished for 0.19 if at all possible IMHO.
>
> Raphael
>
> On 29 September 2016 at 20:12, Raphael C wrote:
> > I hope this isn't out of place but I notice that
> > https://github.com/scikit-learn/scikit-learn/pull/4899 is not in the
> > list. It seems like a very worthwhile addition and the PR appears
> > stalled at present.
> >
> > Raphael
> >
> > On 29 September 2016 at 15:05, Joel Nothman wrote:
> >> I agree that being able to identify which PRs are stalled on the
> >> contributor's part, which on reviewers' part, and since when, would be
> >> great. I'm not sure we've come up with a way that'll work though.
> >>
> >> In terms of backlog, I've wondered if just getting things into a
> >> spreadsheet would help:
> >>
> >> https://docs.google.com/spreadsheets/d/1LdzNxQbn7A0Ao8zlUBgnvT42929JpAe9958YxKCubjE/edit
> >>
> >> What other features of an Issue / PR would be useful to
> >> sort/filter/pivottable on in a spreadsheet form like this?
> >>
> >> (It would be extra nice if we could modify titles and labels within the
> >> spreadsheet and have them update via the GitHub API, but I'm not sure
> >> I'll get around to making that feature :P)
> >>
> >> On 29 September 2016 at 23:45, Andreas Mueller wrote:
> >>>
> >>> So I made a project for 0.19:
> >>>
> >>> https://github.com/scikit-learn/scikit-learn/projects/5
> >>>
> >>> The idea would be to drag and drop issues and PRs so that the
> >>> important ones are at the top.
> >>> We could also add an "important" column, currently the scrolling is
> >>> pretty annoying.
> >>> Thoughts?
> >>>
> >>> On 09/28/2016 03:29 PM, Nelle Varoquaux wrote:
> >>>>
> >>>> On 28 September 2016 at 12:24, Andreas Mueller wrote:
> >>>>>
> >>>>> On 09/28/2016 02:21 PM, Nelle Varoquaux wrote:
> >>>>>>
> >>>>>> I think the only ones worth having are the ones that can be
> >>>>>> dealt with automatically and the ones that will not be used
> >>>>>> frequently:
> >>>>>>
> >>>>>> - stalled after 30 days of inactivity [can be done automatically]
> >>>>>> - in dispute [I don't expect it to be used often].
> >>>>>
> >>>>> I think "in dispute" is actually one of the most common statuses
> >>>>> among PRs.
> >>>>> Or maybe I have a skewed picture of things.
> >>>>> Many PRs stalled because it is not clear whether the proposed
> >>>>> solution is a good one.
> >>>>
> >>>> On the stalled one, sure, but there are a lot of PRs being merged
> >>>> fairly quickly. So over all, I think it is quite rare. No?
> >>>>
> >>>>> It would be great to have some way to get through the backlog of
> >>>>> 400 PRs and I think tagging them might be useful.
> >>>>> We rarely reject PRs, we could also revisit that policy.
> >>>>>
> >>>>> For the backlog, it's pretty unclear to me how many are waiting
> >>>>> for reviews, how many are waiting for changes, and how many are
> >>>>> disputed.
> >>>>> Tagging these might help people who want to review to find things
> >>>>> to review, and people who want to code to pick up stalled PRs.
> >>>>
> >>>> That sounds like a great use of labels, though all of these need to
> >>>> be tagged manually.
> >>>>
> >>>>> _______________________________________________
> >>>>> scikit-learn mailing list
> >>>>> scikit-learn at python.org
> >>>>> https://mail.python.org/mailman/listinfo/scikit-learn
URL: From nelle.varoquaux at gmail.com Fri Dec 2 20:04:10 2016 From: nelle.varoquaux at gmail.com (Nelle Varoquaux) Date: Fri, 2 Dec 2016 17:04:10 -0800 Subject: [scikit-learn] Github project management tools In-Reply-To: References: <41f0eb1c-c877-0c4d-0f56-9485f57c0eae@gmail.com> Message-ID: Hello, This seems a good moment to say that we will be starting a project at BIDS next semester to try extract information from github and classify PRs into different categories (stalled, updated, needs review). St?fan drafted a list of elements he would like to see for scikit-image, and I have been wanting something similar for matplotlib. I've got my hands full right now, but we are more than open to discuss with the wider community to see if such a tool would be useful and what features is of interest. Here are some examples of elements we'd like to be able to identify and sort: - Most active pull requests ?hot topics? - The one where "I" have commented on. - PRs that haven?t seen any discussion. - Stalled PRs. - New issues without any comments. - See the old PRs that could be merged - Recently merged PR referring to a ticket but haven?t closed that ticket. - Duplicate PR (closing the same ticket). - Tickets that being referred to many times. - Unmergeable PRs (that need to be rebased). - PRs that passed the majority of tests. - Issues that external projects refer too. Do you think something like this could be interesting for sklearn? Also, if you have scripts that similar things and that you would be willing to share, we would be very happy to see what exists already out there. Cheers, N On 2 December 2016 at 16:52, Andy wrote: > So did we ever decide on how to prioritize reviews? > (I was still mentally / notification catching up after 0.18.1) > > There are some really important issues to tackle, often with proposed > solutions, not no reviews! > It's hard for everybody to keep the big picture in mind with such a full > issue tracker. 
> I think it might be helpful if Joel and me prioritize issues. Obviously that > will only make > sense if the other team members check up on it when deciding what to review > / work on. > > Do we want to try to seriously use the project feature? > https://github.com/scikit-learn/scikit-learn/projects/5 > > On my monitor I can fit four columns and the "add cards" tab. > I tried using five columns (separating in-progress and stalled PRs) but then > I could access the right-most column when > the "add cards" was open. > The whole interface is a bit awkward but maybe the best we have (for example > moving something from the bottom > to the top is easiest by moving it to a different column, then scrolling up, > then moving it back) > > wdyt? > Andy > > > > On 09/29/2016 11:05 PM, Joel Nothman wrote: > > The spreadsheet seems to have some duplications and presumably some missing > rows, with apologies. I assume some is due to the github pagination, and > some may be my error. Not a big enough error to fix up. > > On 30 September 2016 at 05:15, Raphael C wrote: >> >> My apologies I see it is in the spreadsheet. It would be great to see >> this work finished for 0.19 if at all possible IMHO. >> >> Raphael >> >> On 29 September 2016 at 20:12, Raphael C wrote: >> > I hope this isn't out of place but I notice that >> > https://github.com/scikit-learn/scikit-learn/pull/4899 is not in the >> > list. It seems like a very worthwhile addition and the PR appears >> > stalled at present. >> > >> > Raphael >> > >> > On 29 September 2016 at 15:05, Joel Nothman >> > wrote: >> >> I agree that being able to identify which PRs are stalled on the >> >> contributor's part, which on reviewers' part, and since when, would be >> >> great. I'm not sure we've come up with a way that'll work though. 
>> >> >> >> In terms of backlog, I've wondered if just getting things into a >> >> spreadsheet >> >> would help: >> >> >> >> >> >> https://docs.google.com/spreadsheets/d/1LdzNxQbn7A0Ao8zlUBgnvT42929JpAe9958YxKCubjE/edit >> >> >> >> What other features of an Issue / PR would be useful to >> >> sort/filter/pivottable on in a spreadsheet form like this? >> >> >> >> (It would be extra nice if we could modify titles and labels within the >> >> spreadsheet and have them update via the GitHub API, but I'm not sure >> >> I'll >> >> get around to making that feature :P) >> >> >> >> >> >> On 29 September 2016 at 23:45, Andreas Mueller >> >> wrote: >> >>> >> >>> So I made a project for 0.19: >> >>> >> >>> https://github.com/scikit-learn/scikit-learn/projects/5 >> >>> >> >>> The idea would be to drag and drop issues and PRs so that the >> >>> important >> >>> ones are at the top. >> >>> We could also add an "important" column, currently the scrolling is >> >>> pretty >> >>> annoying. >> >>> Thoughts? >> >>> >> >>> >> >>> >> >>> >> >>> On 09/28/2016 03:29 PM, Nelle Varoquaux wrote: >> >>>> >> >>>> On 28 September 2016 at 12:24, Andreas Mueller >> >>>> wrote: >> >>>>> >> >>>>> >> >>>>> On 09/28/2016 02:21 PM, Nelle Varoquaux wrote: >> >>>>>> >> >>>>>> >> >>>>>> I think the only ones worth having are the ones that can be dealt >> >>>>>> with >> >>>>>> automatically and the ones that will not be used frequently: >> >>>>>> >> >>>>>> - stalled after 30 days of inactivity [can be done automatically] >> >>>>>> - in dispute [I don't expect it to be used often]. >> >>>>> >> >>>>> I think "in dispute" is actually one of the most common statuses >> >>>>> among >> >>>>> PRs. >> >>>>> Or maybe I have a skewed picture of things. >> >>>>> Many PRs stalled because it is not clear whether the proposed >> >>>>> solution >> >>>>> is a >> >>>>> good one. >> >>>> >> >>>> On the stalled one, sure, but there are a lot of PRs being merged >> >>>> fairly quickly. So over all, I think it is quite rare. 
No? >> >>>> >> >>>>> It would be great to have some way to get through the backlog of 400 >> >>>>> PRs >> >>>>> and >> >>>>> I think tagging them might be useful. >> >>>>> We rarely reject PRs, we could also revisit that policy. >> >>>>> >> >>>>> For the backlog, it's pretty unclear to me how many are waiting for >> >>>>> reviews, >> >>>>> how many are waiting for changes, >> >>>>> and how many are disputed. >> >>>>> Tagging these might help people who want to review to find things to >> >>>>> review, >> >>>>> and people who want to code to pick >> >>>>> up stalled PRs. >> >>>> >> >>>> That sounds like a great use of labels, thought all of these need to >> >>>> be tagged manually. >> >>>> >> >>>>> _______________________________________________ >> >>>>> scikit-learn mailing list >> >>>>> scikit-learn at python.org >> >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >> >>>> >> >>>> _______________________________________________ >> >>>> scikit-learn mailing list >> >>>> scikit-learn at python.org >> >>>> https://mail.python.org/mailman/listinfo/scikit-learn >> >>> >> >>> >> >>> _______________________________________________ >> >>> scikit-learn mailing list >> >>> scikit-learn at python.org >> >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> >> >> >> >> _______________________________________________ >> >> scikit-learn mailing list >> >> scikit-learn at python.org >> >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > From 
t3kcit at gmail.com Fri Dec 2 21:28:14 2016 From: t3kcit at gmail.com (Andy) Date: Fri, 2 Dec 2016 21:28:14 -0500 Subject: [scikit-learn] Github project management tools In-Reply-To: References: <41f0eb1c-c877-0c4d-0f56-9485f57c0eae@gmail.com> Message-ID: Hey Nelle. That sounds great. My main question is how you'd expose this to the user. Will it be a separate website? A bot? Emails? Greasemonkey on top of github? Most of these could be implemented with tags that are automatically assigned by a bot, I guess. That would be quite a few tags, though, and wouldn't work well for filtering the ones I was active in. Tickets that are being referred to many times also sound more like a sorting of issues, not a tag. And some of these are more of a "notification type", like "this project has referred to this issue" is maybe something that I want to be made aware of, say by a comment on the issue (which triggers an email) or a direct email to me. Similarly I might be notified if someone forgot to close the ticket for a PR (so I can go and check whether to close it). I might want to be notified if any of my PRs become "unmergable". A comment by a bot would alert everybody though, and an email to me only me. The "PRs that haven't seen any discussion" is actually implemented in github by sorting by comments, and I recently used that. Also happy to (try to find time to) contribute code or discuss the project with you guys! To summarize, I think there are some low-hanging fruit for automatic tagging and for sending emails with notifications, and possibly for bots commenting. I expect that doing anything that involves sorting (a subset of) issues probably requires much more effort. Andy On 12/02/2016 08:04 PM, Nelle Varoquaux wrote: > Hello, > > This seems a good moment to say that we will be starting a project at > BIDS next semester to try extract information from github and classify > PRs into different categories (stalled, updated, needs review). 
> St?fan drafted a list of elements he would like to see for > scikit-image, and I have been wanting something similar for > matplotlib. > I've got my hands full right now, but we are more than open to discuss > with the wider community to see if such a tool would be useful and > what features is of interest. > > Here are some examples of elements we'd like to be able to identify and sort: > > - Most active pull requests ?hot topics? > - The one where "I" have commented on. > - PRs that haven?t seen any discussion. > - Stalled PRs. > - New issues without any comments. > - See the old PRs that could be merged > - Recently merged PR referring to a ticket but haven?t closed that ticket. > - Duplicate PR (closing the same ticket). > - Tickets that being referred to many times. > - Unmergeable PRs (that need to be rebased). > - PRs that passed the majority of tests. > - Issues that external projects refer too. > > Do you think something like this could be interesting for sklearn? > Also, if you have scripts that similar things and that you would be > willing to share, we would be very happy to see what exists already > out there. > > Cheers, > N > > On 2 December 2016 at 16:52, Andy wrote: >> So did we ever decide on how to prioritize reviews? >> (I was still mentally / notification catching up after 0.18.1) >> >> There are some really important issues to tackle, often with proposed >> solutions, not no reviews! >> It's hard for everybody to keep the big picture in mind with such a full >> issue tracker. >> I think it might be helpful if Joel and me prioritize issues. Obviously that >> will only make >> sense if the other team members check up on it when deciding what to review >> / work on. >> >> Do we want to try to seriously use the project feature? >> https://github.com/scikit-learn/scikit-learn/projects/5 >> >> On my monitor I can fit four columns and the "add cards" tab. 
>> I tried using five columns (separating in-progress and stalled PRs) but then >> I could access the right-most column when >> the "add cards" was open. >> The whole interface is a bit awkward but maybe the best we have (for example >> moving something from the bottom >> to the top is easiest by moving it to a different column, then scrolling up, >> then moving it back) >> >> wdyt? >> Andy >> >> >> >> On 09/29/2016 11:05 PM, Joel Nothman wrote: >> >> The spreadsheet seems to have some duplications and presumably some missing >> rows, with apologies. I assume some is due to the github pagination, and >> some may be my error. Not a big enough error to fix up. >> >> On 30 September 2016 at 05:15, Raphael C wrote: >>> My apologies I see it is in the spreadsheet. It would be great to see >>> this work finished for 0.19 if at all possible IMHO. >>> >>> Raphael >>> >>> On 29 September 2016 at 20:12, Raphael C wrote: >>>> I hope this isn't out of place but I notice that >>>> https://github.com/scikit-learn/scikit-learn/pull/4899 is not in the >>>> list. It seems like a very worthwhile addition and the PR appears >>>> stalled at present. >>>> >>>> Raphael >>>> >>>> On 29 September 2016 at 15:05, Joel Nothman >>>> wrote: >>>>> I agree that being able to identify which PRs are stalled on the >>>>> contributor's part, which on reviewers' part, and since when, would be >>>>> great. I'm not sure we've come up with a way that'll work though. >>>>> >>>>> In terms of backlog, I've wondered if just getting things into a >>>>> spreadsheet >>>>> would help: >>>>> >>>>> >>>>> https://docs.google.com/spreadsheets/d/1LdzNxQbn7A0Ao8zlUBgnvT42929JpAe9958YxKCubjE/edit >>>>> >>>>> What other features of an Issue / PR would be useful to >>>>> sort/filter/pivottable on in a spreadsheet form like this? 
>>>>> >>>>> (It would be extra nice if we could modify titles and labels within the >>>>> spreadsheet and have them update via the GitHub API, but I'm not sure >>>>> I'll >>>>> get around to making that feature :P) >>>>> >>>>> >>>>> On 29 September 2016 at 23:45, Andreas Mueller >>>>> wrote: >>>>>> So I made a project for 0.19: >>>>>> >>>>>> https://github.com/scikit-learn/scikit-learn/projects/5 >>>>>> >>>>>> The idea would be to drag and drop issues and PRs so that the >>>>>> important >>>>>> ones are at the top. >>>>>> We could also add an "important" column, currently the scrolling is >>>>>> pretty >>>>>> annoying. >>>>>> Thoughts? >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On 09/28/2016 03:29 PM, Nelle Varoquaux wrote: >>>>>>> On 28 September 2016 at 12:24, Andreas Mueller >>>>>>> wrote: >>>>>>>> >>>>>>>> On 09/28/2016 02:21 PM, Nelle Varoquaux wrote: >>>>>>>>> >>>>>>>>> I think the only ones worth having are the ones that can be dealt >>>>>>>>> with >>>>>>>>> automatically and the ones that will not be used frequently: >>>>>>>>> >>>>>>>>> - stalled after 30 days of inactivity [can be done automatically] >>>>>>>>> - in dispute [I don't expect it to be used often]. >>>>>>>> I think "in dispute" is actually one of the most common statuses >>>>>>>> among >>>>>>>> PRs. >>>>>>>> Or maybe I have a skewed picture of things. >>>>>>>> Many PRs stalled because it is not clear whether the proposed >>>>>>>> solution >>>>>>>> is a >>>>>>>> good one. >>>>>>> On the stalled one, sure, but there are a lot of PRs being merged >>>>>>> fairly quickly. So over all, I think it is quite rare. No? >>>>>>> >>>>>>>> It would be great to have some way to get through the backlog of 400 >>>>>>>> PRs >>>>>>>> and >>>>>>>> I think tagging them might be useful. >>>>>>>> We rarely reject PRs, we could also revisit that policy. 
>>>>>>>>
>>>>>>>> For the backlog, it's pretty unclear to me how many are waiting for
>>>>>>>> reviews, how many are waiting for changes, and how many are disputed.
>>>>>>>> Tagging these might help people who want to review to find things to
>>>>>>>> review, and people who want to code to pick up stalled PRs.
>>>>>>> That sounds like a great use of labels, though all of these need to
>>>>>>> be tagged manually.
>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> scikit-learn mailing list
>>>>>>>> scikit-learn at python.org
>>>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn

From t3kcit at gmail.com Fri Dec 2 21:34:39 2016
From: t3kcit at gmail.com (Andy)
Date: Fri, 2 Dec 2016 21:34:39 -0500
Subject: [scikit-learn] Github project management tools
In-Reply-To:
References: <41f0eb1c-c877-0c4d-0f56-9485f57c0eae@gmail.com>
Message-ID: <27deb6f1-03fe-acf5-c549-67fbc1b2f7d1@gmail.com>

Another fun shortcoming of the project interface:
If a card is already present in your project, you cannot search for it
(though you can ctrl+f).

From matteo at mycarta.ca Fri Dec 2 22:28:38 2016
From: matteo at mycarta.ca (Matteo Niccoli)
Date: Fri, 2 Dec 2016 22:28:38 -0500
Subject: [scikit-learn] Trying to get learning curves with custom scorer and leave one group out
Message-ID: <087b462738da5f6bef59c9eac0c7bc08.squirrel@mycarta.ca>

Hi all,

I want to plot learning curves on a trained SVM classifier, using a custom
scorer, and using Leave One Group Out as the method of cross-validation. I
thought I had it figured out, but two different scorers - 'f1_micro' and
'accuracy' - will yield identical values. I am confused, is that supposed to
be the case?

Here's my code (unfortunately I cannot share the data as it is not open):

import numpy as np
import pandas as pd
from sklearn import preprocessing, svm
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import LeaveOneGroupOut, validation_curve

# C is not set here; validation_curve varies it below.
SVC_classifier_LOWO_VC0 = svm.SVC(cache_size=800, class_weight=None,
    coef0=0.0, decision_function_shape=None, degree=3, gamma=0.01,
    kernel='rbf', max_iter=-1, probability=False, random_state=1,
    shrinking=True, tol=0.001, verbose=False)

training_data = pd.read_csv('training_data.csv')
# X is the feature matrix taken from training_data (its construction is not
# shown in this post); it is standardized before fitting.
scaler = preprocessing.StandardScaler().fit(X)
X = scaler.transform(X)
y = training_data['Targets'].values
groups = training_data["Groups"].values

# Custom scorer: micro-averaged F1.
Fscorer = make_scorer(f1_score, average='micro')
logo = LeaveOneGroupOut()
parm_range0 = np.logspace(-2, 6, 9)
train_scores0, test_scores0 = validation_curve(
    SVC_classifier_LOWO_VC0, X, y, param_name="C", param_range=parm_range0,
    cv=logo.split(X, y, groups=groups), scoring=Fscorer)

Now, from:

train_scores_mean0 = np.mean(train_scores0, axis=1)
train_scores_std0 = np.std(train_scores0, axis=1)
test_scores_mean0 = np.mean(test_scores0, axis=1)
test_scores_std0 = np.std(test_scores0, axis=1)
print(test_scores_mean0)
print(np.amax(test_scores_mean0))
print(np.logspace(-2, 6, 9)[test_scores_mean0.argmax(axis=0)])

I get:

[ 0.20257407 0.35551122 0.40791047 0.49887676 0.5021742 0.50030438
 0.49426622 0.48066419 0.4868987 ]
0.502174200206
100.0

If I create a new classifier, but with the same parameters, and run
everything exactly as before, except for the scoring, e.g.:

parm_range1 = np.logspace(-2, 6, 9)
train_scores1, test_scores1 = validation_curve(
    SVC_classifier_LOWO_VC1, X, y, param_name="C", param_range=parm_range1,
    cv=logo.split(X, y, groups=wells), scoring='accuracy')

train_scores_mean1 = np.mean(train_scores1, axis=1)
train_scores_std1 = np.std(train_scores1, axis=1)
test_scores_mean1 = np.mean(test_scores1, axis=1)
test_scores_std1 = np.std(test_scores1, axis=1)
print(test_scores_mean1)
print(np.amax(test_scores_mean1))
print(np.logspace(-2, 6, 9)[test_scores_mean1.argmax(axis=0)])

I get exactly the same answer:

[ 0.20257407 0.35551122 0.40791047 0.49887676 0.5021742 0.50030438
 0.49426622 0.48066419 0.4868987 ]
0.502174200206
100.0

How is that possible? Am I doing something wrong, or missing something?

Thanks

From matteo at mycarta.ca Fri Dec 2 22:40:05 2016
From: matteo at mycarta.ca (Matteo Niccoli)
Date: Fri, 2 Dec 2016 22:40:05 -0500
Subject: [scikit-learn] Trying to get learning curves with custom scorer and leave one group out
In-Reply-To: <087b462738da5f6bef59c9eac0c7bc08.squirrel@mycarta.ca>
References: <087b462738da5f6bef59c9eac0c7bc08.squirrel@mycarta.ca>
Message-ID: <8feb0c3aa67fc63a3754d266e053f2e6.squirrel@mycarta.ca>

My apologies, there was a typo in the code below. The second example should
read:

train_scores1, test_scores1 = validation_curve(
    SVC_classifier_LOWO_VC1, X, y, param_name="C", param_range=parm_range1,
    cv=logo.split(X, y, groups=groups), scoring='accuracy')

Everything else is correct.
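
One likely explanation, offered as a sketch and not as a diagnosis of the
private data above: when every sample has exactly one label, micro-averaged F1
pools true positives, false positives and false negatives over all classes.
Each misclassified sample then counts once as a false positive (for the
predicted class) and once as a false negative (for the true class), so micro
precision, micro recall and micro F1 all reduce to accuracy. The labels below
are made up purely for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up single-label, three-class targets and predictions (illustrative only).
y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1, 0, 2])
y_pred = np.array([0, 2, 2, 1, 1, 0, 0, 1, 2, 2])

acc = accuracy_score(y_true, y_pred)
f1_micro = f1_score(y_true, y_pred, average='micro')  # pooled over classes
f1_macro = f1_score(y_true, y_pred, average='macro')  # mean of per-class F1

print(acc, f1_micro, f1_macro)
```

Under this reading, identical 'f1_micro' and 'accuracy' curves would be
expected for a single-label multiclass target, while average='macro' or
average='weighted' would generally give different numbers.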
On Fri, December 2, 2016 10:28 pm, Matteo Niccoli wrote: > HI all, > > > I want to plot learning curves on a trained SVM classifier, using a > custom scorer, and using Leave One Group Out as the method of > crossvalidation. I thought I had it figured out, but two different scorers > - 'f1_micro' and > 'accuracy' - will yield identical values. I am confused, is that supposed > to be the case? > > Here's my code (unfortunately I cannot share the data as it is not open): > > > from sklearn import svm SVC_classifier_LOWO_VC0 = svm.SVC(cache_size=800, > class_weight=None, coef0=0.0, decision_function_shape=None, degree=3, > gamma=0.01, kernel='rbf', max_iter=-1, probability=False, random_state=1, > shrinking=True, tol=0.001, verbose=False) training_data = > pd.read_csv('training_data.csv') scaler = > preprocessing.StandardScaler().fit(X) X = scaler.transform(X) > y = training_data['Targets'].values groups = training_data["Groups"].values > Fscorer = make_scorer(f1_score, average = 'micro') > logo = LeaveOneGroupOut() parm_range0 = np.logspace(-2, 6, 9) train_scores0, > test_scores0 = validation_curve(SVC_classifier_LOWO_VC0, X, y, "C", > parm_range0, cv =logo.split(X, y, groups=groups), scoring = Fscorer) > > > Now, from: > train_scores_mean0 = np.mean(train_scores0, axis=1) train_scores_std0 = > np.std(train_scores0, axis=1) test_scores_mean0 = np.mean(test_scores0, > axis=1) test_scores_std0 = np.std(test_scores0, axis=1) print > test_scores_mean0 print np.amax(test_scores_mean0) print np.logspace(-2, > 6, 9)[test_scores_mean0.argmax(axis=0)] > > > I get: > [ 0.20257407 0.35551122 0.40791047 0.49887676 0.5021742 0.50030438 > 0.49426622 0.48066419 0.4868987 ] > 0.502174200206 > 100.0 > > > If I create a new classifier, but with the same parameters, and run > everything exactly as before, except for the scoring, e.g.: > > parm_range1 = np.logspace(-2, 6, 9) train_scores1, test_scores1 = > validation_curve(SVC_classifier_LOWO_VC1, X, y, "C", parm_range1, cv > 
=logo.split(X, y, groups=wells), scoring =
> 'accuracy')
> train_scores_mean1 = np.mean(train_scores1, axis=1) train_scores_std1=
> np.std(train_scores1, axis=1) test_scores_mean1 = np.mean(test_scores1,
> axis=1) test_scores_std1 = np.std(test_scores1, axis=1) print
> test_scores_mean1 print np.amax(test_scores_mean1) print np.logspace(-2,
> 6, 9)[test_scores_mean1.argmax(axis=0)]
>
>
> I get exactly the same answer:
> [ 0.20257407 0.35551122 0.40791047 0.49887676 0.5021742 0.50030438
> 0.49426622 0.48066419 0.4868987 ]
> 0.502174200206
> 100.0
>
>
> How is that possible, am I doing something wrong, or missing something?
>
>
> Thanks
>
>
>

From alekhka at gmail.com Sat Dec 3 04:38:00 2016
From: alekhka at gmail.com (Alekh Karkada Ashok)
Date: Sat, 3 Dec 2016 15:08:00 +0530
Subject: [scikit-learn] Fwd: Scikit-learn MLPRegressor Help
In-Reply-To:
References:
Message-ID:

Hi all,

I want to use scikit-learn's MLPRegressor to map image to image. That is, I
have a numpy array of size [1000,2030400] (1000 samples, 76800x3 (RGB)
pixels). I have the corresponding labelled images. Therefore Y is also
[1000,230400]. But according to the documentation:

fit(X, y)
    Fit the model to data matrix X and target y.

    Parameters:
        X : {array-like, sparse matrix}, shape (n_samples, n_features)
            The input data.
        y : array-like, shape (n_samples,)
            The target values.

    Returns:
        self : returns a trained MLP model.

We can see that Y should be a column matrix. Does this mean scikit-learn
doesn't support multiple outputs?
I am getting a MemoryError when I try to fit now.
More: http://stackoverflow.com/questions/40945791/memoryerror-in-scikit-learn

Please help. Thanks!
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From gael.varoquaux at normalesup.org Sat Dec 3 05:29:26 2016
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Sat, 3 Dec 2016 11:29:26 +0100
Subject: [scikit-learn] Fwd: Scikit-learn MLPRegressor Help
In-Reply-To:
References:
Message-ID: <20161203102926.GG455403@phare.normalesup.org>

On Sat, Dec 03, 2016 at 03:08:00PM +0530, Alekh Karkada Ashok wrote:
> I want use the Scikit-learn's MLPRegressor to map image to image. That is I
> have a numpy array of size [1000,2030400] (1000 samples, 76800x3 (RGB) pixels).
> Corresponding labelled images I have. Therefore Y is also [1000,230400]. But
> according to documentation:

1 thousand samples and 2030 thousand features: you are using the wrong
tool; a multi-layer perceptron model will be too complex and will overfit in
this setting. I would suggest a ridge regression.

> We can see that Y should be a column matrix. Does this mean Scikit-learn
> doesn't support multiple outputs?

I believe that this is a documentation error. Could you open an issue
(only on the documentation error)?

> I am getting MemoryError when I try to fit now.
> More: http://stackoverflow.com/questions/40945791/memoryerror-in-scikit-learn

I believe that your problem is too high-dimensional: too many features.

G

From gael.varoquaux at normalesup.org Sat Dec 3 05:52:15 2016
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Sat, 3 Dec 2016 11:52:15 +0100
Subject: [scikit-learn] Github project management tools
In-Reply-To:
References: <41f0eb1c-c877-0c4d-0f56-9485f57c0eae@gmail.com>
Message-ID: <20161203105215.GH455403@phare.normalesup.org>

On Fri, Dec 02, 2016 at 07:52:09PM -0500, Andy wrote:
> So did we ever decide on how to prioritize reviews?

I don't know how to do this.

> I think it might be helpful if Joel and I prioritize issues.

I think that it would be useful. Although of course different people will
have different priorities (depending for instance on the type of data
that we process).
I guess that we can agree on a large part of the
prioritization, and hence it will be useful.

> Obviously that will only make sense if the other team members check up
> on it when deciding what to review / work on.

So, the big question is: how do we do this? Isn't there one of the many
project-management extensions of GitHub that enables this?

From ragvrv at gmail.com Sat Dec 3 12:26:29 2016
From: ragvrv at gmail.com (Raghav R V)
Date: Sat, 3 Dec 2016 18:26:29 +0100
Subject: [scikit-learn] Github project management tools
In-Reply-To: <20161203105215.GH455403@phare.normalesup.org>
References: <41f0eb1c-c877-0c4d-0f56-9485f57c0eae@gmail.com> <20161203105215.GH455403@phare.normalesup.org>
Message-ID:

We could start with assigning priority labels like they use in numpy...
That + milestones could help us prioritize?

On Sat, Dec 3, 2016 at 11:52 AM, Gael Varoquaux <gael.varoquaux at normalesup.org> wrote:
> On Fri, Dec 02, 2016 at 07:52:09PM -0500, Andy wrote:
> > So did we ever decide on how to prioritize reviews?
>
> I don't know how to do this.
>
> > I think it might be helpful if Joel and I prioritize issues.
>
> I think that it would be useful. Although of course different people will
> have different priorities (depending for instance on the type of data
> that we process). I guess that we can agree on a large part of the
> prioritization, and hence it will be useful.
>
> > Obviously that will only make sense if the other team members check up
> > on it when deciding what to review / work on.
>
> So, the big question is: how do we do this? Isn't there one of the many
> project-management extensions of GitHub that enables this?
>

--
Raghav RV
https://github.com/raghavrv
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From avisochek3 at gmail.com Sat Dec 3 12:19:52 2016
From: avisochek3 at gmail.com (Allan Visochek)
Date: Sat, 3 Dec 2016 12:19:52 -0500
Subject: [scikit-learn] Markov Clustering?
Message-ID:

Hi there,

My name is Allan Visochek. I'm a data scientist and web developer and I
love scikit-learn, so first of all, thanks so much for the work that you do.

I'm reaching out because I've found the Markov clustering algorithm to be
quite useful in some of my work and noticed that there is no
implementation in scikit-learn. Is anybody working on this? If not, I'd be
happy to take this on. I'm new to open source, but I've been working with
Python for a few years now.

Best,
-Allan
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From t3kcit at gmail.com Sat Dec 3 13:08:55 2016
From: t3kcit at gmail.com (Andy)
Date: Sat, 3 Dec 2016 13:08:55 -0500
Subject: [scikit-learn] Github project management tools
In-Reply-To:
References: <41f0eb1c-c877-0c4d-0f56-9485f57c0eae@gmail.com> <20161203105215.GH455403@phare.normalesup.org>
Message-ID: <43b13054-ef73-54b3-0def-d138e814823d@gmail.com>

On 12/03/2016 12:26 PM, Raghav R V wrote:
> We could start with assigning priority labels like they use in
> numpy... That + milestones could help us prioritize?
>
I feel milestones are too coarse. Or I'm using them wrong.
And priority labels only work if people don't use the "high priority" all
the time.
There is a lot of stuff labeled "bug", which I would interpret as "highest
priority", that people don't look at at all.

From t3kcit at gmail.com Sat Dec 3 13:10:36 2016
From: t3kcit at gmail.com (Andy)
Date: Sat, 3 Dec 2016 13:10:36 -0500
Subject: [scikit-learn] Fwd: Scikit-learn MLPRegressor Help
In-Reply-To: <20161203102926.GG455403@phare.normalesup.org>
References: <20161203102926.GG455403@phare.normalesup.org>
Message-ID:

On 12/03/2016 05:29 AM, Gael Varoquaux wrote:
> On Sat, Dec 03, 2016 at 03:08:00PM +0530, Alekh Karkada Ashok wrote:
>> I want use the Scikit-learn's MLPRegressor to map image to image. That is I
>> have a numpy array of size [1000,2030400] (1000 samples, 76800x3 (RGB) pixels).
>> Corresponding labelled images I have. Therefore Y is also [1000,230400]. But
>> according to documentation:
> 1 thousand samples and 2030 thousand features: you are using the wrong
> tool; a multi-layer perceptron model will be too complex and will overfit in
> this setting. I would suggest a ridge regression.
>
>
These are images! Don't use ridge, use a convolutional neural network.
Our MLP is not convolutional, so it will not be useful.
There is a lot of material out there on how to use convolutional neural
networks for image labeling (it looks like you have one label per pixel, not
per image).

From t3kcit at gmail.com Sat Dec 3 13:13:50 2016
From: t3kcit at gmail.com (Andy)
Date: Sat, 3 Dec 2016 13:13:50 -0500
Subject: [scikit-learn] Trying to get learning curves with custom scorer and leave one group out
In-Reply-To: <8feb0c3aa67fc63a3754d266e053f2e6.squirrel@mycarta.ca>
References: <087b462738da5f6bef59c9eac0c7bc08.squirrel@mycarta.ca> <8feb0c3aa67fc63a3754d266e053f2e6.squirrel@mycarta.ca>
Message-ID: <264dc532-ed6c-6aed-ac1c-7a6fbad2c2b5@gmail.com>

That indeed looks odd. Can you reproduce with synthetic data?
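
One way such a synthetic check might look is sketched below. Everything in it
is an assumption made for illustration: the make_classification data, the six
round-robin group labels, the shortened C range and the helper name
held_out_scores are stand-ins, not the original poster's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import LeaveOneGroupOut, validation_curve
from sklearn.svm import SVC

# Synthetic stand-in for the private data: 3 classes, 6 arbitrary groups.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
groups = np.arange(300) % 6  # hypothetical group labels, assigned round-robin

param_range = np.logspace(-2, 2, 5)
logo = LeaveOneGroupOut()


def held_out_scores(scoring):
    """Return the test-score matrix (n_params x n_groups) for one scorer."""
    _, test = validation_curve(SVC(kernel='rbf', gamma=0.01), X, y,
                               param_name="C", param_range=param_range,
                               groups=groups, cv=logo, scoring=scoring)
    return test


f1_micro_scores = held_out_scores(make_scorer(f1_score, average='micro'))
accuracy_scores = held_out_scores('accuracy')

# Compare the two score matrices entry by entry.
print(np.allclose(f1_micro_scores, accuracy_scores))
```

Passing groups= together with cv=logo lets validation_curve do the splitting
itself. Swapping in average='macro' is a quick way to see a scorer that does
not have to coincide with accuracy.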
On 12/02/2016 10:40 PM, Matteo Niccoli wrote: > My apologies, there was a typo in the code below, second example, should > read: > > train_scores1, test_scores1 = validation_curve(SVC_classifier_LOWO_VC1, X, > y, "C", parm_range1, cv =logo.split(X, y, groups=groups), scoring = > 'accuracy') > > Everything else is correct. > > > On Fri, December 2, 2016 10:28 pm, Matteo Niccoli wrote: >> HI all, >> >> >> I want to plot learning curves on a trained SVM classifier, using a >> custom scorer, and using Leave One Group Out as the method of >> crossvalidation. I thought I had it figured out, but two different scorers >> - 'f1_micro' and >> 'accuracy' - will yield identical values. I am confused, is that supposed >> to be the case? >> >> Here's my code (unfortunately I cannot share the data as it is not open): >> >> >> from sklearn import svm SVC_classifier_LOWO_VC0 = svm.SVC(cache_size=800, >> class_weight=None, coef0=0.0, decision_function_shape=None, degree=3, >> gamma=0.01, kernel='rbf', max_iter=-1, probability=False, random_state=1, >> shrinking=True, tol=0.001, verbose=False) training_data = >> pd.read_csv('training_data.csv') scaler = >> preprocessing.StandardScaler().fit(X) X = scaler.transform(X) >> y = training_data['Targets'].values groups = training_data["Groups"].values >> Fscorer = make_scorer(f1_score, average = 'micro') >> logo = LeaveOneGroupOut() parm_range0 = np.logspace(-2, 6, 9) > train_scores0, >> test_scores0 = validation_curve(SVC_classifier_LOWO_VC0, X, y, "C", >> parm_range0, cv =logo.split(X, y, groups=groups), scoring = Fscorer) >> >> >> Now, from: >> train_scores_mean0 = np.mean(train_scores0, axis=1) train_scores_std0 = >> np.std(train_scores0, axis=1) test_scores_mean0 = np.mean(test_scores0, >> axis=1) test_scores_std0 = np.std(test_scores0, axis=1) print >> test_scores_mean0 print np.amax(test_scores_mean0) print np.logspace(-2, >> 6, 9)[test_scores_mean0.argmax(axis=0)] >> >> >> I get: >> [ 0.20257407 0.35551122 0.40791047 0.49887676 
0.5021742 0.50030438 >> 0.49426622 0.48066419 0.4868987 ] >> 0.502174200206 >> 100.0 >> >> >> If I create a new classifier, but with the same parameters, and run >> everything exactly as before, except for the scoring, e.g.: >> >> parm_range1 = np.logspace(-2, 6, 9) train_scores1, test_scores1 = >> validation_curve(SVC_classifier_LOWO_VC1, X, y, "C", parm_range1, cv >> =logo.split(X, y, groups=wells), scoring = >> 'accuracy') >> train_scores_mean1 = np.mean(train_scores1, axis=1) train_scores_std1= >> np.std(train_scores1, axis=1) test_scores_mean1 = np.mean(test_scores1, >> axis=1) test_scores_std1 = np.std(test_scores1, axis=1) print >> test_scores_mean1 print np.amax(test_scores_mean1) print np.logspace(-2, >> 6, 9)[test_scores_mean1.argmax(axis=0)] >> >> >> I get exactly the same answer: >> [ 0.20257407 0.35551122 0.40791047 0.49887676 0.5021742 0.50030438 >> 0.49426622 0.48066419 0.4868987 ] >> 0.502174200206 >> 100.0 >> >> >> How is that possible, am I doing something wrong, or missing something? >> >> >> Thanks >> >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From nelle.varoquaux at gmail.com Sat Dec 3 13:20:14 2016 From: nelle.varoquaux at gmail.com (Nelle Varoquaux) Date: Sat, 3 Dec 2016 10:20:14 -0800 Subject: [scikit-learn] Github project management tools In-Reply-To: <43b13054-ef73-54b3-0def-d138e814823d@gmail.com> References: <41f0eb1c-c877-0c4d-0f56-9485f57c0eae@gmail.com> <20161203105215.GH455403@phare.normalesup.org> <43b13054-ef73-54b3-0def-d138e814823d@gmail.com> Message-ID: On 3 December 2016 at 10:08, Andy wrote: > > > On 12/03/2016 12:26 PM, Raghav R V wrote: >> >> We could start with assigning priority labels like they use in numpy... >> That + milestones could help us prioritize? >> > I feel milestones are too coarse. Or I'm using them wrong. 
> And priority labels only work if people don't use the "high priority" all
> the time.
> There is a lot of stuff labeled "bug", which I would interpret as "highest
> priority" that people don't look at at all.

Even milestones only work if people don't use the next milestone all
the time. I think the only useful milestone is the one for release-critical
bugs for the next release.
For example, on matplotlib, I am currently only reviewing and working
on tickets for the 2.0 milestone, as we're hoping to get a new
release candidate out this weekend.

From t3kcit at gmail.com Sat Dec 3 14:07:33 2016
From: t3kcit at gmail.com (Andy)
Date: Sat, 3 Dec 2016 14:07:33 -0500
Subject: [scikit-learn] Github project management tools
In-Reply-To:
References: <41f0eb1c-c877-0c4d-0f56-9485f57c0eae@gmail.com> <20161203105215.GH455403@phare.normalesup.org> <43b13054-ef73-54b3-0def-d138e814823d@gmail.com>
Message-ID:

On 12/03/2016 01:20 PM, Nelle Varoquaux wrote:
> On 3 December 2016 at 10:08, Andy wrote:
>>
>> On 12/03/2016 12:26 PM, Raghav R V wrote:
>>> We could start with assigning priority labels like they use in numpy...
>>> That + milestones could help us prioritize?
>>>
>> I feel milestones are too coarse. Or I'm using them wrong.
>> And priority labels only work if people don't use the "high priority" all
>> the time.
>> There is a lot of stuff labeled "bug", which I would interpret as "highest
>> priority" that people don't look at at all.
> Even milestones only work if people don't use the next milestone all
> the time. I think the only useful milestone is the one for release-critical
> bugs for the next release.
> For example, on matplotlib, I am currently only reviewing and working
> on tickets for the 2.0 milestone, as we're hoping to get a new
> release candidate out this weekend.
>
>
That's what I meant by "probably doing it wrong". I assign it too often.
But actually I think people mostly ignore it anyhow ;)

From jmschreiber91 at gmail.com Sat Dec 3 14:12:47 2016
From: jmschreiber91 at gmail.com (Jacob Schreiber)
Date: Sat, 3 Dec 2016 11:12:47 -0800
Subject: [scikit-learn] Markov Clustering?
In-Reply-To:
References:
Message-ID:

I don't think anyone is working on this. Contributions are always very
welcome, but be aware before you start that the process of getting a
completely new algorithm into scikit-learn will take a lot of time and
reviews.

On Sat, Dec 3, 2016 at 9:19 AM, Allan Visochek wrote:
> Hi there,
>
> My name is Allan Visochek, I'm a data scientist and web developer and I
> love scikit-learn so first of all, thanks so much for the work that you do.
>
> I'm reaching out because I've found the markov clustering algorithm to be
> quite useful for me in some of my work and noticed that there is no
> implementation in scikit-learn, is anybody working on this? If not, id be
> happy to take this on. I'm new to open source, but I've been working with
> python for a few years now.
>
> Best,
> -Allan
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From alekhka at gmail.com Sat Dec 3 15:10:55 2016
From: alekhka at gmail.com (Alekh Karkada Ashok)
Date: Sun, 4 Dec 2016 01:40:55 +0530
Subject: [scikit-learn] Fwd: Scikit-learn MLPRegressor Help
In-Reply-To:
References: <20161203102926.GG455403@phare.normalesup.org>
Message-ID:

Hey All,

I chose MLP because they were images and I have heard MLPs perform better.
My application is detecting body parts from these images, so the mapping
would be pretty non-linear, and this was my idea behind selecting MLP.
Otherwise, I would have to engineer high-dimensional features by hand.
I have 2030400 pixels and making higher-dimensional features would require
a lot more memory.

Where do you want me to open the issue? GitHub? I don't think the error
is only in the documentation. Because when Y is [2030400,1] there is no
MemoryError (treated as 2030400 samples with a single feature) and when I
try to fit [1,2030400] it throws MemoryError. If the cause was memory, both
should have thrown the error, right?

I am still a novice but I am fairly good with Python. I am taken aback by
scikit-learn's sheer beauty and simplicity. I would love to contribute code
to it. Can you please tell me how I can get started?

Thanks a lot!

On Sat, Dec 3, 2016 at 11:40 PM, Andy wrote:
>
>
> On 12/03/2016 05:29 AM, Gael Varoquaux wrote:
>
>> On Sat, Dec 03, 2016 at 03:08:00PM +0530, Alekh Karkada Ashok wrote:
>>
>>> I want use the Scikit-learn's MLPRegressor to map image to image. That
>>> is I have a numpy array of size [1000,2030400] (1000 samples, 76800x3
>>> (RGB) pixels).
>>> Corresponding labelled images I have. Therefore Y is also [1000,230400].
>>> But according to documentation:
>>>
>> 1 thousand samples and 2030 thousand features: you are using the wrong
>> tool; a multi-layer perceptron model will be too complex and will overfit
>> in this setting. I would suggest a ridge regression.
>>
>>
> These are images! Don't use ridge, use a convolutional neural network.
> Our MLP is not convolutional, so it will not be useful.
> There is a lot of material out there on how to use convolutional neural
> networks for image labeling (it looks like you have one label per pixel,
> not per image).
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From t3kcit at gmail.com Sat Dec 3 15:34:55 2016
From: t3kcit at gmail.com (Andy)
Date: Sat, 3 Dec 2016 15:34:55 -0500
Subject: [scikit-learn] Markov Clustering?
In-Reply-To:
References:
Message-ID:

Hi Allan.
Can you provide the original paper?
Is this something usually used on sparse graphs? We do have algorithms
that operate on data-induced graphs, like SpectralClustering, but we
don't really implement general graph algorithms (there's no PageRank or
community detection).

Andy

On 12/03/2016 12:19 PM, Allan Visochek wrote:
> Hi there,
>
> My name is Allan Visochek, I'm a data scientist and web developer and
> I love scikit-learn so first of all, thanks so much for the work that
> you do.
>
> I'm reaching out because I've found the markov clustering algorithm to
> be quite useful for me in some of my work and noticed that there is no
> implementation in scikit-learn, is anybody working on this? If not, id
> be happy to take this on. I'm new to open source, but I've been
> working with python for a few years now.
>
> Best,
> -Allan
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From t3kcit at gmail.com Sat Dec 3 15:41:45 2016
From: t3kcit at gmail.com (Andy)
Date: Sat, 3 Dec 2016 15:41:45 -0500
Subject: [scikit-learn] Fwd: Scikit-learn MLPRegressor Help
In-Reply-To:
References: <20161203102926.GG455403@phare.normalesup.org>
Message-ID:

On 12/03/2016 03:10 PM, Alekh Karkada Ashok wrote:
>
> Hey All,
>
> I chose MLP because they were images and I have heard MLPs perform better.
Better than a convolutional neural net?
Whoever told you that was wrong. I usually don't make absolute
statements like this, but this is something that is pretty certain.
>
> Where do you want to me to open the issue? GitHub? I don't think the
> error is only in documentation.
Because when Y is [2030400,1] there is
> no MemoryError (treated as 2030400 samples with a single feature) and
> when I try to fit [1,2030400] it throws MemoryError. If the case was
> memory, both should have thrown the error right?
MLPClassifier actually supports multi-label classification (which is not
documented correctly; I made an issue here:
https://github.com/scikit-learn/scikit-learn/issues/7972).

MLPClassifier does not support multi-output (multi-class multi-output),
which is probably what you want.

From avn at mccme.ru Sat Dec 3 15:39:04 2016
From: avn at mccme.ru (avn at mccme.ru)
Date: Sat, 03 Dec 2016 23:39:04 +0300
Subject: [scikit-learn] Adding samplers for intersection/Jensen-Shannon kernels
Message-ID: <2379df1fffaf791977177019377b57bc@mccme.ru>

Hello,

In the course of my work, I've made samplers for
intersection/Jensen-Shannon kernels, just by small modifications to the
sklearn.kernel_approximation.AdditiveChi2Sampler code. The intersection
kernel proved to be the best one for my task (clustering Docstrum
feature vectors), so perhaps it'd be good to add those samplers
alongside AdditiveChi2Sampler? Should I proceed with creating a pull
request? Or perhaps those kernels were not included for some good reason?

With best regards,
-- Valery

From t3kcit at gmail.com Sat Dec 3 16:23:21 2016
From: t3kcit at gmail.com (Andy)
Date: Sat, 3 Dec 2016 16:23:21 -0500
Subject: [scikit-learn] Adding samplers for intersection/Jensen-Shannon kernels
In-Reply-To: <2379df1fffaf791977177019377b57bc@mccme.ru>
References: <2379df1fffaf791977177019377b57bc@mccme.ru>
Message-ID:

Hi Valery.
I didn't include them because the Chi2 worked better for my task ;)
In hindsight, I'm not sure if these kernels are not a bit too
specialized for scikit-learn.
But given that we have the (slightly more obscure) SkewedChi2 and
AdditiveChi2, I think the intersection one would be a good addition if
you found it useful.
Andy On 12/03/2016 03:39 PM, Valery Anisimovsky via scikit-learn wrote: > Hello, > > In the course of my work, I've made samplers for > intersection/Jensen-Shannon kernels, just by small modifications to > sklearn.kernel_approximation.AdditiveChi2Sampler code. Intersection > kernel proved to be the best one for my task (clustering Docstrum > feature vectors), so perhaps it'd be good to add those samplers > alongside AdditiveChi2Sampler? Should I proceed with creating a pull > request? Or, perhaps, those kernels were not already included for some > good reason? > > With best regards, > -- Valery > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From avisochek3 at gmail.com Sat Dec 3 16:33:58 2016 From: avisochek3 at gmail.com (Allan Visochek) Date: Sat, 3 Dec 2016 16:33:58 -0500 Subject: [scikit-learn] Markov Clustering? In-Reply-To: References: Message-ID: Hey Andy, This algorithm does operate on sparse graphs so it may be beyond the scope of sci-kit learn, let me know what you think. The website is here , it includes a brief description of how the algorithm operates under Documentation -> Overview1 and Overview2. The references listed on the website are included below. Best, -Allan [1] Stijn van Dongen. *Graph Clustering by Flow Simulation*. PhD thesis, University of Utrecht, May 2000. http://www.library.uu.nl/digiarchief/dip/diss/1895620/inhoud.htm [2] Stijn van Dongen. *A cluster algorithm for graphs*. Technical Report INS-R0010, National Research Institute for Mathematics and Computer Science in the Netherlands, Amsterdam, May 2000. http://www.cwi.nl/ftp/CWIreports/INS/INS-R0010.ps.Z [3] Stijn van Dongen. *A stochastic uncoupling process for graphs*. Technical Report INS-R0011, National Research Institute for Mathematics and Computer Science in the Netherlands, Amsterdam, May 2000. 
http://www.cwi.nl/ftp/CWIreports/INS/INS-R0011.ps.Z [4] Stijn van Dongen. *Performance criteria for graph clustering and Markov cluster experiments*. Technical Report INS-R0012, National Research Institute for Mathematics and Computer Science in the Netherlands, Amsterdam, May 2000. http://www.cwi.nl/ftp/CWIreports/INS/INS-R0012.ps.Z [5] Enright A.J., Van Dongen S., Ouzounis C.A. *An efficient algorithm for large-scale detection of protein families*, Nucleic Acids Research 30(7):1575-1584 (2002). On Sat, Dec 3, 2016 at 3:34 PM, Andy wrote: > Hi Allan. > Can you provide the original paper? > It this something usually used on sparse graphs? We do have algorithms > that operate on data-induced > graphs, like SpectralClustering, but we don't really implement general > graph algorithms (there's no PageRank or community detection). > > Andy > > > On 12/03/2016 12:19 PM, Allan Visochek wrote: > > Hi there, > > My name is Allan Visochek, I'm a data scientist and web developer and I > love scikit-learn so first of all, thanks so much for the work that you do. > > I'm reaching out because I've found the markov clustering algorithm to be > quite useful for me in some of my work and noticed that there is no > implementation in scikit-learn, is anybody working on this? If not, id be > happy to take this on. I'm new to open source, but I've been working with > python for a few years now. > > Best, > -Allan > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Sat Dec 3 16:45:02 2016 From: t3kcit at gmail.com (Andy) Date: Sat, 3 Dec 2016 16:45:02 -0500 Subject: [scikit-learn] Markov Clustering? 
In-Reply-To: References: Message-ID: Hey Allan. None of the references apart from the last one seems to be published in a peer-reviewed place, is that right? And "A stochastic uncoupling process for graphs" has 13 citations since 2000. Unless there is a more prominent publication or evidence of heavy use, I think it's disqualified. Academia is certainly not the only metric for evaluation, so if you have others, that's good, too ;) Best, Andy On 12/03/2016 04:33 PM, Allan Visochek wrote: > Hey Andy, > > This algorithm does operate on sparse graphs so it may be beyond the > scope of sci-kit learn, let me know what you think. > The website is here , it includes a brief > description of how the algorithm operates under Documentation -> > Overview1 and Overview2. > The references listed on the website are included below. > > Best, > -Allan > > [1] Stijn van Dongen. /Graph Clustering by Flow Simulation/. PhD > thesis, University of Utrecht, May 2000. > http://www.library.uu.nl/digiarchief/dip/diss/1895620/inhoud.htm > > > [2] Stijn van Dongen. /A cluster algorithm for graphs/. Technical > Report INS-R0010, National Research Institute for Mathematics and > Computer Science in the Netherlands, Amsterdam, May 2000. > http://www.cwi.nl/ftp/CWIreports/INS/INS-R0010.ps.Z > > > [3] Stijn van Dongen. /A stochastic uncoupling process for graphs/. > Technical Report INS-R0011, National Research Institute for > Mathematics and Computer Science in the Netherlands, Amsterdam, May 2000. > http://www.cwi.nl/ftp/CWIreports/INS/INS-R0011.ps.Z > > > [4] Stijn van Dongen. /Performance criteria for graph clustering and > Markov cluster experiments/. Technical Report INS-R0012, National > Research Institute for Mathematics and Computer Science in the > Netherlands, Amsterdam, May 2000. > http://www.cwi.nl/ftp/CWIreports/INS/INS-R0012.ps.Z > > > [5] Enright A.J., Van Dongen S., Ouzounis C.A. 
/An efficient algorithm > for large-scale detection of protein families/, Nucleic Acids Research > 30(7):1575-1584 (2002). > > > On Sat, Dec 3, 2016 at 3:34 PM, Andy > wrote: > > Hi Allan. > Can you provide the original paper? > It this something usually used on sparse graphs? We do have > algorithms that operate on data-induced > graphs, like SpectralClustering, but we don't really implement > general graph algorithms (there's no PageRank or community detection). > > Andy > > > On 12/03/2016 12:19 PM, Allan Visochek wrote: >> Hi there, >> >> My name is Allan Visochek, I'm a data scientist and web developer >> and I love scikit-learn so first of all, thanks so much for the >> work that you do. >> >> I'm reaching out because I've found the markov clustering >> algorithm to be quite useful for me in some of my work and >> noticed that there is no implementation in scikit-learn, is >> anybody working on this? If not, id be happy to take this on. I'm >> new to open source, but I've been working with python for a few >> years now. >> >> Best, >> -Allan >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ scikit-learn > mailing list scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... 
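For readers who have not met the algorithm discussed in this thread: the core loop of Markov clustering, as described in van Dongen's work cited above, alternates expansion (squaring a column-stochastic matrix) with inflation (raising every entry to a power and renormalising the columns). The following is a minimal sketch under stated assumptions. It uses dense Python lists for readability, whereas real use on sparse graphs would need sparse matrices. The function names, the default inflation of 2.0, the added self-loops, and the thresholds are illustrative choices, not taken from any reference implementation.

```python
def normalize_columns(m):
    """Scale every column of a square matrix so that it sums to 1."""
    n = len(m)
    sums = [sum(m[r][c] for r in range(n)) for c in range(n)]
    return [[m[r][c] / sums[c] if sums[c] else 0.0 for c in range(n)]
            for r in range(n)]


def matmul(a, b):
    """Plain dense product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_cluster(adjacency, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster an undirected graph given as a 0/1 adjacency matrix."""
    n = len(adjacency)
    # Add self-loops, then turn the matrix into column-stochastic form.
    m = [[float(adjacency[r][c]) + (1.0 if r == c else 0.0) for c in range(n)]
         for r in range(n)]
    m = normalize_columns(m)
    for _ in range(max_iter):
        expanded = matmul(m, m)  # expansion step (exponent 2)
        inflated = normalize_columns(
            [[v ** inflation for v in row] for row in expanded])  # inflation
        change = max(abs(inflated[r][c] - m[r][c])
                     for r in range(n) for c in range(n))
        m = inflated
        if change < tol:
            break
    # Each distinct non-empty row support is read off as one cluster.
    clusters = []
    for row in m:
        members = frozenset(c for c, v in enumerate(row) if v > 1e-6)
        if members and members not in clusters:
            clusters.append(members)
    return [sorted(c) for c in clusters]


# Two disconnected triangles: nodes 0-2 and nodes 3-5.
two_triangles = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(markov_cluster(two_triangles))  # [[0, 1, 2], [3, 4, 5]]
```

The input here is an explicit graph rather than a feature matrix, which is the distinction raised earlier in the thread between data-induced graphs and general graph algorithms.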
URL: From alekhka at gmail.com Sat Dec 3 17:00:04 2016 From: alekhka at gmail.com (Alekh Karkada Ashok) Date: Sun, 4 Dec 2016 03:30:04 +0530 Subject: [scikit-learn] Fwd: Scikit-learn MLPRegressor Help In-Reply-To: References: <20161203102926.GG455403@phare.normalesup.org> Message-ID: No, I am not saying it is better than CNN, but my images aren't real-life images but computer generated silhouettes. So CNN seemed to be overkill. I'll revisit CNN. I resized the images and converted it to grayscale. Now I am feeding [1,4800] now and I am getting good output with MLP. I looped over all my images and used partial_fit to train each one. I didn't get what you meant by MLPClassifier doesn't support multi-output. Thanks for the help! On Sun, Dec 4, 2016 at 2:11 AM, Andy wrote: > > > On 12/03/2016 03:10 PM, Alekh Karkada Ashok wrote: > >> >> Hey All, >> >> I chose MLP because they were images and I have heard MLPs perform better. >> > Better than a convolutional neural net? Whoever told you that was wrong. I > usually don't make absolute statements like this, but this is something > that is pretty certain. > > >> Where do you want to me to open the issue? GitHub? I don't think the >> error is only in documentation. Because when Y is [2030400,1] there is no >> MemoryError (treated as 2030400 samples with a single feature) and when I >> try to fit [1,2030400] it throws MemoryError. If the case was memory, both >> should have thrown the error right? >> > MLPClassifier actually supports multi-label classification (which is not > documented correctly and I made an issue here: > https://github.com/scikit-learn/scikit-learn/issues/7972) > MLPClassifier does not support multi-output (multi-class multi-output), > which is probably what you want. 
> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vaggi.federico at gmail.com Sat Dec 3 17:35:03 2016 From: vaggi.federico at gmail.com (federico vaggi) Date: Sat, 03 Dec 2016 22:35:03 +0000 Subject: [scikit-learn] Fwd: Scikit-learn MLPRegressor Help In-Reply-To: References: <20161203102926.GG455403@phare.normalesup.org> Message-ID: As long as the feature ordering has a meaningful spatial component (as is almost always the case when you are dealing with raw pixels as features) CNNs will almost always be better. CNNs actually have a lot fewer parameters than MLPs (depending on architecture of course) because of weight sharing among the parameters of the convolutional kernel within a feature map. On Sat, 3 Dec 2016 at 23:00 Alekh Karkada Ashok wrote: > No, I am not saying it is better than CNN, but my images aren't real-life > images but computer generated silhouettes. So CNN seemed to be overkill. > I'll revisit CNN. I resized the images and converted it to grayscale. Now I > am feeding [1,4800] now and I am getting good output with MLP. I looped > over all my images and used partial_fit to train each one. > I didn't get what you meant by MLPClassifier doesn't support multi-output. > Thanks for the help! > > On Sun, Dec 4, 2016 at 2:11 AM, Andy wrote: > > > > On 12/03/2016 03:10 PM, Alekh Karkada Ashok wrote: > > > Hey All, > > I chose MLP because they were images and I have heard MLPs perform better. > > Better than a convolutional neural net? Whoever told you that was wrong. I > usually don't make absolute statements like this, but this is something > that is pretty certain. > > > Where do you want to me to open the issue? GitHub? I don't think the error > is only in documentation. 
Because when Y is [2030400,1] there is no > MemoryError (treated as 2030400 samples with a single feature) and when I > try to fit [1,2030400] it throws MemoryError. If the case was memory, both > should have thrown the error right? > > MLPClassifier actually supports multi-label classification (which is not > documented correctly and I made an issue here: > https://github.com/scikit-learn/scikit-learn/issues/7972) > MLPClassifier does not support multi-output (multi-class multi-output), > which is probably what you want. > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From avisochek3 at gmail.com Sat Dec 3 17:43:30 2016 From: avisochek3 at gmail.com (Allan Visochek) Date: Sat, 3 Dec 2016 17:43:30 -0500 Subject: [scikit-learn] Markov Clustering? In-Reply-To: References: Message-ID: Thanks for pointing that out, I sort of picked it up by word of mouth so I'd assumed it had a bit more precedence in the academic world. I'll look into it a little more, but I'd definitely be interested in contributing something else if that doesn't work out. -Allan On Sat, Dec 3, 2016 at 4:45 PM, Andy wrote: > Hey Allan. > > None of the references apart from the last one seems to be published in a > peer-reviewed place, is that right? > And "A stochastic uncoupling process for graphs" has 13 citations since > 2000. Unless there is a more prominent > publication or evidence of heavy use, I think it's disqualified. 
> Academia is certainly not the only metric for evaluation, so if you have > others, that's good, too ;) > > Best, > Andy > > On 12/03/2016 04:33 PM, Allan Visochek wrote: > > Hey Andy, > > This algorithm does operate on sparse graphs so it may be beyond the scope > of sci-kit learn, let me know what you think. > The website is here , it includes a brief > description of how the algorithm operates under Documentation -> Overview1 > and Overview2. > The references listed on the website are included below. > > Best, > -Allan > > [1] Stijn van Dongen. *Graph Clustering by Flow Simulation*. PhD thesis, > University of Utrecht, May 2000. > http://www.library.uu.nl/digiarchief/dip/diss/1895620/inhoud.htm > > [2] Stijn van Dongen. *A cluster algorithm for graphs*. Technical Report > INS-R0010, National Research Institute for Mathematics and Computer Science > in the Netherlands, Amsterdam, May 2000. > http://www.cwi.nl/ftp/CWIreports/INS/INS-R0010.ps.Z > > [3] Stijn van Dongen. *A stochastic uncoupling process for graphs*. > Technical Report INS-R0011, National Research Institute for Mathematics and > Computer Science in the Netherlands, Amsterdam, May 2000. > http://www.cwi.nl/ftp/CWIreports/INS/INS-R0011.ps.Z > > [4] Stijn van Dongen. *Performance criteria for graph clustering and > Markov cluster experiments*. Technical Report INS-R0012, National > Research Institute for Mathematics and Computer Science in the Netherlands, > Amsterdam, May 2000. > http://www.cwi.nl/ftp/CWIreports/INS/INS-R0012.ps.Z > > [5] Enright A.J., Van Dongen S., Ouzounis C.A. *An efficient algorithm > for large-scale detection of protein families*, Nucleic Acids Research > 30(7):1575-1584 (2002). > > On Sat, Dec 3, 2016 at 3:34 PM, Andy wrote: > >> Hi Allan. >> Can you provide the original paper? >> It this something usually used on sparse graphs? 
We do have algorithms >> that operate on data-induced >> graphs, like SpectralClustering, but we don't really implement general >> graph algorithms (there's no PageRank or community detection). >> >> Andy >> >> >> On 12/03/2016 12:19 PM, Allan Visochek wrote: >> >> Hi there, >> >> My name is Allan Visochek, I'm a data scientist and web developer and I >> love scikit-learn so first of all, thanks so much for the work that you do. >> >> I'm reaching out because I've found the markov clustering algorithm to be >> quite useful for me in some of my work and noticed that there is no >> implementation in scikit-learn, is anybody working on this? If not, id be >> happy to take this on. I'm new to open source, but I've been working with >> python for a few years now. >> >> Best, >> -Allan >> >> >> _______________________________________________ >> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ scikit-learn mailing >> list scikit-learn at python.org https://mail.python.org/mailma >> n/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From drraph at gmail.com Sun Dec 4 03:18:54 2016 From: drraph at gmail.com (Raphael C) Date: Sun, 04 Dec 2016 08:18:54 +0000 Subject: [scikit-learn] Markov Clustering? In-Reply-To: References: Message-ID: I think you get a better view of the importance of Markov Clustering in academia from https://scholar.google.co.uk/scholar?hl=en&as_sdt=0,5&q=Markov+clustering . 
Raphael On Sat, 3 Dec 2016 at 22:43 Allan Visochek wrote: > Thanks for pointing that out, I sort of picked it up by word of mouth so > I'd assumed it had a bit more precedence in the academic world. > > I'll look into it a little more, but I'd definitely be interested in > contributing something else if that doesn't work out. > > -Allan > > On Sat, Dec 3, 2016 at 4:45 PM, Andy wrote: > > Hey Allan. > > None of the references apart from the last one seems to be published in a > peer-reviewed place, is that right? > And "A stochastic uncoupling process for graphs" has 13 citations since > 2000. Unless there is a more prominent > publication or evidence of heavy use, I think it's disqualified. > Academia is certainly not the only metric for evaluation, so if you have > others, that's good, too ;) > > Best, > Andy > > On 12/03/2016 04:33 PM, Allan Visochek wrote: > > Hey Andy, > > This algorithm does operate on sparse graphs so it may be beyond the scope > of sci-kit learn, let me know what you think. > The website is here , it includes a brief > description of how the algorithm operates under Documentation -> Overview1 > and Overview2. > The references listed on the website are included below. > > Best, > -Allan > > [1] Stijn van Dongen. *Graph Clustering by Flow Simulation*. PhD thesis, > University of Utrecht, May 2000. > http://www.library.uu.nl/digiarchief/dip/diss/1895620/inhoud.htm > > [2] Stijn van Dongen. *A cluster algorithm for graphs*. Technical Report > INS-R0010, National Research Institute for Mathematics and Computer Science > in the Netherlands, Amsterdam, May 2000. > http://www.cwi.nl/ftp/CWIreports/INS/INS-R0010.ps.Z > > [3] Stijn van Dongen. *A stochastic uncoupling process for graphs*. > Technical Report INS-R0011, National Research Institute for Mathematics and > Computer Science in the Netherlands, Amsterdam, May 2000. > http://www.cwi.nl/ftp/CWIreports/INS/INS-R0011.ps.Z > > [4] Stijn van Dongen. 
*Performance criteria for graph clustering and > Markov cluster experiments*. Technical Report INS-R0012, National > Research Institute for Mathematics and Computer Science in the Netherlands, > Amsterdam, May 2000. > http://www.cwi.nl/ftp/CWIreports/INS/INS-R0012.ps.Z > > [5] Enright A.J., Van Dongen S., Ouzounis C.A. *An efficient algorithm > for large-scale detection of protein families*, Nucleic Acids Research > 30(7):1575-1584 (2002). > > On Sat, Dec 3, 2016 at 3:34 PM, Andy wrote: > > Hi Allan. > Can you provide the original paper? > It this something usually used on sparse graphs? We do have algorithms > that operate on data-induced > graphs, like SpectralClustering, but we don't really implement general > graph algorithms (there's no PageRank or community detection). > > Andy > > > On 12/03/2016 12:19 PM, Allan Visochek wrote: > > Hi there, > > My name is Allan Visochek, I'm a data scientist and web developer and I > love scikit-learn so first of all, thanks so much for the work that you do. > > I'm reaching out because I've found the markov clustering algorithm to be > quite useful for me in some of my work and noticed that there is no > implementation in scikit-learn, is anybody working on this? If not, id be > happy to take this on. I'm new to open source, but I've been working with > python for a few years now. 
>
> Best,
> -Allan
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ragvrv at gmail.com Sun Dec 4 11:16:32 2016
From: ragvrv at gmail.com (Raghav R V)
Date: Sun, 4 Dec 2016 17:16:32 +0100
Subject: [scikit-learn] Github project management tools
In-Reply-To:
References: <41f0eb1c-c877-0c4d-0f56-9485f57c0eae@gmail.com> <20161203105215.GH455403@phare.normalesup.org> <43b13054-ef73-54b3-0def-d138e814823d@gmail.com>
Message-ID:

>>> Okay so in the project, instead of sorting them by Issues / PR why don't
>>> we make one column per priority. Let's have 3 levels and one column for
>>> Done. We have a label for "Stalled" / "Need Contributor" which shows up in
>>> the cards of the project anyway...
>>>
>>> As I didn't want to disturb the existing project setup, I created one
>>> for a demo - https://github.com/scikit-learn/scikit-learn/projects/7
>>> (I'm resending this e-mail as the last one was rejected because the
>>> attached image was huge for the mailing list)
>>>
>> Thanks Raghav

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ludo25_90 at hotmail.com Sun Dec 4 15:12:29 2016
From: ludo25_90 at hotmail.com (Ludovico Coletta)
Date: Sun, 4 Dec 2016 20:12:29 +0000
Subject: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit
Message-ID:

Dear scikit experts,

I'm struggling with the implementation of a nested cross-validation.

My data: I have 26 subjects (13 per class) x 6670 features. I used a feature reduction algorithm (you may have heard about Boruta) to reduce the dimensionality of my data. The problems start here: I defined LOSO as the outer partitioning scheme. Therefore, for each of the 26 CV folds I used 24 subjects for feature reduction. This led to a different number of features in each CV fold. Now, for each CV fold I would like to use the same 24 subjects for hyperparameter optimization (SVM with RBF kernel).

This is what I did:

    cv = list(LeaveOneOut(len(y)))  # in y I stored the labels

    inner_train = [None] * len(y)
    inner_test = [None] * len(y)

    ii = 0
    while ii < len(y):
        cv = list(LeaveOneOut(len(y)))
        a = cv[ii][0]
        a = a[:-1]
        inner_train[ii] = a

        b = cv[ii][0]
        b = np.array(b[((len(cv[0][0]))-1)])
        inner_test[ii]=b

        ii = ii + 1

    custom_cv = zip(inner_train,inner_test)  # inner cv

    pipe_logistic = Pipeline([('scl', StandardScaler()),('clf', SVC(kernel="rbf"))])

    parameters = [{'clf__C': np.logspace(-2, 10, 13), 'clf__gamma':np.logspace(-9, 3, 13)}]

    scores = [None] * (len(y))

    ii = 0
    while ii < len(scores):
        a = data[ii][0]  # data for train
        b = data[ii][1]  # data for test
        c = np.concatenate((a,b))  # shape: number of subjects * number of features
        d = cv[ii][0]  # labels for train
        e = cv[ii][1]  # label for test
        f = np.concatenate((d,e))

        grid_search = GridSearchCV(estimator=pipe_logistic, param_grid=parameters, verbose=1, scoring='accuracy', cv= zip(([custom_cv[ii][0]]), ([custom_cv[ii][1]])))

        scores[ii] = cross_validation.cross_val_score(grid_search, c, y[f], scoring='accuracy', cv = zip(([cv[ii][0]]), ([cv[ii][1]])))

        ii = ii + 1

However, I got the following error
message: index 25 is out of bounds for size 25

Would it be so bad if I skipped the nested LOSO and used the default settings for hyperparameter optimization?

Any help would be really appreciated.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From avn at mccme.ru Sun Dec 4 15:50:21 2016
From: avn at mccme.ru (avn at mccme.ru)
Date: Sun, 04 Dec 2016 23:50:21 +0300
Subject: [scikit-learn] Adding samplers for intersection/Jensen-Shannon kernels
In-Reply-To:
References: <2379df1fffaf791977177019377b57bc@mccme.ru>
Message-ID: <0511d5fa33737f78ccdf7fbb2e5b2156@mccme.ru>

I see now. So I'll proceed with adding documentation and unit tests for those kernels to complete their support. And I don't think they're too specialized, given that many kinds of feature vectors in, e.g., computer vision are in fact histograms, and all of those kernels are histogram-oriented.

Andy wrote on 2016-12-04 00:23:
> Hi Valery.
> I didn't include them because the Chi2 worked better for my task ;)
> In hindsight, I'm not sure if these kernels are not a bit too
> specialized for scikit-learn.
> But given that we have the (slightly more obscure) SkewedChi2 and
> AdditiveChi2,
> I think the intersection one would be a good addition if you found it
> useful.
>
> Andy
>
> On 12/03/2016 03:39 PM, Valery Anisimovsky via scikit-learn wrote:
>> Hello,
>>
>> In the course of my work, I've made samplers for
>> intersection/Jensen-Shannon kernels, just by small modifications to
>> sklearn.kernel_approximation.AdditiveChi2Sampler code. Intersection
>> kernel proved to be the best one for my task (clustering Docstrum
>> feature vectors), so perhaps it'd be good to add those samplers
>> alongside AdditiveChi2Sampler? Should I proceed with creating a pull
>> request? Or, perhaps, those kernels were not already included for some
>> good reason?
>> >> With best regards, >> -- Valery >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From ragvrv at gmail.com Sun Dec 4 16:27:02 2016 From: ragvrv at gmail.com (Raghav R V) Date: Sun, 4 Dec 2016 22:27:02 +0100 Subject: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit In-Reply-To: References: Message-ID: Hi! It looks like you are using the old sklearn.cross_validation's LeaveOneLabelOut cross-validator. It has been deprecated since v0.18. Use the LeaveOneLabelOut from sklearn.model_selection, that should fix your issue I think (thought I have not looked into your code in detail). HTH! On Sun, Dec 4, 2016 at 9:12 PM, Ludovico Coletta wrote: > Dear scikit experts, > > I'm struggling with the implementation of a nested cross validation. > > My data: I have 26 subjects (13 per class) x 6670 features. I used a > feature reduction algorithm (you may have heard about Boruta) to reduce the > dimensionality of my data. Problems start now: I defined LOSO as outer > partitioning schema. Therefore, for each of the 26 cv folds I used 24 > subjects for feature reduction. This lead to a different number of features > in each cv fold. Now, for each cv fold I would like to use the same 24 > subjects for hyperparameter optimization (SVM with rbf kernel). 
> > This is what I did: > > *cv = list(LeaveOneout(len(y))) # in y I stored the labels* > > *inner_train = [None] * len(y)* > > *inner_test = [None] * len(y)* > > *ii = 0* > > *while ii < len(y):* > * cv = list(LeaveOneOut(len(y))) * > * a = cv[ii][0]* > * a = a[:-1]* > * inner_train[ii] = a* > > * b = cv[ii][0]* > * b = np.array(b[((len(cv[0][0]))-1)])* > * inner_test[ii]=b* > > * ii = ii + 1* > > *custom_cv = zip(inner_train,inner_test) # inner cv* > > > *pipe_logistic = Pipeline([('scl', StandardScaler()),('clf', > SVC(kernel="rbf"))])* > > *parameters = [{'clf__C': np.logspace(-2, 10, 13), > 'clf__gamma':np.logspace(-9, 3, 13)}]* > > > > *scores = [None] * (len(y)) * > > *ii = 0* > > *while ii < len(scores):* > > * a = data[ii][0] # data for train* > * b = data[ii][1] # data for test* > * c = np.concatenate((a,b)) # shape: number of subjects * number of > features* > * d = cv[ii][0] # labels for train* > * e = cv[ii][1] # label for test* > * f = np.concatenate((d,e))* > > * grid_search = GridSearchCV(estimator=pipe_logistic, > param_grid=parameters, verbose=1, scoring='accuracy', cv= > zip(([custom_cv[ii][0]]), ([custom_cv[ii][1]])))* > > * scores[ii] = cross_validation.cross_val_score(grid_search, c, y[f], > scoring='accuracy', cv = zip(([cv[ii][0]]), ([cv[ii][1]])))* > > * ii = ii + 1* > > > > However, I got the following error message: index 25 is out of bounds for > size 25 > > Would it be so bad if I do not perform a nested LOSO but I use the default > setting for hyperparameter optimization? > > Any help would be really appreciated > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Raghav RV https://github.com/raghavrv -------------- next part -------------- An HTML attachment was scrubbed... 
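A note on the error quoted in this thread: "index 25 is out of bounds for size 25" is what one would expect if the inner splits carry subject indices from the full set of 26 while the grid search only sees the 25 training rows of the current outer fold. That reading is an assumption based on the posted snippet, not something confirmed in the thread. The sketch below uses plain Python, with hypothetical helper names `leave_one_out` and `nested_leave_one_out`, to show inner leave-one-out splits expressed as positions within each outer training fold.

```python
def leave_one_out(n):
    """Yield (train, test) position lists for leave-one-out over n samples."""
    for held_out in range(n):
        train = [i for i in range(n) if i != held_out]
        yield train, [held_out]


def nested_leave_one_out(n_subjects):
    """Yield (outer_train, outer_test, inner_splits) for nested leave-one-out.

    outer_train and outer_test hold subject indices into the full data set.
    inner_splits holds (train, test) positions into outer_train, so every
    inner position lies in range(len(outer_train)).
    """
    for outer_train, outer_test in leave_one_out(n_subjects):
        inner_splits = list(leave_one_out(len(outer_train)))
        yield outer_train, outer_test, inner_splits


n_subjects = 26  # as in the question
for outer_train, outer_test, inner_splits in nested_leave_one_out(n_subjects):
    # The largest position used by any inner split is 24, never 25.
    largest = max(max(train + test) for train, test in inner_splits)
    assert largest == len(outer_train) - 1
    # Positions can be mapped back to subject indices when needed.
    first_train, first_test = inner_splits[0]
    inner_subjects = [outer_train[p] for p in first_train]
    assert outer_test[0] not in inner_subjects
print("outer folds:", n_subjects, "inner folds per outer fold:", len(inner_splits))
```

With the `sklearn.model_selection` module (0.18 and later), passing `LeaveOneOut()` as `cv` to both `GridSearchCV` and `cross_val_score` should give the same re-indexing without hand-built index lists, since the inner search is only handed the outer training rows. Note also that in that module the former `LeaveOneLabelOut` goes by the name `LeaveOneGroupOut`.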
URL:

From ludo25_90 at hotmail.com Mon Dec 5 08:39:40 2016
From: ludo25_90 at hotmail.com (Ludovico Coletta)
Date: Mon, 5 Dec 2016 13:39:40 +0000
Subject: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit
In-Reply-To:
References:
Message-ID:

Unfortunately, it did not work. I think I am doing something wrong when passing the nested CV, but I do not understand where. If I omit the cv argument in the grid search, it runs smoothly. I would like to have LeaveOneOut in both the outer and inner CV. How would you implement such a thing?

Best
Ludovico

________________________________
From: scikit-learn on behalf of scikit-learn-request at python.org
Sent: Sunday, 4 December 2016 22:27
To: scikit-learn at python.org
Subject: scikit-learn Digest, Vol 9, Issue 13

Send scikit-learn mailing list submissions to
    scikit-learn at python.org

To subscribe or unsubscribe via the World Wide Web, visit
    https://mail.python.org/mailman/listinfo/scikit-learn
or, via email, send a message with subject or body 'help' to
    scikit-learn-request at python.org

You can reach the person managing the list at
    scikit-learn-owner at python.org

When replying, please edit your Subject line so it is more specific than "Re: Contents of scikit-learn digest..."

Today's Topics:

1. Nested Leave One Subject Out (LOSO) cross validation with scikit (Ludovico Coletta)
2. Re: Adding samplers for intersection/Jensen-Shannon kernels (avn at mccme.ru)
3.
Re: Nested Leave One Subject Out (LOSO) cross validation with scikit (Raghav R V) ---------------------------------------------------------------------- Message: 1 Date: Sun, 4 Dec 2016 20:12:29 +0000 From: Ludovico Coletta To: "scikit-learn at python.org" Subject: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit Message-ID: Content-Type: text/plain; charset="iso-8859-1" Dear scikit experts, I'm struggling with the implementation of a nested cross validation. My data: I have 26 subjects (13 per class) x 6670 features. I used a feature reduction algorithm (you may have heard about Boruta) to reduce the dimensionality of my data. Problems start now: I defined LOSO as outer partitioning schema. Therefore, for each of the 26 cv folds I used 24 subjects for feature reduction. This lead to a different number of features in each cv fold. Now, for each cv fold I would like to use the same 24 subjects for hyperparameter optimization (SVM with rbf kernel). This is what I did: cv = list(LeaveOneout(len(y))) # in y I stored the labels inner_train = [None] * len(y) inner_test = [None] * len(y) ii = 0 while ii < len(y): cv = list(LeaveOneOut(len(y))) a = cv[ii][0] a = a[:-1] inner_train[ii] = a b = cv[ii][0] b = np.array(b[((len(cv[0][0]))-1)]) inner_test[ii]=b ii = ii + 1 custom_cv = zip(inner_train,inner_test) # inner cv pipe_logistic = Pipeline([('scl', StandardScaler()),('clf', SVC(kernel="rbf"))]) parameters = [{'clf__C': np.logspace(-2, 10, 13), 'clf__gamma':np.logspace(-9, 3, 13)}] scores = [None] * (len(y)) ii = 0 while ii < len(scores): a = data[ii][0] # data for train b = data[ii][1] # data for test c = np.concatenate((a,b)) # shape: number of subjects * number of features d = cv[ii][0] # labels for train e = cv[ii][1] # label for test f = np.concatenate((d,e)) grid_search = GridSearchCV(estimator=pipe_logistic, param_grid=parameters, verbose=1, scoring='accuracy', cv= zip(([custom_cv[ii][0]]), ([custom_cv[ii][1]]))) scores[ii] = 
cross_validation.cross_val_score(grid_search, c, y[f], scoring='accuracy', cv = zip(([cv[ii][0]]), ([cv[ii][1]]))) ii = ii + 1 However, I got the following error message: index 25 is out of bounds for size 25 Would it be so bad if I do not perform a nested LOSO but I use the default setting for hyperparameter optimization? Any help would be really appreciated -------------- next part -------------- An HTML attachment was scrubbed... URL: ------------------------------ Message: 2 Date: Sun, 04 Dec 2016 23:50:21 +0300 From: avn at mccme.ru To: Scikit-learn user and developer mailing list Subject: Re: [scikit-learn] Adding samplers for intersection/Jensen-Shannon kernels Message-ID: <0511d5fa33737f78ccdf7fbb2e5b2156 at mccme.ru> Content-Type: text/plain; charset=UTF-8; format=flowed I see now. So I'll proceed with adding documentation and unit tests for those kernels to complete their support. And I don't think they're too specialized, given that many kinds of feature vectors in e.g. computer vision are in fact histograms and all of those kernels are histogram-oriented. Andy ????? 2016-12-04 00:23: > Hi Valery. > I didn't include them because the Chi2 worked better for my task ;) > In hindsight, I'm not sure if these kernels are not to a bit too > specialized for scikit-learn. > But given that we have the (slightly more obscure) SkewedChi2 and > AdditiveChi2, > I think the intersection one would be a good addition if you found it > useful. > > Andy > > On 12/03/2016 03:39 PM, Valery Anisimovsky via scikit-learn wrote: >> Hello, >> >> In the course of my work, I've made samplers for >> intersection/Jensen-Shannon kernels, just by small modifications to >> sklearn.kernel_approximation.AdditiveChi2Sampler code. Intersection >> kernel proved to be the best one for my task (clustering Docstrum >> feature vectors), so perhaps it'd be good to add those samplers >> alongside AdditiveChi2Sampler? Should I proceed with creating a pull >> request? 
Or, perhaps, those kernels were not already included for some >> good reason? >> >> With best regards, >> -- Valery >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn scikit-learn Info Page - Python mail.python.org To see the collection of prior postings to the list, visit the scikit-learn Archives. Using scikit-learn: To post a message to all the list members ... > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn scikit-learn Info Page - Python mail.python.org To see the collection of prior postings to the list, visit the scikit-learn Archives. Using scikit-learn: To post a message to all the list members ... ------------------------------ Message: 3 Date: Sun, 4 Dec 2016 22:27:02 +0100 From: Raghav R V To: Scikit-learn user and developer mailing list Subject: Re: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit Message-ID: Content-Type: text/plain; charset="utf-8" Hi! It looks like you are using the old sklearn.cross_validation's LeaveOneLabelOut cross-validator. It has been deprecated since v0.18. Use the LeaveOneLabelOut from sklearn.model_selection, that should fix your issue I think (thought I have not looked into your code in detail). HTH! On Sun, Dec 4, 2016 at 9:12 PM, Ludovico Coletta wrote: > Dear scikit experts, > > I'm struggling with the implementation of a nested cross validation. > > My data: I have 26 subjects (13 per class) x 6670 features. I used a > feature reduction algorithm (you may have heard about Boruta) to reduce the > dimensionality of my data. Problems start now: I defined LOSO as outer > partitioning schema. Therefore, for each of the 26 cv folds I used 24 > subjects for feature reduction. This lead to a different number of features > in each cv fold. 
Now, for each cv fold I would like to use the same 24 > subjects for hyperparameter optimization (SVM with rbf kernel). > > This is what I did: > > *cv = list(LeaveOneout(len(y))) # in y I stored the labels* > > *inner_train = [None] * len(y)* > > *inner_test = [None] * len(y)* > > *ii = 0* > > *while ii < len(y):* > * cv = list(LeaveOneOut(len(y))) * > * a = cv[ii][0]* > * a = a[:-1]* > * inner_train[ii] = a* > > * b = cv[ii][0]* > * b = np.array(b[((len(cv[0][0]))-1)])* > * inner_test[ii]=b* > > * ii = ii + 1* > > *custom_cv = zip(inner_train,inner_test) # inner cv* > > > *pipe_logistic = Pipeline([('scl', StandardScaler()),('clf', > SVC(kernel="rbf"))])* > > *parameters = [{'clf__C': np.logspace(-2, 10, 13), > 'clf__gamma':np.logspace(-9, 3, 13)}]* > > > > *scores = [None] * (len(y)) * > > *ii = 0* > > *while ii < len(scores):* > > * a = data[ii][0] # data for train* > * b = data[ii][1] # data for test* > * c = np.concatenate((a,b)) # shape: number of subjects * number of > features* > * d = cv[ii][0] # labels for train* > * e = cv[ii][1] # label for test* > * f = np.concatenate((d,e))* > > * grid_search = GridSearchCV(estimator=pipe_logistic, > param_grid=parameters, verbose=1, scoring='accuracy', cv= > zip(([custom_cv[ii][0]]), ([custom_cv[ii][1]])))* > > * scores[ii] = cross_validation.cross_val_score(grid_search, c, y[f], > scoring='accuracy', cv = zip(([cv[ii][0]]), ([cv[ii][1]])))* > > * ii = ii + 1* > > > > However, I got the following error message: index 25 is out of bounds for > size 25 > > Would it be so bad if I do not perform a nested LOSO but I use the default > setting for hyperparameter optimization? > > Any help would be really appreciated > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn scikit-learn Info Page - Python mail.python.org To see the collection of prior postings to the list, visit the scikit-learn Archives. 
Using scikit-learn: To post a message to all the list members ... > > -- Raghav RV https://github.com/raghavrv [https://avatars2.githubusercontent.com/u/9487348?v=3&s=400] raghavrv (Raghav RV) ? GitHub github.com raghavrv has 18 repositories available. Follow their code on GitHub. -------------- next part -------------- An HTML attachment was scrubbed... URL: ------------------------------ Subject: Digest Footer _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn scikit-learn Info Page - Python mail.python.org To see the collection of prior postings to the list, visit the scikit-learn Archives. Using scikit-learn: To post a message to all the list members ... ------------------------------ End of scikit-learn Digest, Vol 9, Issue 13 ******************************************* -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Mon Dec 5 08:45:52 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Mon, 5 Dec 2016 14:45:52 +0100 Subject: [scikit-learn] Markov Clustering? In-Reply-To: References: Message-ID: <20161205134552.GK2327874@phare.normalesup.org> Interestingly, a couple of days before this thread was started a researcher in a top lab of a huge private-sector company had mentionned to me that they found this algorithm very useful in practice (sorry for taking time to point this out, I just needed to check with him that indeed it was this specific algorithm). G On Sun, Dec 04, 2016 at 08:18:54AM +0000, Raphael C wrote: > I think you get a better view of the importance of Markov Clustering in > academia from https://scholar.google.co.uk/scholar?hl=en&as_sdt=0,5&q= > Markov+clustering . 
> Raphael

> On Sat, 3 Dec 2016 at 22:43 Allan Visochek wrote:
> Thanks for pointing that out, I sort of picked it up by word of mouth so
> I'd assumed it had a bit more precedence in the academic world.
> I'll look into it a little more, but I'd definitely be interested in
> contributing something else if that doesn't work out.
> -Allan

> On Sat, Dec 3, 2016 at 4:45 PM, Andy wrote:
> Hey Allan.
> None of the references apart from the last one seems to be published in
> a peer-reviewed place, is that right?
> And "A stochastic uncoupling process for graphs" has 13 citations since
> 2000. Unless there is a more prominent
> publication or evidence of heavy use, I think it's disqualified.
> Academia is certainly not the only metric for evaluation, so if you
> have others, that's good, too ;)
> Best,
> Andy

> On 12/03/2016 04:33 PM, Allan Visochek wrote:
> Hey Andy,
> This algorithm does operate on sparse graphs so it may be beyond
> the scope of scikit-learn, let me know what you think.
> The website is here, it includes a brief description of how the
> algorithm operates under Documentation -> Overview1 and Overview2.
> The references listed on the website are included below.
> Best,
> -Allan

> [1] Stijn van Dongen. Graph Clustering by Flow Simulation. PhD
> thesis, University of Utrecht, May 2000.
> http://www.library.uu.nl/digiarchief/dip/diss/1895620/inhoud.htm
> [2] Stijn van Dongen. A cluster algorithm for graphs. Technical
> Report INS-R0010, National Research Institute for Mathematics and
> Computer Science in the Netherlands, Amsterdam, May 2000.
> http://www.cwi.nl/ftp/CWIreports/INS/INS-R0010.ps.Z
> [3] Stijn van Dongen. A stochastic uncoupling process for graphs.
> Technical Report INS-R0011, National Research Institute for
> Mathematics and Computer Science in the Netherlands, Amsterdam, May
> 2000.
> http://www.cwi.nl/ftp/CWIreports/INS/INS-R0011.ps.Z
> [4] Stijn van Dongen. Performance criteria for graph clustering and
> Markov cluster experiments. Technical Report INS-R0012, National
> Research Institute for Mathematics and Computer Science in the
> Netherlands, Amsterdam, May 2000.
> http://www.cwi.nl/ftp/CWIreports/INS/INS-R0012.ps.Z
> [5] Enright A.J., Van Dongen S., Ouzounis C.A. An efficient
> algorithm for large-scale detection of protein families, Nucleic
> Acids Research 30(7):1575-1584 (2002).

> On Sat, Dec 3, 2016 at 3:34 PM, Andy wrote:
> Hi Allan.
> Can you provide the original paper?
> Is this something usually used on sparse graphs? We do have
> algorithms that operate on data-induced
> graphs, like SpectralClustering, but we don't really implement
> general graph algorithms (there's no PageRank or community
> detection).
> Andy

> On 12/03/2016 12:19 PM, Allan Visochek wrote:
> Hi there,
> My name is Allan Visochek, I'm a data scientist and web
> developer and I love scikit-learn so first of all, thanks
> so much for the work that you do.
> I'm reaching out because I've found the markov clustering
> algorithm to be quite useful for me in some of my work and
> noticed that there is no implementation in scikit-learn, is
> anybody working on this? If not, I'd be happy to take this
> on. I'm new to open source, but I've been working with
> python for a few years now.
> Best,
> -Allan

> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

--
Gael Varoquaux
Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info http://twitter.com/GaelVaroquaux

From drraph at gmail.com Mon Dec 5 08:51:26 2016
From: drraph at gmail.com (Raphael C)
Date: Mon, 5 Dec 2016 13:51:26 +0000
Subject: [scikit-learn] Markov Clustering?
In-Reply-To: <20161205134552.GK2327874@phare.normalesup.org>
References: <20161205134552.GK2327874@phare.normalesup.org>
Message-ID:

And...

[1] Stijn van Dongen. Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, May 2000. http://www.library.uu.nl/digiarchief/dip/diss/1895620/inhoud.htm

has 1201 citations. I think it's fair to say the method is very widely known and used.
Raphael

On 5 December 2016 at 13:45, Gael Varoquaux wrote:
> Interestingly, a couple of days before this thread was started a
> researcher in a top lab of a huge private-sector company had mentioned
> to me that they found this algorithm very useful in practice (sorry for
> taking time to point this out, I just needed to check with him that
> indeed it was this specific algorithm).
>
> G

From t3kcit at gmail.com Mon Dec 5 08:51:47 2016
From: t3kcit at gmail.com (Andy)
Date: Mon, 5 Dec 2016 08:51:47 -0500
Subject: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit
In-Reply-To: References: Message-ID: <2864ea23-e6ca-cf83-599f-f8ec149e8d67@gmail.com>

On 12/04/2016 04:27 PM, Raghav R V wrote:
> Hi!
>
> It looks like you are using the old sklearn.cross_validation's
> LeaveOneLabelOut cross-validator. It has been deprecated since v0.18.
>
> Use the LeaveOneLabelOut from sklearn.model_selection, that should
> fix your issue I think (though I have not looked into your code in
> detail).
>
You mean LeaveOneGroupOut, right?

From t3kcit at gmail.com Mon Dec 5 08:54:01 2016
From: t3kcit at gmail.com (Andy)
Date: Mon, 5 Dec 2016 08:54:01 -0500
Subject: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit
In-Reply-To: References: Message-ID: <6e970af0-faeb-bf81-3e9f-28dcc5df9168@gmail.com>

I'm not sure what the issue with your custom CV is, but this seems like a complicated way to implement this. Try model_selection.LeaveOneGroupOut, which directly implements LOSO.

On 12/04/2016 03:12 PM, Ludovico Coletta wrote:
> Dear scikit experts,
>
> I'm struggling with the implementation of a nested cross validation.
>
> My data: I have 26 subjects (13 per class) x 6670 features. I used a
> feature reduction algorithm (you may have heard about Boruta) to
> reduce the dimensionality of my data. Problems start now: I defined
> LOSO as outer partitioning schema. Therefore, for each of the 26 cv
> folds I used 24 subjects for feature reduction. This lead to a
> different number of features in each cv fold. Now, for each cv fold I
> would like to use the same 24 subjects for hyperparameter optimization
> (SVM with rbf kernel).
>
> This is what I did:
>
> cv = list(LeaveOneout(len(y))) # in y I stored the labels
> inner_train = [None] * len(y)
> inner_test = [None] * len(y)
> ii = 0
> while ii < len(y):
>     cv = list(LeaveOneOut(len(y)))
>     a = cv[ii][0]
>     a = a[:-1]
>     inner_train[ii] = a
>     b = cv[ii][0]
>     b = np.array(b[((len(cv[0][0]))-1)])
>     inner_test[ii]=b
>     ii = ii + 1
> custom_cv = zip(inner_train,inner_test) # inner cv
>
> pipe_logistic = Pipeline([('scl', StandardScaler()),('clf',
> SVC(kernel="rbf"))])
> parameters = [{'clf__C': np.logspace(-2, 10, 13),
> 'clf__gamma':np.logspace(-9, 3, 13)}]
>
> scores = [None] * (len(y))
> ii = 0
> while ii < len(scores):
>     a = data[ii][0] # data for train
>     b = data[ii][1] # data for test
>     c = np.concatenate((a,b)) # shape: number of subjects * number of
> features
>     d = cv[ii][0] # labels for train
>     e = cv[ii][1] # label for test
>     f = np.concatenate((d,e))
>     grid_search = GridSearchCV(estimator=pipe_logistic,
> param_grid=parameters, verbose=1, scoring='accuracy', cv=
> zip(([custom_cv[ii][0]]), ([custom_cv[ii][1]])))
>     scores[ii] = cross_validation.cross_val_score(grid_search, c,
> y[f], scoring='accuracy', cv = zip(([cv[ii][0]]), ([cv[ii][1]])))
>     ii = ii + 1
>
> However, I got the following error message: index 25 is out of bounds
> for size 25
>
> Would it be so bad if I do not perform a nested LOSO but I use the
> default setting for hyperparameter optimization?
>
> Any help would be really appreciated
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
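
A minimal sketch of the nested leave-one-subject-out scheme discussed in this thread, assuming scikit-learn 0.18 or later, a NumPy feature matrix X, and an integer subject identifier per row in a groups array. The helper name nested_loso_scores, the synthetic data, the reduced parameter grid, and the SelectKBest step (a stand-in for Boruta, which is not part of scikit-learn) are illustrative assumptions, not the poster's code. Placing the selector inside the pipeline lets every fold pick its own features from its own training subjects, and the outer loop is written by hand so that the inner search receives the subject labels of the training rows only.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, LeaveOneGroupOut
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def nested_loso_scores(X, y, groups, param_grid, k_features):
    """Return one test accuracy per left-out subject (hypothetical helper)."""
    scores = []
    outer = LeaveOneGroupOut()
    for train_idx, test_idx in outer.split(X, y, groups):
        # Scaling, feature selection and the SVM are refit inside every fold.
        pipe = Pipeline([
            ("scl", StandardScaler()),
            ("sel", SelectKBest(f_classif, k=k_features)),  # stand-in for Boruta
            ("clf", SVC(kernel="rbf")),
        ])
        # Inner loop: leave one of the remaining subjects out per split.
        search = GridSearchCV(pipe, param_grid, scoring="accuracy",
                              cv=LeaveOneGroupOut())
        search.fit(X[train_idx], y[train_idx], groups=groups[train_idx])
        # Outer loop: score the refit best model on the held-out subject.
        scores.append(search.score(X[test_idx], y[test_idx]))
    return np.array(scores)


# Synthetic stand-in data: 8 subjects, one row each, two shifted classes.
rng = np.random.RandomState(0)
n_subjects, n_features = 8, 5
groups = np.arange(n_subjects)
y = groups % 2
X = rng.normal(size=(n_subjects, n_features)) + 3.0 * (2 * y[:, None] - 1)

param_grid = {"clf__C": [1.0, 10.0], "clf__gamma": [0.01, 0.1]}
scores = nested_loso_scores(X, y, groups, param_grid, k_features=3)
print(scores.mean())
```

With one row per subject, LeaveOneGroupOut produces the same splits as LeaveOneOut; the groups array starts to matter once a subject contributes several rows.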
From t3kcit at gmail.com Mon Dec 5 08:57:08 2016
From: t3kcit at gmail.com (Andy)
Date: Mon, 5 Dec 2016 08:57:08 -0500
Subject: [scikit-learn] Markov Clustering?
In-Reply-To: References: <20161205134552.GK2327874@phare.normalesup.org> Message-ID:

On 12/05/2016 08:51 AM, Raphael C wrote:
> And...
>
> [1] Stijn van Dongen. Graph Clustering by Flow Simulation. PhD
> thesis, University of Utrecht, May 2000.
> http://www.library.uu.nl/digiarchief/dip/diss/1895620/inhoud.htm
>
> has
>
> 1201 citations.
>
> I think it's fair to say the method is very widely known and used.
>
Ok cool. I haven't looked at it. My question is now whether this is more of a "graph clustering" or a "data clustering" approach, though that distinction is not very clear. Some of the papers compare it against affinity propagation, which we do have implemented. If this algorithm makes sense for knn graphs or similar methods we implemented in SpectralClustering, then I guess go for it?

From ragvrv at gmail.com Mon Dec 5 09:19:55 2016
From: ragvrv at gmail.com (Raghav R V)
Date: Mon, 5 Dec 2016 15:19:55 +0100
Subject: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit
In-Reply-To: <2864ea23-e6ca-cf83-599f-f8ec149e8d67@gmail.com>
References: <2864ea23-e6ca-cf83-599f-f8ec149e8d67@gmail.com>
Message-ID:

Ah yes, sorry, LeaveOneGroupOut indeed!

Also refer to this example for nested cv - http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html

Thx!

On Mon, Dec 5, 2016 at 2:51 PM, Andy wrote:
>
> On 12/04/2016 04:27 PM, Raghav R V wrote:
>
>> Hi!
>>
>> It looks like you are using the old sklearn.cross_validation's
>> LeaveOneLabelOut cross-validator. It has been deprecated since v0.18.
>>
>> Use the LeaveOneLabelOut from sklearn.model_selection, that should
>> fix your issue I think (though I have not looked into your code in detail).
>>
> You mean LeaveOneGroupOut, right?
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>

--
Raghav RV
https://github.com/raghavrv

-------------- next part --------------
An HTML attachment was scrubbed...

From ludo25_90 at hotmail.com Mon Dec 5 09:42:47 2016
From: ludo25_90 at hotmail.com (Ludovico Coletta)
Date: Mon, 5 Dec 2016 14:42:47 +0000
Subject: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit
In-Reply-To: References: Message-ID:

Thank you for the quick answer!

The problem is that I have a different number of features for each cv fold, so I thought that I had to handle each cv fold separately. I did as you suggested, but the training set of the outer cv is then further split 3 times (stratified k-fold), which I think is suboptimal (for feature selection I did implement a nested LOSO). One question: would it be so bad if I had nested LOSO for feature selection but the default stratified k-fold for hyperparameter optimization? It would be some kind of double dipping in the nested cv, but the final set left out for testing is not affected.

The other point is that maybe I got something wrong in the whole process. I have 26 subjects.

CV 1: subject 26 is left out for the final test, subjects 1:24 are used for hyperparameter optimization, subject 25 is used to select the best hyperparameters.
CV 2: subject 1 is left out for the final test, subjects 2:25 are used for hyperparameter optimization, subject 26 is used to select the best hyperparameters.

And so on until the end. Is that correct?

Sorry for the trivial questions, but I am quite a beginner with both Python and ML.

Best
Ludovico

________________________________
From: scikit-learn on behalf of scikit-learn-request at python.org
Sent: Monday, 5 December 2016 14:54
To: scikit-learn at python.org
Subject: scikit-learn Digest, Vol 9, Issue 15

Today's Topics:
1. Re: Markov Clustering? (Gael Varoquaux)
2. Re: Markov Clustering? (Raphael C)
3. Re: Nested Leave One Subject Out (LOSO) cross validation with scikit (Andy)
4. Re: Nested Leave One Subject Out (LOSO) cross validation with scikit (Andy)

----------------------------------------------------------------------
Message: 1
Date: Mon, 5 Dec 2016 14:45:52 +0100
From: Gael Varoquaux
To: Scikit-learn user and developer mailing list
Subject: Re: [scikit-learn] Markov Clustering?
Message-ID: <20161205134552.GK2327874 at phare.normalesup.org>
Content-Type: text/plain; charset=iso-8859-1

Interestingly, a couple of days before this thread was started a researcher in a top lab of a huge private-sector company had mentioned to me that they found this algorithm very useful in practice (sorry for taking time to point this out, I just needed to check with him that indeed it was this specific algorithm).

G

On Sun, Dec 04, 2016 at 08:18:54AM +0000, Raphael C wrote:
> I think you get a better view of the importance of Markov Clustering in
> academia from https://scholar.google.co.uk/scholar?hl=en&as_sdt=0,5&q=Markov+clustering .
------------------------------

Message: 2
Date: Mon, 5 Dec 2016 13:51:26 +0000
From: Raphael C
To: Scikit-learn user and developer mailing list
Subject: Re: [scikit-learn] Markov Clustering?
Message-ID:
Content-Type: text/plain; charset=UTF-8

And...

[1] Stijn van Dongen. Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, May 2000.
http://www.library.uu.nl/digiarchief/dip/diss/1895620/inhoud.htm

has 1201 citations. I think it's fair to say the method is very widely known and used.

Raphael

On 5 December 2016 at 13:45, Gael Varoquaux wrote:
> Interestingly, a couple of days before this thread was started, a
> researcher in a top lab of a huge private-sector company had mentioned
> to me that they found this algorithm very useful in practice (sorry for
> taking time to point this out, I just needed to check with him that
> indeed it was this specific algorithm).
>
> G
>
> On Sun, Dec 04, 2016 at 08:18:54AM +0000, Raphael C wrote:
>> I think you get a better view of the importance of Markov Clustering in
>> academia from https://scholar.google.co.uk/scholar?hl=en&as_sdt=0,5&q=Markov+clustering .
>
>> Raphael
>
>> On Sat, 3 Dec 2016 at 22:43 Allan Visochek wrote:
>
>> Thanks for pointing that out, I sort of picked it up by word of mouth so
>> I'd assumed it had a bit more precedence in the academic world.
>
>> I'll look into it a little more, but I'd definitely be interested in
>> contributing something else if that doesn't work out.
>
>> -Allan
>
>> On Sat, Dec 3, 2016 at 4:45 PM, Andy wrote:
>
>> Hey Allan.
>
>> None of the references apart from the last one seems to be published in
>> a peer-reviewed place, is that right?
>> And "A stochastic uncoupling process for graphs" has 13 citations since >> 2000. Unless there is a more prominent >> publication or evidence of heavy use, I think it's disqualified. >> Academia is certainly not the only metric for evaluation, so if you >> have others, that's good, too ;) > >> Best, >> Andy > >> On 12/03/2016 04:33 PM, Allan Visochek wrote: > >> Hey Andy, > >> This algorithm does operate on sparse graphs so it may be beyond >> the scope of sci-kit learn, let me know what you think. >> The website is here, it includes a brief description of how the >> algorithm operates under Documentation -> Overview1 and Overview2. >> The references listed on the website are included below. > >> Best, >> -Allan > > >> [1] Stijn van Dongen. Graph Clustering by Flow Simulation. PhD >> thesis, University of Utrecht, May 2000. >> http://www.library.uu.nl/digiarchief/dip/diss/1895620/inhoud.htm > >> [2] Stijn van Dongen. A cluster algorithm for graphs. Technical >> Report INS-R0010, National Research Institute for Mathematics and >> Computer Science in the Netherlands, Amsterdam, May 2000. >> http://www.cwi.nl/ftp/CWIreports/INS/INS-R0010.ps.Z > >> [3] Stijn van Dongen. A stochastic uncoupling process for graphs. >> Technical Report INS-R0011, National Research Institute for >> Mathematics and Computer Science in the Netherlands, Amsterdam, May >> 2000. >> http://www.cwi.nl/ftp/CWIreports/INS/INS-R0011.ps.Z > >> [4] Stijn van Dongen. Performance criteria for graph clustering and >> Markov cluster experiments. Technical Report INS-R0012, National >> Research Institute for Mathematics and Computer Science in the >> Netherlands, Amsterdam, May 2000. >> http://www.cwi.nl/ftp/CWIreports/INS/INS-R0012.ps.Z > >> [5] Enright A.J., Van Dongen S., Ouzounis C.A. An efficient >> algorithm for large-scale detection of protein families, Nucleic >> Acids Research 30(7):1575-1584 (2002). > > >> On Sat, Dec 3, 2016 at 3:34 PM, Andy wrote: > >> Hi Allan. 
>> Can you provide the original paper? >> It this something usually used on sparse graphs? We do have >> algorithms that operate on data-induced >> graphs, like SpectralClustering, but we don't really implement >> general graph algorithms (there's no PageRank or community >> detection). > >> Andy > > >> On 12/03/2016 12:19 PM, Allan Visochek wrote: > >> Hi there, > >> My name is Allan Visochek, I'm a data scientist and web >> developer and I love scikit-learn so first of all, thanks >> so much for the work that you do. > >> I'm reaching out because I've found the markov clustering >> algorithm to be quite useful for me in some of my work and >> noticed that there is no implementation in scikit-learn, is >> anybody working on this? If not, id be happy to take this >> on. I'm new to open source, but I've been working with >> python for a few years now. > >> Best, >> -Allan > > > >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn scikit-learn Info Page - Python mail.python.org To see the collection of prior postings to the list, visit the scikit-learn Archives. Using scikit-learn: To post a message to all the list members ... > >> _______________________________________________ scikit-learn >> mailing list scikit-learn at python.org https://mail.python.org/ mail.python.org Mailing Lists mail.python.org mail.python.org Mailing Lists: Welcome! Below is a listing of all the public mailing lists on mail.python.org. Click on a list name to get more information ... >> mailman/listinfo/scikit-learn > >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn scikit-learn Info Page - Python mail.python.org To see the collection of prior postings to the list, visit the scikit-learn Archives. Using scikit-learn: To post a message to all the list members ... 
>
> --
> Gael Varoquaux
> Researcher, INRIA Parietal
> NeuroSpin/CEA Saclay, Bat 145, 91191 Gif-sur-Yvette France
> Phone: ++ 33-1-69-08-79-68
> http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux
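For readers unfamiliar with the algorithm under discussion: Markov clustering (van Dongen's MCL) alternates an expansion step (powers of a column-stochastic matrix) with an inflation step (element-wise powers followed by re-normalisation) until the matrix stops changing. The following is only a minimal dense NumPy sketch of that loop, not the reference `mcl` implementation and not a scikit-learn API; the function name `markov_cluster`, its parameters, and the toy graph are assumptions made for illustration.

```python
import numpy as np


def markov_cluster(adjacency, expansion=2, inflation=2.0, max_iter=100, tol=1e-6):
    """Toy Markov clustering on a dense, symmetric, non-negative adjacency matrix."""
    A = np.asarray(adjacency, dtype=float)
    M = A + np.eye(A.shape[0])              # add self-loops
    M = M / M.sum(axis=0, keepdims=True)    # make every column sum to 1
    for _ in range(max_iter):
        previous = M
        M = np.linalg.matrix_power(M, expansion)  # expansion: flow spreads along paths
        M = M ** inflation                        # inflation: strong flows are boosted
        M = M / M.sum(axis=0, keepdims=True)      # re-normalise the columns
        if np.abs(M - previous).max() < tol:
            break
    # Each row that still carries flow lists the members of one cluster.
    clusters = set()
    for row in M:
        members = tuple(int(i) for i in np.flatnonzero(row > 1e-8))
        if members:
            clusters.add(members)
    return sorted(clusters)


# Two triangles (0-1-2 and 3-4-5) joined by a single bridge edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
graph = np.zeros((6, 6))
for i, j in edges:
    graph[i, j] = graph[j, i] = 1.0

clusters = markov_cluster(graph)
print(clusters)
```

A practical implementation would keep the matrix sparse and prune small entries after each inflation step; the dense version above is only meant to show the two alternating operations.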
------------------------------

Message: 3
Date: Mon, 5 Dec 2016 08:51:47 -0500
From: Andy
To: Scikit-learn user and developer mailing list
Subject: Re: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit
Message-ID: <2864ea23-e6ca-cf83-599f-f8ec149e8d67 at gmail.com>
Content-Type: text/plain; charset=windows-1252; format=flowed

On 12/04/2016 04:27 PM, Raghav R V wrote:
> Hi!
>
> It looks like you are using the old sklearn.cross_validation's
> LeaveOneLabelOut cross-validator. It has been deprecated since v0.18.
>
> Use the LeaveOneLabelOut from sklearn.model_selection, that should
> fix your issue I think (though I have not looked into your code in
> detail).
>
You mean LeaveOneGroupOut, right?

------------------------------

Message: 4
Date: Mon, 5 Dec 2016 08:54:01 -0500
From: Andy
To: Scikit-learn user and developer mailing list
Subject: Re: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit
Message-ID: <6e970af0-faeb-bf81-3e9f-28dcc5df9168 at gmail.com>
Content-Type: text/plain; charset="windows-1252"; Format="flowed"

I'm not sure what the issue with your custom CV is, but this seems like a complicated way to implement this. Try model_selection.LeaveOneGroupOut, which directly implements LOSO.

On 12/04/2016 03:12 PM, Ludovico Coletta wrote:
> Dear scikit experts,
>
> I'm struggling with the implementation of a nested cross validation.
>
> My data: I have 26 subjects (13 per class) x 6670 features. I used a
> feature reduction algorithm (you may have heard about Boruta) to
> reduce the dimensionality of my data.
> Problems start now: I defined LOSO as outer partitioning schema.
> Therefore, for each of the 26 cv folds I used 24 subjects for feature
> reduction. This led to a different number of features in each cv fold.
> Now, for each cv fold I would like to use the same 24 subjects for
> hyperparameter optimization (SVM with rbf kernel).
>
> This is what I did:
>
> cv = list(LeaveOneOut(len(y)))  # in y I stored the labels
>
> inner_train = [None] * len(y)
> inner_test = [None] * len(y)
>
> ii = 0
> while ii < len(y):
>     cv = list(LeaveOneOut(len(y)))
>     a = cv[ii][0]
>     a = a[:-1]
>     inner_train[ii] = a
>
>     b = cv[ii][0]
>     b = np.array(b[((len(cv[0][0]))-1)])
>     inner_test[ii] = b
>
>     ii = ii + 1
>
> custom_cv = zip(inner_train, inner_test)  # inner cv
>
> pipe_logistic = Pipeline([('scl', StandardScaler()), ('clf', SVC(kernel="rbf"))])
>
> parameters = [{'clf__C': np.logspace(-2, 10, 13), 'clf__gamma': np.logspace(-9, 3, 13)}]
>
> scores = [None] * (len(y))
>
> ii = 0
> while ii < len(scores):
>     a = data[ii][0]  # data for train
>     b = data[ii][1]  # data for test
>     c = np.concatenate((a, b))  # shape: number of subjects * number of features
>     d = cv[ii][0]  # labels for train
>     e = cv[ii][1]  # label for test
>     f = np.concatenate((d, e))
>
>     grid_search = GridSearchCV(estimator=pipe_logistic, param_grid=parameters, verbose=1, scoring='accuracy', cv=zip(([custom_cv[ii][0]]), ([custom_cv[ii][1]])))
>
>     scores[ii] = cross_validation.cross_val_score(grid_search, c, y[f], scoring='accuracy', cv=zip(([cv[ii][0]]), ([cv[ii][1]])))
>
>     ii = ii + 1
>
> However, I got the following error message: index 25 is out of bounds
> for size 25
>
> Would it be so bad if I do not perform a nested LOSO but I use the
> default setting for hyperparameter optimization?
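A sketch of how the nested leave-one-subject-out scheme asked about here could be expressed with `LeaveOneGroupOut` from `sklearn.model_selection`, using one group id per subject. The data are random stand-ins, the parameter grid is deliberately tiny, and names such as `groups` and `outer_scores` are assumptions, not the poster's code. Any per-fold feature reduction would have to be applied to `X[train_idx]` inside the outer loop.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, LeaveOneGroupOut
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Random stand-in data: one row per subject, so each subject is its own group.
rng = np.random.RandomState(0)
n_subjects = 8
X = rng.randn(n_subjects, 5)
y = np.array([0, 1] * (n_subjects // 2))
groups = np.arange(n_subjects)

pipe = Pipeline([('scl', StandardScaler()), ('clf', SVC(kernel='rbf'))])
param_grid = {'clf__C': [0.1, 1, 10], 'clf__gamma': [0.01, 0.1]}

outer_scores = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    # Inner loop: leave one of the remaining subjects out for each grid point.
    search = GridSearchCV(pipe, param_grid, scoring='accuracy', cv=LeaveOneGroupOut())
    search.fit(X[train_idx], y[train_idx], groups=groups[train_idx])
    # Outer loop: score the refitted best model on the held-out subject.
    outer_scores.append(search.score(X[test_idx], y[test_idx]))

print(np.mean(outer_scores))
```

Passing `groups` to `fit` lets the inner splitter see only the subjects of the current outer training set, so the held-out subject never influences hyperparameter selection.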
>
> Any help would be really appreciated

-------------- next part --------------
An HTML attachment was scrubbed...

------------------------------

Subject: Digest Footer

_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn

------------------------------

End of scikit-learn Digest, Vol 9, Issue 15
*******************************************
-------------- next part --------------
An HTML attachment was scrubbed...

From clay at woolam.org Mon Dec 5 16:50:01 2016
From: clay at woolam.org (Clay Woolam)
Date: Mon, 5 Dec 2016 13:50:01 -0800
Subject: [scikit-learn] [semi-supervised learning] Using a pre-existing graph with LabelSpreading API
In-Reply-To:
References:
Message-ID:

Heya, sorry for not responding sooner.

Running those algorithms is expensive (O(n^3) from memory), so that's going to be a big limiting factor. And I worry that your graph may be too big for these algorithms. The max_iter param is certainly available for tuning, trading off the accuracy of the result.

Totally speculating: I don't think sparsifying would help too much with these implementations. These both create fully connected graphs as part of the graph construction step.
I think sparsification would help a lot if you instead directly simulated the particle movements through the graph, instead of using these exact solutions. For #2, what if you subclassed the LabelSpreading class and overrode _build_graph to inject the graph that you set up? May be a big hack. On Thu, Dec 1, 2016 at 7:33 PM, Delip Rao wrote: > Hello, > > I have an existing graph dataset in the edge format: > > node_i node_j weight > > The number of nodes are around 3.6M, and the number of edges are around > 72M. > > I also have some labeled data (around a dozen per class with 16 classes in > total), so overall, a perfect setting for label propagation or its > variants. In particular, I want to try the LabelSpreading implementation > for the regularization. I looked at the documentation and can't find a way > to plug in a pre-computed graph (or adjacency matrix). So two questions: > > 1. What are any scaling issues I should be aware of for a dataset of this > size? I can try sparsifying the graph, but would love to learn any knobs I > should be aware of. > 2. How do I plugin an existing weighted graph with the current API? Happy > to use any undocumented features. > > Thanks in advance! > Delip > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... 
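As an alternative to overriding private methods, the iteration from Zhou et al. that label spreading is based on, F <- alpha * S * F + (1 - alpha) * Y with S = D^(-1/2) W D^(-1/2), is short enough to run directly on a precomputed sparse graph given as `node_i node_j weight` triples. The following is a sketch under that assumption: `spread_labels` and its arguments are hypothetical names, not part of scikit-learn, and it has not been tuned for a graph with millions of nodes.

```python
import numpy as np
from scipy import sparse


def spread_labels(rows, cols, weights, n_nodes, seeds, n_classes, alpha=0.8, n_iter=30):
    """Label spreading on an undirected edge list (each edge listed once)."""
    W = sparse.coo_matrix((weights, (rows, cols)), shape=(n_nodes, n_nodes)).tocsr()
    W = W + W.T                                   # symmetrise the weight matrix
    degree = np.asarray(W.sum(axis=1)).ravel()
    inv_sqrt = np.zeros(n_nodes)
    connected = degree > 0                        # isolated nodes keep a zero row
    inv_sqrt[connected] = 1.0 / np.sqrt(degree[connected])
    D = sparse.diags(inv_sqrt)
    S = D @ W @ D                                 # normalised affinity, still sparse
    Y = np.zeros((n_nodes, n_classes))
    for node, label in seeds.items():             # one-hot rows for the labelled nodes
        Y[node, label] = 1.0
    F = Y.copy()
    for _ in range(n_iter):
        F = alpha * (S @ F) + (1.0 - alpha) * Y   # sparse-dense product per iteration
    return F.argmax(axis=1)


# Toy chain 0-1-2 ~ 3-4-5 with a weak link between nodes 2 and 3.
rows = [0, 1, 2, 3, 4]
cols = [1, 2, 3, 4, 5]
weights = [1.0, 1.0, 0.01, 1.0, 1.0]
labels = spread_labels(rows, cols, weights, n_nodes=6, seeds={0: 0, 5: 1}, n_classes=2)
print(labels)
```

Each iteration costs one sparse matrix product over the edges, and the only dense arrays are of shape (n_nodes, n_classes), so no fully connected graph is ever built.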
From joel.nothman at gmail.com Mon Dec 5 21:35:33 2016
From: joel.nothman at gmail.com (Joel Nothman)
Date: Tue, 6 Dec 2016 13:35:33 +1100
Subject: [scikit-learn] Github project management tools
In-Reply-To:
References: <41f0eb1c-c877-0c4d-0f56-9485f57c0eae@gmail.com> <20161203105215.GH455403@phare.normalesup.org> <43b13054-ef73-54b3-0def-d138e814823d@gmail.com>
Message-ID:

With apologies for starting the thread and then disappearing for a while (life got in the way, and when I came back I decided the issue backlog itself was more pressing):

Of late, I mostly operate on a last-in-first-out basis, so I'm highly influenced by recent activity. This minimises communication time and issues due to fading memories of contributors and reviewers. This favours bug fixes, and is less good for long-term feature needs; but that preference also aligns with my ability to get through a small unit of work much more easily than a large conceptual PR. Sometimes I push things onto the stack that I recall I would like to see merged. From a personal perspective, having a way to remind myself of these priorities would be valuable.

I'm not convinced the *Projects* kanban feature is going to readily help us, unfortunately. I don't think the problem is about staging issues so much as just maintaining a prioritised backlog. The Projects feature is poor with *long* lists of issues, and that's where we're at.

Apart from bug fixes, I think there are *two main things that we should keep a list of*:

1. enhancements/features that a few people are agreed we'd like to see, but may get lost in the wash.
2. big ideas that need concentrated design/development work from multiple people. For example: estimator tags, sample properties, feature names, increased pandas support (which incidentally solves the latter two issues!), etc.

What belongs in 1. is very hard to define, but it could be easily maintained through some kind of "Forget me not!" label (alternatively just "high priority").
Managing 2 is more about resolving some epic-scale goals as part of a release plan, and then managing a particular project to achieve design consensus, development tasking and review. While some of this can be managed with labels / kanban boards, this is mostly a procedural issue about establishing virtual sprints: we need a way to design release goals. In terms of *Milestones*: I don't find it useful for us to assign general issues to future milestones. Where version milestones can be useful is to say "include this in a bug-fix release" or "save this for a major release" (i.e. might break backwards compatibility in a big way). These also allow us to avoid postponing releases excessively: we can scope a release 6 weeks before its proposed date, and say that as long as we can merge or eliminate nearly every issue associated with it, we should delay no longer. The other thing that I notice is that it's not always clear who is available to review a particular contribution. GitHub allows one or more *Assignees* (must be team members) to be appointed to an issue / PR. While it is often useful to have more than two sets of eyes, using the Assignees feature may mean that each of us can better focus on a small set of issues. Upon seeing a PR that they think they will have capacity and expertise to review, core devs could assign themselves to its review, with the advantage of minimising duplicated work, while creating a sense of voluntary duty commensurate with each contributor's availability. Apologies that I don't really have a TL;DR or any explicit proposals, but I hope you find my thoughts here useful. Joel On 5 December 2016 at 03:16, Raghav R V wrote: > Okay so in the project, instead of sorting them by Issues / PR why don't >>>> we make one column per priority. Let's have 3 levels and one column for >>>> Done. We have a label for "Stalled" / "Need Contributor" which shows up in >>>> the cards of the project anyway... 
>>>>
>>>> As I didn't want to disturb the existing project setup, I created one
>>>> for a demo - https://github.com/scikit-learn/scikit-learn/projects/7
>>>> (I'm resending this e-mail as the last one was rejected because the
>>>> attached image was huge for the mailing list)
>>>>
>
> Thanks
>
> Raghav
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...

From linjia at ruijie.com.cn Tue Dec 6 06:12:47 2016
From: linjia at ruijie.com.cn (linjia at ruijie.com.cn)
Date: Tue, 6 Dec 2016 11:12:47 +0000
Subject: [scikit-learn] question in using Scikit-learn MLPClassifier?
In-Reply-To:
References: <20161203102926.GG455403@phare.normalesup.org>
Message-ID: <265E382B26F78742B972BE038FD292C6A9F769@fzex2.ruijie.com.cn>

Hi all,

I use the "Car Evaluation" dataset from http://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data to test the effect of MLP. (I converted the categorical values in the data to digits, e.g. "low" to 1, "med" to 2, "high" to 3; the final dataset's input has 6 dimensions and the output label has 4 dimensions.)

However, the accuracy is not satisfactory compared to the result in Matlab, which uses the BP algorithm too. I wonder if I should tune the parameters of the MLP for better results?

Attachment:

main code in Matlab: accuracy 100% after train
net=newff([-1 1;-1 1;-1 1;-1 1;-1 1;-1 1;],[10 4],{'tansig','logsig'},'trainlm');

main code in MLP: accuracy 70% after fit
clf = MLPClassifier(solver='sgd', activation='logistic', max_iter=2000, learning_rate='adaptive', warm_start=True)
-------------- next part --------------
An HTML attachment was scrubbed...
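One way to investigate a gap like this, sketched under assumptions: standardise the inputs, fix `random_state` so that runs are comparable, and compare `loss_` after different `max_iter` budgets to see whether training has converged. The data below are a synthetic stand-in generated with `make_classification` (6 features, 4 classes), not the Car Evaluation set, and `learning_rate_init=0.1` is an arbitrary starting value to vary, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a 6-feature, 4-class problem (an assumption, not car.data).
X, y = make_classification(n_samples=600, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=4, n_clusters_per_class=1,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

losses = {}
scores = {}
for max_iter in (200, 2000):
    model = make_pipeline(
        StandardScaler(),                     # centre features at 0 with unit variance
        MLPClassifier(hidden_layer_sizes=(10,), solver='sgd', activation='logistic',
                      learning_rate='adaptive', learning_rate_init=0.1,
                      max_iter=max_iter, random_state=123),
    )
    model.fit(X_train, y_train)
    # loss_ is the training loss at the last iteration; a large drop between
    # budgets suggests the shorter run had not converged yet.
    losses[max_iter] = model.named_steps['mlpclassifier'].loss_
    scores[max_iter] = model.score(X_test, y_test)
    print(max_iter, losses[max_iter], scores[max_iter])
```

Putting the scaler inside the pipeline means it is fitted on the training split only and then reused unchanged on the test split.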
From gael.varoquaux at normalesup.org Tue Dec 6 06:27:18 2016
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Tue, 6 Dec 2016 12:27:18 +0100
Subject: [scikit-learn] Fwd: Scikit-learn MLPRegressor Help
In-Reply-To:
References: <20161203102926.GG455403@phare.normalesup.org>
Message-ID: <20161206112718.GD2935632@phare.normalesup.org>

On Sat, Dec 03, 2016 at 10:35:03PM +0000, federico vaggi wrote:
> As long as the feature ordering has a meaningful spatial component (as is
> almost always the case when you are dealing with raw pixels as features) CNNs
> will almost always be better.

There is another important aspect: CNNs are meant to work on data that is translation invariant. Not all images are, for instance not brain images, because they have been realigned. That said, most images are.

G

From avisochek3 at gmail.com Tue Dec 6 08:05:21 2016
From: avisochek3 at gmail.com (Allan Visochek)
Date: Tue, 6 Dec 2016 08:05:21 -0500
Subject: [scikit-learn] Markov Clustering?
In-Reply-To:
References: <20161205134552.GK2327874@phare.normalesup.org>
Message-ID:

At its core, Markov clustering is a graph algorithm: it operates on a sparse similarity matrix (essentially, by simulating flow between the data points). This makes it useful for similarity graphs that don't originate from features (e.g. protein-protein interaction networks). Because the graph is based on similarity, though, it's definitely possible to use it as a data clustering algorithm that takes a similarity metric as an argument.

I suppose it could be implemented so that the algorithm could take either a sparse similarity matrix or a set of features as its first argument. This would keep the same structure as the other clustering algorithms, but also allow use with pure similarity graphs. Does this make sense?

On Mon, Dec 5, 2016 at 8:57 AM, Andy wrote:
>
> On 12/05/2016 08:51 AM, Raphael C wrote:
>
>> And...
>>
>> [1] Stijn van Dongen. Graph Clustering by Flow Simulation.
PhD >> thesis, University of Utrecht, May 2000. >> http://www.library.uu.nl/digiarchief/dip/diss/1895620/inhoud.htm >> >> has >> >> 1201 citations. >> >> I think it's fair to say the method is very widely known and used. >> >> Ok cool. > I haven't looked at it, my question is now whether this is more of a > "graph clustering" > or a "data clustering" approach, though that distinction is not very clear. > Some of the papers compare it against affinity propagation, which we do > have implemented. > > If this algorithm makes sense for knn graphs or similar methods we > implemented in SpectralClustering, > then I guess go for it? > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Tue Dec 6 09:39:51 2016 From: t3kcit at gmail.com (Andy) Date: Tue, 6 Dec 2016 09:39:51 -0500 Subject: [scikit-learn] Fwd: Scikit-learn MLPRegressor Help In-Reply-To: <20161206112718.GD2935632@phare.normalesup.org> References: <20161203102926.GG455403@phare.normalesup.org> <20161206112718.GD2935632@phare.normalesup.org> Message-ID: <7dbfa52e-d033-6dee-3823-830da315345b@gmail.com> On 12/06/2016 06:27 AM, Gael Varoquaux wrote: > On Sat, Dec 03, 2016 at 10:35:03PM +0000, federico vaggi wrote: >> As long as the feature ordering has a meaningful spatial component (as is >> almost always the case when you are dealing with raw pixels as features) CNNs >> will almost always be better. > There is another important aspect: CNN are meant to work on data that is > translation invariant. Not all images are, for instance not brain images, > because they have been realigned. That said, most images are. > They also work on images that are not globally translation invariant. 
Clearly you are the expert in brain images but the point is more that the same feature detector make sense in multiple locations. So as long as the local statistics and patterns are similar, I'd expect it to work. From t3kcit at gmail.com Tue Dec 6 09:36:50 2016 From: t3kcit at gmail.com (Andy) Date: Tue, 6 Dec 2016 09:36:50 -0500 Subject: [scikit-learn] Markov Clustering? In-Reply-To: References: <20161205134552.GK2327874@phare.normalesup.org> Message-ID: On 12/06/2016 08:05 AM, Allan Visochek wrote: > At it's core, Markov clustering is a graph algorithm, it operates on a > sparse similarity matrix (essentially, by simulating flow between the > data points). This makes it useful for similarity graphs that don't > originate from features (i.e. protien-protien interaction networks). > Because the graph is based on similarity though, its definitely > possible to use it as a data clustering algorithm that takes a > similarity metric as an argument. > > I suppose it could be implemented so that the algorithm could take > either a sparse similarity matrix or a set of features as its first > argument. This would keep the same structure of the other clustering > algorithms, but also allow use with pure similarity graphs. Does this > make sense? > Yeah that's also how the other algorithms work. From t3kcit at gmail.com Tue Dec 6 10:19:47 2016 From: t3kcit at gmail.com (Andy) Date: Tue, 6 Dec 2016 10:19:47 -0500 Subject: [scikit-learn] Github project management tools In-Reply-To: References: <20161203105215.GH455403@phare.normalesup.org> <43b13054-ef73-54b3-0def-d138e814823d@gmail.com> Message-ID: Thanks for your thoughts. I'm working in a similar mode, though I kind of try to avoid too much last-in first-out - I do it too, though, because I'm trying to keep up with all notifications. However, there are many older PRs and issues that are important bug-fixes and they get lost because of some minor new feature being added. 
Your point about faster communication in recent issues is taken, though.

But I feel we should prioritize bug fixes much more - they do need more brain power to review, though :-/

From ragvrv at gmail.com Tue Dec 6 11:26:32 2016
From: ragvrv at gmail.com (Raghav R V)
Date: Tue, 6 Dec 2016 17:26:32 +0100
Subject: [scikit-learn] Github project management tools
In-Reply-To:
References: <20161203105215.GH455403@phare.normalesup.org> <43b13054-ef73-54b3-0def-d138e814823d@gmail.com>
Message-ID:

+1 for self assigning PRs by reviewers...

On Tue, Dec 6, 2016 at 4:19 PM, Andy wrote:
> Thanks for your thoughts.
> I'm working in a similar mode, though I kind of try to avoid too much
> last-in first-out - I do it too, though,
> because I'm trying to keep up with all notifications.
> However, there are many older PRs and issues that are important bug-fixes
> and they get lost because of some minor new feature being added.
> Your point about faster communication in recent issues is taken, though.
>
> But I feel we should prioritize bug fixes much more - they do need more
> brain power to review, though :-/
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>

-- 
Raghav RV
https://github.com/raghavrv
-------------- next part --------------
An HTML attachment was scrubbed...

From se.raschka at gmail.com Tue Dec 6 11:59:18 2016
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Tue, 6 Dec 2016 11:59:18 -0500
Subject: [scikit-learn] question in using Scikit-learn MLPClassifier?
In-Reply-To: <265E382B26F78742B972BE038FD292C6A9F769@fzex2.ruijie.com.cn>
References: <20161203102926.GG455403@phare.normalesup.org> <265E382B26F78742B972BE038FD292C6A9F769@fzex2.ruijie.com.cn>
Message-ID:

Hi,

typically, you want/need to play around with the hyperparameters if you want to get something useful out of an MLP; they rarely work out of the "box" since hyperparameters are very context-dependent.

> However, the accuracy rate is not satisfied comparing to the result in Matlab which use BP algorithm too, I wonder if I should tune the parameter of MLP for better?

Things you may want to try first:

a) Check if the training converged, i.e., check clf.loss_ for e.g. 200, 2000, 5000 iterations. If the loss is noticeably smaller after 5000 iterations (compared to 2000 iterations), it would tell you that it hasn't converged yet. Stochastic gradient descent in particular is very sensitive to the initial learning rate, so I also suggest that you try different values for it. Also, try to use a fixed random seed for reproducibility between runs, e.g., random_state=123.

b) If you are using stochastic gradient descent with a logistic activation function, you may want to scale your input features via the StandardScaler so that the features are centered at 0 with std. dev. 1. E.g.,

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

Good luck!
Sebastian

> On Dec 6, 2016, at 6:12 AM, linjia at ruijie.com.cn wrote:
>
> Hi all,
> I uses a "Car Evaluation" dataset from http://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data to test the effect of MLP. (I transfer some class in the data to digit value, e.g. "low" to 1, "med" to 2, "high" to 3, the final dataset's input is 6 dimension, output label is 4 dimension)
> However, the accuracy rate is not satisfied comparing to the result in Matlab which use BP algorithm too, I wonder if I should tune the parameter of MLP for better?
>
> Attachment:
>
> main code in matlab: accuracy 100% after train
> net=newff([-1 1;-1 1;-1 1;-1 1;-1 1;-1 1;],[10 4],{'tansig','logsig'},'trainlm');
>
> main code in MLP Code: accuracy 70% after fit
> clf = MLPClassifier(solver='sgd', activation='logistic', max_iter=2000, learning_rate='adaptive',warm_start = True)
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From chinmay0301 at gmail.com Wed Dec 7 07:30:27 2016
From: chinmay0301 at gmail.com (Chinmay Talegaonkar)
Date: Wed, 7 Dec 2016 18:00:27 +0530
Subject: [scikit-learn] New to scikit
Message-ID:

Hi everyone,

I have prior experience in Python, and have started learning machine learning recently. I would like to contribute to scikit; can anyone suggest a relatively easy codebase to explore?

Thanks in advance!
-------------- next part --------------
An HTML attachment was scrubbed...

From ludo25_90 at hotmail.com Wed Dec 7 07:41:27 2016
From: ludo25_90 at hotmail.com (Ludovico Coletta)
Date: Wed, 7 Dec 2016 12:41:27 +0000
Subject: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit
In-Reply-To:
References: ,
Message-ID:

Dear scikit experts,

I did as you suggested, but it is not exactly what I would like to do (I also read this: http://stackoverflow.com/questions/40400351/nested-cross-validation-with-stratifiedshufflesplit-in-sklearn).

Perhaps I should ask my question in another way: is it possible to split the nested cv folds just once? It seems to me that this is not possible; do you have any hints?

Thank you for your time
Ludovico

________________________________
From: Ludovico Coletta
Sent: Monday, 5 December 2016 15:42
To: scikit-learn at python.org
Subject: Re: Nested Leave One Subject Out (LOSO) cross validation with scikit

thank you for the quick answer!
The problem is that I have a different number of features for each cv fold, therefore I thought that I had to handle each cv fold separately. I did as you suggested, but the training set of the outer cv is then further split 3 times (stratified kfold), which I think is suboptimal (for feature selection I indeed implemented a nested LOSO). One question: would it be so bad if I had nested LOSO for feature selection but the default stratified kfold for hyperparameter optimization? It would be some kind of double dipping in the nested cv, but the final set left out for test is not concerned.

The other point is that maybe I got something wrong in the whole process. I have 26 subjects.

CV 1: subject 26 is left out for the final test, subjects 1:24 are used for hyperparameter optimization, subject 25 is used to select the best hyperparameters.
CV 2: subject 1 is left out for the final test, subjects 2:25 are used for hyperparameter optimization, subject 26 is used to select the best hyperparameters.

And so on until the end. Is that correct?

Sorry for the trivial questions, but I am quite a beginner with both Python and ML.

Best
Ludovico

________________________________
From: scikit-learn on behalf of scikit-learn-request at python.org
Sent: Monday, 5 December 2016 14:54
To: scikit-learn at python.org
Subject: scikit-learn Digest, Vol 9, Issue 15

Send scikit-learn mailing list submissions to
scikit-learn at python.org

To subscribe or unsubscribe via the World Wide Web, visit
https://mail.python.org/mailman/listinfo/scikit-learn
or, via email, send a message with subject or body 'help' to
scikit-learn-request at python.org

You can reach the person managing the list at
scikit-learn-owner at python.org

When replying, please edit your Subject line so it is more specific than "Re: Contents of scikit-learn digest..."

Today's Topics:

1. Re: Markov Clustering? (Gael Varoquaux)
2. Re: Markov Clustering? (Raphael C)
3. Re: Nested Leave One Subject Out (LOSO) cross validation with scikit (Andy)
4. Re: Nested Leave One Subject Out (LOSO) cross validation with scikit (Andy)

----------------------------------------------------------------------

Message: 1
Date: Mon, 5 Dec 2016 14:45:52 +0100
From: Gael Varoquaux
To: Scikit-learn user and developer mailing list
Subject: Re: [scikit-learn] Markov Clustering?
Message-ID: <20161205134552.GK2327874 at phare.normalesup.org>
Content-Type: text/plain; charset=iso-8859-1

Interestingly, a couple of days before this thread was started a researcher in a top lab of a huge private-sector company had mentioned to me that they found this algorithm very useful in practice (sorry for taking time to point this out, I just needed to check with him that indeed it was this specific algorithm).

G

On Sun, Dec 04, 2016 at 08:18:54AM +0000, Raphael C wrote:
> I think you get a better view of the importance of Markov Clustering in
> academia from https://scholar.google.co.uk/scholar?hl=en&as_sdt=0,5&q=Markov+clustering .
> Raphael
> On Sat, 3 Dec 2016 at 22:43 Allan Visochek wrote:
> Thanks for pointing that out, I sort of picked it up by word of mouth so
> I'd assumed it had a bit more precedence in the academic world.
> I'll look into it a little more, but I'd definitely be interested in
> contributing something else if that doesn't work out.
> -Allan
> On Sat, Dec 3, 2016 at 4:45 PM, Andy wrote:
> Hey Allan.
> None of the references apart from the last one seems to be published in
> a peer-reviewed place, is that right?
> And "A stochastic uncoupling process for graphs" has 13 citations since
> 2000. Unless there is a more prominent
> publication or evidence of heavy use, I think it's disqualified.
> Academia is certainly not the only metric for evaluation, so if you
> have others, that's good, too ;)
> Best,
> Andy
> On 12/03/2016 04:33 PM, Allan Visochek wrote:
> Hey Andy,
> This algorithm does operate on sparse graphs so it may be beyond
> the scope of scikit-learn, let me know what you think.
> The website is here, it includes a brief description of how the
> algorithm operates under Documentation -> Overview1 and Overview2.
> The references listed on the website are included below.
> Best,
> -Allan
> [1] Stijn van Dongen. Graph Clustering by Flow Simulation. PhD
> thesis, University of Utrecht, May 2000.
> http://www.library.uu.nl/digiarchief/dip/diss/1895620/inhoud.htm
> [2] Stijn van Dongen. A cluster algorithm for graphs. Technical
> Report INS-R0010, National Research Institute for Mathematics and
> Computer Science in the Netherlands, Amsterdam, May 2000.
> http://www.cwi.nl/ftp/CWIreports/INS/INS-R0010.ps.Z
> [3] Stijn van Dongen. A stochastic uncoupling process for graphs.
> Technical Report INS-R0011, National Research Institute for
> Mathematics and Computer Science in the Netherlands, Amsterdam, May
> 2000.
> http://www.cwi.nl/ftp/CWIreports/INS/INS-R0011.ps.Z
> [4] Stijn van Dongen. Performance criteria for graph clustering and
> Markov cluster experiments. Technical Report INS-R0012, National
> Research Institute for Mathematics and Computer Science in the
> Netherlands, Amsterdam, May 2000.
> http://www.cwi.nl/ftp/CWIreports/INS/INS-R0012.ps.Z
> [5] Enright A.J., Van Dongen S., Ouzounis C.A. An efficient
> algorithm for large-scale detection of protein families, Nucleic
> Acids Research 30(7):1575-1584 (2002).
> On Sat, Dec 3, 2016 at 3:34 PM, Andy wrote:
> Hi Allan.
> Can you provide the original paper?
> Is this something usually used on sparse graphs?
> We do have algorithms that operate on data-induced
> graphs, like SpectralClustering, but we don't really implement
> general graph algorithms (there's no PageRank or community
> detection).
> Andy
> On 12/03/2016 12:19 PM, Allan Visochek wrote:
> Hi there,
> My name is Allan Visochek, I'm a data scientist and web
> developer and I love scikit-learn so first of all, thanks
> so much for the work that you do.
> I'm reaching out because I've found the Markov clustering
> algorithm to be quite useful for me in some of my work and
> noticed that there is no implementation in scikit-learn. Is
> anybody working on this? If not, I'd be happy to take this
> on. I'm new to open source, but I've been working with
> Python for a few years now.
> Best,
> -Allan
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
--
Gael Varoquaux
Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info http://twitter.com/GaelVaroquaux

------------------------------

Message: 2
Date: Mon, 5 Dec 2016 13:51:26 +0000
From: Raphael C
To: Scikit-learn user and developer mailing list
Subject: Re: [scikit-learn] Markov Clustering?
Message-ID:
Content-Type: text/plain; charset=UTF-8

And...

[1] Stijn van Dongen. Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, May 2000.
http://www.library.uu.nl/digiarchief/dip/diss/1895620/inhoud.htm

has 1201 citations. I think it's fair to say the method is very widely known and used.

Raphael

------------------------------

Message: 3
Date: Mon, 5 Dec 2016 08:51:47 -0500
From: Andy
To: Scikit-learn user and developer mailing list
Subject: Re: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit
Message-ID: <2864ea23-e6ca-cf83-599f-f8ec149e8d67 at gmail.com>
Content-Type: text/plain; charset=windows-1252; format=flowed

On 12/04/2016 04:27 PM, Raghav R V wrote:
> Hi!
>
> It looks like you are using the old sklearn.cross_validation's
> LeaveOneLabelOut cross-validator. It has been deprecated since v0.18.
>
> Use the LeaveOneLabelOut from sklearn.model_selection, that should
> fix your issue I think (though I have not looked into your code in
> detail).
>
You mean LeaveOneGroupOut, right?

------------------------------

Message: 4
Date: Mon, 5 Dec 2016 08:54:01 -0500
From: Andy
To: Scikit-learn user and developer mailing list
Subject: Re: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit
Message-ID: <6e970af0-faeb-bf81-3e9f-28dcc5df9168 at gmail.com>
Content-Type: text/plain; charset="windows-1252"; Format="flowed"

I'm not sure what the issue with your custom CV is, but this seems like a complicated way to implement this. Try model_selection.LeaveOneGroupOut, which directly implements LOSO.

On 12/04/2016 03:12 PM, Ludovico Coletta wrote:
> Dear scikit experts,
>
> I'm struggling with the implementation of a nested cross validation.
>
> My data: I have 26 subjects (13 per class) x 6670 features. I used a
> feature reduction algorithm (you may have heard about Boruta) to
> reduce the dimensionality of my data. Problems start now: I defined
> LOSO as outer partitioning schema. Therefore, for each of the 26 cv
> folds I used 24 subjects for feature reduction. This led to a
> different number of features in each cv fold. Now, for each cv fold I
> would like to use the same 24 subjects for hyperparameter optimization
> (SVM with rbf kernel).
>
> This is what I did:
>
> cv = list(LeaveOneOut(len(y))) # in y I stored the labels
>
> inner_train = [None] * len(y)
>
> inner_test = [None] * len(y)
>
> ii = 0
>
> while ii < len(y):
>     cv = list(LeaveOneOut(len(y)))
>     a = cv[ii][0]
>     a = a[:-1]
>     inner_train[ii] = a
>
>     b = cv[ii][0]
>     b = np.array(b[((len(cv[0][0]))-1)])
>     inner_test[ii]=b
>
>     ii = ii + 1
>
> custom_cv = zip(inner_train,inner_test) # inner cv
>
> pipe_logistic = Pipeline([('scl', StandardScaler()),('clf', SVC(kernel="rbf"))])
>
> parameters = [{'clf__C': np.logspace(-2, 10, 13), 'clf__gamma':np.logspace(-9, 3, 13)}]
>
> scores = [None] * (len(y))
>
> ii = 0
>
> while ii < len(scores):
>
>     a = data[ii][0] # data for train
>     b = data[ii][1] # data for test
>     c = np.concatenate((a,b)) # shape: number of subjects * number of features
>     d = cv[ii][0] # labels for train
>     e = cv[ii][1] # label for test
>     f = np.concatenate((d,e))
>
>     grid_search = GridSearchCV(estimator=pipe_logistic, param_grid=parameters, verbose=1, scoring='accuracy', cv= zip(([custom_cv[ii][0]]), ([custom_cv[ii][1]])))
>
>     scores[ii] = cross_validation.cross_val_score(grid_search, c, y[f], scoring='accuracy', cv = zip(([cv[ii][0]]), ([cv[ii][1]])))
>
>     ii = ii + 1
>
> However, I got the following error message: index 25 is out of bounds
> for size 25
>
> Would it be so bad if I do not perform a nested LOSO but I use the
> default setting for hyperparameter optimization?
>
> Any help would be really appreciated
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

------------------------------

Subject: Digest Footer

_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn

------------------------------

End of scikit-learn Digest, Vol 9, Issue 15
*******************************************
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From siddharthgupta234 at gmail.com Wed Dec 7 09:35:51 2016
From: siddharthgupta234 at gmail.com (Siddharth Gupta)
Date: Wed, 7 Dec 2016 20:05:51 +0530
Subject: [scikit-learn] New to scikit
In-Reply-To:
References:
Message-ID:

Great! Welcome to the community. I would suggest checking out the issues page on the GitHub repo, raising your hand on the issues you feel you can have a go at, and looking at the issues that are tagged as needing a contributor. Issues are a good way to start; they will point you to the areas of the codebase to explore.

On Dec 7, 2016 6:02 PM, "Chinmay Talegaonkar" wrote:

> Hi everyone,
> I have prior experience in Python and have recently started learning
> machine learning. I would like to contribute to scikit-learn; can
> anyone suggest a relatively easy part of the codebase to explore?
>
> Thanks in advance!
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From chinmay0301 at gmail.com Wed Dec 7 09:42:49 2016
From: chinmay0301 at gmail.com (Chinmay Talegaonkar)
Date: Wed, 7 Dec 2016 20:12:49 +0530
Subject: [scikit-learn] New to scikit
In-Reply-To:
References:
Message-ID:

Yeah, I found an easy bug. I am looking for some help in writing the deprecation cycle for it.

On Wed, Dec 7, 2016 at 8:05 PM, Siddharth Gupta wrote:

> Great! Welcome to the community. I would suggest checking out the
> issues page on the GitHub repo, raising your hand on the issues you feel
> you can have a go at, and looking at the issues that are tagged as needing
> a contributor. Issues are a good way to start; they will point you to the
> areas of the codebase to explore.
>
> On Dec 7, 2016 6:02 PM, "Chinmay Talegaonkar" wrote:
>
>> Hi everyone,
>> I have prior experience in Python and have recently started learning
>> machine learning. I would like to contribute to scikit-learn; can
>> anyone suggest a relatively easy part of the codebase to explore?
>>
>> Thanks in advance!
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>

--
*Chinmay Talegaonkar*
Cultural and Events Coordinator, Mood Indigo
..............................................
+91-8879178724
chinmay0301 at gmail.com
www.moodi.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From dylkot at gmail.com Wed Dec 7 10:22:21 2016
From: dylkot at gmail.com (Dylan Kotliar)
Date: Wed, 7 Dec 2016 10:22:21 -0500
Subject: [scikit-learn] Latent Dirichlet Allocation transformation of data with pre-determined topic_word distribution
Message-ID:

Hello,

I am running Latent Dirichlet Allocation 100 times on bootstrapped versions of a dataset, gathering the topic_word matrix from each run (components_), and merging them into a final, cleaner topic_word matrix. Because I am bootstrapping documents, not every document is in every run, so it isn't clear how to get a final merged doc_topic distribution. I was wondering if there is any way to run the LatentDirichletAllocation transform method with a pre-determined components_ matrix. I tried this in a few ways, none of which worked.

    from sklearn.decomposition import LatentDirichletAllocation as skLDA
    mod = skLDA(n_topics=7, learning_method='batch', doc_topic_prior=.1, topic_word_prior=.1, evaluate_every=1)
    mod.components_ = median_beta # my collapsed estimates of this matrix
    topic_usage = mod.transform(word_matrix)

crashes with:

    AttributeError: 'LatentDirichletAllocation' object has no attribute 'exp_dirichlet_component_'

I tried to correct this with:

    mod.components_ = median_beta
    mod.exp_dirichlet_component_ = np.exp(_dirichlet_expectation_2d(mod.components_))
    mod._init_latent_vars(components_.shape[1])

Now transform runs to completion, but the results don't match in the least what I would expect after looking at multiple LDA runs.

Note that this kind of functionality is available for NMF, where you can run:

    (W, H, niter) = non_negative_factorization(wordmatrix, H=median_beta, n_components=median_beta.shape[0], update_H=False)

Thanks for any insight or help you can provide.

Best,
Dylan
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From t3kcit at gmail.com Wed Dec 7 11:33:38 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 7 Dec 2016 11:33:38 -0500
Subject: [scikit-learn] New to scikit
In-Reply-To:
References:
Message-ID:

http://scikit-learn.org/dev/developers/contributing.html#deprecation

On 12/07/2016 09:42 AM, Chinmay Talegaonkar wrote:
> Yeah, I found an easy bug. Looking for some help in writing
> deprecation cycles for a bug.
>
> On Wed, Dec 7, 2016 at 8:05 PM, Siddharth Gupta wrote:
>
> Great! Welcome to the community. I would suggest you to check out
> the issues page on the github repo, raise hand to the issues you
> feel like you can give a go to, check out the issues that are
> tagged as require contributor. Issues are a good way to start,
> they will direct you about the areas of the code base to explore.
>
> On Dec 7, 2016 6:02 PM, "Chinmay Talegaonkar" wrote:
>
> Hi everyone,
> I have a prior experience in python, and
> have started learning machine learning recently. I wanted to
> contribute to scikit, can anyone suggest a relatively easy
> codebase to explore.
> Thanks in advance!
>
> --
> *Chinmay Talegaonkar*
> Cultural and Events Coordinator, Mood Indigo
> ..............................................
> +91-8879178724
> chinmay0301 at gmail.com
> www.moodi.org
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From t3kcit at gmail.com Wed Dec 7 11:33:00 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 7 Dec 2016 11:33:00 -0500
Subject: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit
In-Reply-To:
References:
Message-ID:

On 12/07/2016 07:41 AM, Ludovico Coletta wrote:
>
> Dear scikit experts,
>
> I did as you suggested, but it is not exactly what I would like to do
> ( I also read this:
> http://stackoverflow.com/questions/40400351/nested-cross-validation-with-stratifiedshufflesplit-in-sklearn)
>
> Perhaps I should ask my question in another way: it is possible to
> split the nested cv folds just once? It seems to me that this is not
> possible, do you have any hints?
>
Not sure I understand your question. You can do a single split by using ShuffleSplit(n_splits=1), for example.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From nilay.euler16 at gmail.com Wed Dec 7 11:44:18 2016
From: nilay.euler16 at gmail.com (Nilay Shrivastava)
Date: Wed, 7 Dec 2016 22:14:18 +0530
Subject: [scikit-learn] return type of StandardScaler
Message-ID:

StandardScaler returns a NumPy array even if the object passed is a pandas DataFrame; shouldn't it return a DataFrame?
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From bharat.didwania.eee14 at itbhu.ac.in Wed Dec 7 11:48:47 2016
From: bharat.didwania.eee14 at itbhu.ac.in (Bharat Didwania .)
Date: Wed, 7 Dec 2016 08:48:47 -0800
Subject: [scikit-learn] return type of StandardScaler
In-Reply-To:
References:
Message-ID:

You can use pandas.get_dummies(). It will perform one-hot encoding on categorical columns and produce a DataFrame as the result. From there you can use pandas.concat([existing_df, new_df], axis=1) to add the new columns to your existing DataFrame. This will avoid the use of a NumPy array.
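
For the scaling question itself, a minimal sketch of one possible workaround is to re-attach the labels by hand after scaling. The toy column names and index labels below are made up for illustration; only `StandardScaler.fit_transform` and the `pandas.DataFrame` constructor are relied on.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame; the column names and index labels are assumptions for illustration.
df = pd.DataFrame(
    {"score1": [0.56, 0.34, 0.42, 0.12], "score2": [0.34, 0.27, 0.24, 0.05]},
    index=["a", "b", "c", "d"],
)

scaler = StandardScaler()
# fit_transform returns a plain NumPy array, so the column names and
# index are put back explicitly.
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)
```

The scaler is unchanged by this; only its output is wrapped. Later scikit-learn releases (1.2 onward) also provide `set_output(transform="pandas")` on transformers, which may make the manual wrapping unnecessary.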

On Wed, Dec 7, 2016 at 8:44 AM, Nilay Shrivastava wrote:

> StandardScaler returns numpy array even if the object passed is a pandas
> dataframe, shouldn't it return a dataframe?
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From t3kcit at gmail.com Wed Dec 7 12:06:06 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 7 Dec 2016 12:06:06 -0500
Subject: [scikit-learn] return type of StandardScaler
In-Reply-To:
References:
Message-ID: <10cb2c89-4c53-304c-6a38-606e7935f024@gmail.com>

On 12/07/2016 11:44 AM, Nilay Shrivastava wrote:
> StandardScaler returns numpy array even if the object passed is a
> pandas dataframe, shouldn't it return a dataframe?
>
See https://github.com/scikit-learn/scikit-learn/issues/5523
sklearn-pandas might be of help for now.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ludo25_90 at hotmail.com Wed Dec 7 15:13:28 2016
From: ludo25_90 at hotmail.com (Ludovico Coletta)
Date: Wed, 7 Dec 2016 20:13:28 +0000
Subject: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit
In-Reply-To:
References:
Message-ID:

Thank you for the answer. I also thought about ShuffleSplit(n_splits=1), but I need to control which indices are used for training and which for testing in the nested folds. The problem is that I did feature selection before hyperparameter optimization (with a nested leave-one-out schema), and now I need the same partitioning for hyperparameter optimization. The reason I did this is that the feature selection step is incredibly slow; I hope I can get rid of that step in the permutation test. It is not clear to me whether I have to include feature selection in the permutation test as well. Maybe LeavePOut is what I need.
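
A minimal sketch of the fixed partitioning described in this thread, using only the standard library. It assumes one row per subject (as in the 26 x 6670 data from the earlier message) and that the subject used to select hyperparameters is the one just before the held-out test subject, wrapping around at the start. `nested_loso_splits` is a hypothetical helper name, not a scikit-learn function.

```python
def nested_loso_splits(n_subjects):
    """Yield (outer_train, outer_test, inner_train, inner_val) per outer fold.

    outer_train and outer_test hold row indices into the full data set.
    inner_train and inner_val hold positions within outer_train, because an
    inner search only ever sees the rows selected by outer_train.
    """
    for test in range(n_subjects):
        # Validation subject: the one before the test subject, wrapping around.
        val = (test - 1) % n_subjects
        outer_train = [s for s in range(n_subjects) if s != test]
        inner_val = [outer_train.index(val)]
        inner_train = [i for i in range(len(outer_train)) if i != inner_val[0]]
        yield outer_train, [test], inner_train, inner_val


folds = list(nested_loso_splits(26))
```

scikit-learn's `cv` parameter accepts an iterable of (train, test) index pairs, so `[(inner_train, inner_val)]` could be passed to `GridSearchCV` when it is fitted on the rows `X[outer_train]`; that wiring is left out here, and per-fold feature selection would still have to be applied to each fold's columns separately.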

Best
Ludovico

From t3kcit at gmail.com  Wed Dec  7 16:06:29 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 7 Dec 2016 16:06:29 -0500
Subject: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit
In-Reply-To:
References:
Message-ID:

PredefinedSplit allows you to define your own split, but I think it's better if you look at this PR:
https://github.com/scikit-learn/scikit-learn/pull/7990

That allows you to cache steps in a pipeline so you don't have to recompute the feature selection for each parameter.

On 12/07/2016 03:13 PM, Ludovico Coletta wrote:
>
> Thank you for the answer.
>
> I also thought about ShuffleSplit (n_splits=1), but I need to control
> which indices are used for training and which for testing in the
> nested folds. The problem is that I did feature selection before
> hyperparameters optimization (with a nested Leave One Out schema) and
> now I need the same partitioning for hyperparameters optimization. The
> reason why I did this is that the feature selection step is incredibly
> slow, I hope I can get rid of that step in the permutation test. Is
> not clear to me if I have to include feature selection in the
> permutation test as well.
>
> Maybe LeavePOut is what I need.
>
> Best
>
> Ludovico

From ludo25_90 at hotmail.com  Wed Dec  7 17:25:14 2016
From: ludo25_90 at hotmail.com (Ludovico Coletta)
Date: Wed, 7 Dec 2016 22:25:14 +0000
Subject: [scikit-learn] Nested Leave One Subject Out (LOSO) cross validation with scikit
In-Reply-To:
References:
Message-ID:

Thank you for the good news!

Best
Ludovico
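
For readers following the LOSO thread: a minimal sketch of how a fixed, reusable partitioning could be built, assuming scikit-learn >= 0.18 (sklearn.model_selection) and NumPy are installed. The toy arrays and the `subjects` vector are hypothetical and are not Ludovico's data. PredefinedSplit takes one test-fold index per sample, so passing the subject id of each sample gives one fold per subject; LeaveOneGroupOut produces the same folds from a `groups` argument. Neither splitter is randomized, so the same indices come back every time they are used.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, PredefinedSplit

# Hypothetical toy data: 6 samples recorded from 3 subjects (2 samples each).
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
subjects = np.array([0, 0, 1, 1, 2, 2])

# test_fold[i] is the index of the test fold that sample i belongs to,
# so using the subject id as the fold index leaves one subject out per fold.
predefined = [
    (train.tolist(), test.tolist())
    for train, test in PredefinedSplit(test_fold=subjects).split()
]

# The same partitioning expressed with a groups vector.
by_group = [
    (train.tolist(), test.tolist())
    for train, test in LeaveOneGroupOut().split(X, y, groups=subjects)
]

for train_idx, test_idx in predefined:
    print("train:", train_idx, "test:", test_idx)
```

Either splitter object can be passed as the `cv` argument of a search or scoring helper, which is one way to keep the feature selection and the hyperparameter search on identical folds.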

From tevang3 at gmail.com  Wed Dec  7 18:07:35 2016
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Thu, 8 Dec 2016 00:07:35 +0100
Subject: [scikit-learn] NuSVC and ValueError: specified nu is infeasible
Message-ID:

Greetings,

I want to use the Nu-Support Vector Classifier with the following input data:

X= [
array([  3.90387012,   1.60732281,  -0.33315799,   4.02770896,
         1.82337731,  -0.74007214,   6.75989219,   3.68538903,
         ..................
         0.        ,  11.64276776,   0.        ,   0.        ]),
array([  3.36856769e+00,   1.48705816e+00,   4.28566992e-01,
         3.35622071e+00,   1.64046508e+00,   5.66879661e-01,
         .....................
         4.25335335e+00,   1.96508829e+00,   8.63453394e-06]),
array([  3.74986249e+00,   1.69060713e+00,  -5.09921270e-01,
         3.76320781e+00,   1.67664455e+00,  -6.21126735e-01,
         ..........................
         4.16700259e+00,   1.88688784e+00,   7.34729942e-06]),
.......
]

and

Y= [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0,
    ............................
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0]

Each array of X contains 60 numbers and the dataset consists of 48 positive and 1230 negative observations. When I train an svm.SVC() classifier I get quite good predictions, but with svm.NuSVC() I keep getting the following error no matter which value of nu in [0.1, ..., 0.9, 0.99, 0.999, 0.9999] I try:

/usr/local/lib/python2.7/dist-packages/sklearn/svm/base.pyc in fit(self, X, y, sample_weight)
    187
    188         seed = rnd.randint(np.iinfo('i').max)
--> 189         fit(X, y, sample_weight, solver_type, kernel, random_seed=seed)
    190         # see comment on the other call to np.iinfo in this file
    191

/usr/local/lib/python2.7/dist-packages/sklearn/svm/base.pyc in _dense_fit(self, X, y, sample_weight, solver_type, kernel, random_seed)
    254                 cache_size=self.cache_size, coef0=self.coef0,
    255                 gamma=self._gamma, epsilon=self.epsilon,
--> 256                 max_iter=self.max_iter, random_seed=random_seed)
    257
    258         self._warn_from_fit_status()

/usr/local/lib/python2.7/dist-packages/sklearn/svm/libsvm.so in sklearn.svm.libsvm.fit (sklearn/svm/libsvm.c:2501)()

ValueError: specified nu is infeasible

Does anyone know what might be wrong? Could it be the input data?

thanks in advance for any advice
Thomas

From tevang3 at gmail.com  Wed Dec  7 18:45:25 2016
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Thu, 8 Dec 2016 00:45:25 +0100
Subject: [scikit-learn] no positive predictions by neural_network.MLPClassifier
Message-ID:

I tried sklearn.neural_network.MLPClassifier with the default parameters on the input data I quoted in my previous post about the Nu-Support Vector Classifier. The predictions are great, but sometimes when I rerun the MLPClassifier it predicts no positive observations (class 1). I have noticed that this can be controlled by the random_state parameter; e.g. MLPClassifier(random_state=0) always gives no positive predictions. My question is: how can I choose the right random_state value in a real blind test case?

thanks in advance
Thomas

From se.raschka at gmail.com  Wed Dec  7 19:19:28 2016
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Wed, 7 Dec 2016 19:19:28 -0500
Subject: [scikit-learn] no positive predictions by neural_network.MLPClassifier
In-Reply-To:
References:
Message-ID: <73C79E45-46C6-4BCA-8284-C848942706CE@gmail.com>

Hi, Thomas,

we had a related thread on the email list some time ago; let me post it for reference further below. Regarding your question, I think you may want to make sure that you standardized the features (which generally makes the learning less sensitive to the learning rate and the random weight initialization). However, even then, I would try at least 1-3 different random seeds and look at the cost vs. time; what can happen is that you land in different minima depending on the weight initialization, as demonstrated in the example below (in MLPs you have the problem of a complex cost surface).

Best,
Sebastian

> The default is set to 100 units in the hidden layer, but theoretically, it should work with 2 hidden logistic units (I think that's the typical textbook/class example). I think what happens is that it gets stuck in local minima depending on the random weight initialization. E.g., the following works just fine:
>
> from sklearn.neural_network import MLPClassifier
> X = [[0, 0], [0, 1], [1, 0], [1, 1]]
> y = [0, 1, 1, 0]
> clf = MLPClassifier(solver='lbfgs',
>                     activation='logistic',
>                     alpha=0.0,
>                     hidden_layer_sizes=(2,),
>                     learning_rate_init=0.1,
>                     max_iter=1000,
>                     random_state=20)
> clf.fit(X, y)
> res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]])
> print(res)
> print(clf.loss_)
>
> but changing the random seed to 1 leads to:
>
> [0 1 1 1]
> 0.34660921283
>
> For comparison, I used a more vanilla MLP (1 hidden layer with 2 units and logistic activation as well; https://github.com/rasbt/python-machine-learning-book/blob/master/code/ch12/ch12.ipynb), essentially resulting in the same problem:

> On Dec 7, 2016, at 6:45 PM, Thomas Evangelidis wrote:
>
> I tried the sklearn.neural_network.MLPClassifier with the default parameters using the input data I quoted in my previous post about Nu-Support Vector Classifier. The predictions are great but the problem is that sometimes when I rerun the MLPClassifier it predicts no positive observations (class 1). I have noticed that this can be controlled by the random_state parameter, e.g. MLPClassifier(random_state=0) gives always no positive predictions. My question is how can I chose the right random_state value in a real blind test case?
>
> thanks in advance
> Thomas
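
The advice in this reply (standardize the features, then compare a few random seeds) can be sketched as follows. This is only an illustration on the four-point XOR data from the quoted thread, assuming a scikit-learn version that ships MLPClassifier and make_pipeline; the seed range 0-4 and the name `losses` are arbitrary choices made here. It ranks seeds by the final training loss, which is enough to show the seed dependence; for a real blind test the comparison would more sensibly use held-out data.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# XOR toy problem from the quoted thread.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

losses = {}
for seed in range(5):  # arbitrary handful of seeds
    model = make_pipeline(
        StandardScaler(),  # zero mean, unit variance per feature
        MLPClassifier(solver="lbfgs", activation="logistic", alpha=0.0,
                      hidden_layer_sizes=(2,), max_iter=1000,
                      random_state=seed),
    )
    model.fit(X, y)
    # loss_ is the final training loss of the fitted network.
    losses[seed] = model.named_steps["mlpclassifier"].loss_

best_seed = min(losses, key=losses.get)
for seed, loss in sorted(losses.items()):
    print("random_state=%d  final training loss=%.4f" % (seed, loss))
print("lowest training loss at random_state=%d" % best_seed)
```

A large spread between the printed losses is the symptom described above: some initializations end in a poor minimum, and rerunning with another seed (or with more hidden units) can avoid it.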
[Two attachments were scrubbed by the archive: Unknown-1.png (image/png, 10222 bytes) and Unknown-2.png (image/png, 9601 bytes).]

From joel.nothman at gmail.com  Wed Dec  7 20:56:30 2016
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 8 Dec 2016 12:56:30 +1100
Subject: [scikit-learn] Github project management tools
In-Reply-To:
References: <20161203105215.GH455403@phare.normalesup.org> <43b13054-ef73-54b3-0def-d138e814823d@gmail.com>
Message-ID:

And yet GitHub just rolled out a new "reviewers" field for assigning these things...

On 7 December 2016 at 03:26, Raghav R V wrote:

> +1 for self assigning PRs by reviewers...
>
> On Tue, Dec 6, 2016 at 4:19 PM, Andy wrote:
>
>> Thanks for your thoughts.
>> I'm working in a similar mode, though I kind of try to avoid too much
>> last-in first-out - I do it too, though,
>> because I'm trying to keep up with all notifications.
>> However, there are many older PRs and issues that are important bug-fixes
>> and they get lost because of some minor new feature being added.
>> Your point about faster communication in recent issues is taken, though.
>>
>> But I feel we should prioritize bug fixes much more - they do need more
>> brain power to review, though :-/
>
> --
> Raghav RV
> https://github.com/raghavrv
From piotr.bialecki at hotmail.de Thu Dec 8 02:56:37 2016
From: piotr.bialecki at hotmail.de (Piotr Bialecki)
Date: Thu, 8 Dec 2016 07:56:37 +0000
Subject: [scikit-learn] NuSVC and ValueError: specified nu is infeasible
In-Reply-To:
References:
Message-ID:

Hi Thomas,

the docs say that nu gives an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors.
http://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html

Therefore, it acts as a hard bound on the allowed misclassification on your dataset.

To me it seems as if the error bound is not feasible.
How well did the SVC perform? What was your training error there?

Will the NuSVC converge when you skip the sample_weights?

Greets,
Piotr

On 08.12.2016 00:07, Thomas Evangelidis wrote:

Greetings,

I want to use the Nu-Support Vector Classifier with the following input data:

X= [
array([ 3.90387012, 1.60732281, -0.33315799, 4.02770896,
        1.82337731, -0.74007214, 6.75989219, 3.68538903,
        ..................
        0. , 11.64276776, 0. , 0. ]),
array([ 3.36856769e+00, 1.48705816e+00, 4.28566992e-01,
        3.35622071e+00, 1.64046508e+00, 5.66879661e-01,
        .....................
        4.25335335e+00, 1.96508829e+00, 8.63453394e-06]),
array([ 3.74986249e+00, 1.69060713e+00, -5.09921270e-01,
        3.76320781e+00, 1.67664455e+00, -6.21126735e-01,
        ..........................
        4.16700259e+00, 1.88688784e+00, 7.34729942e-06]),
.......
]

and

Y= [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
............................
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Each array of X contains 60 numbers and the dataset consists of 48 positive and 1230 negative observations. When I train an svm.SVC() classifier I get quite good predictions, but with svm.NuSVC() I keep getting the following error no matter which value of nu in [0.1, ..., 0.9, 0.99, 0.999, 0.9999] I try:

/usr/local/lib/python2.7/dist-packages/sklearn/svm/base.pyc in fit(self, X, y, sample_weight)
    187
    188         seed = rnd.randint(np.iinfo('i').max)
--> 189         fit(X, y, sample_weight, solver_type, kernel, random_seed=seed)
    190         # see comment on the other call to np.iinfo in this file
    191

/usr/local/lib/python2.7/dist-packages/sklearn/svm/base.pyc in _dense_fit(self, X, y, sample_weight, solver_type, kernel, random_seed)
    254                 cache_size=self.cache_size, coef0=self.coef0,
    255                 gamma=self._gamma, epsilon=self.epsilon,
--> 256                 max_iter=self.max_iter, random_seed=random_seed)
    257
    258         self._warn_from_fit_status()

/usr/local/lib/python2.7/dist-packages/sklearn/svm/libsvm.so in sklearn.svm.libsvm.fit (sklearn/svm/libsvm.c:2501)()

ValueError: specified nu is infeasible

Does anyone know what might be wrong? Could it be the input data?

thanks in advance for any advice
Thomas

--
======================================================================
Thomas Evangelidis
Research Specialist
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/1S081,
62500 Brno, Czech Republic

email: tevang at pharm.uoa.gr
       tevang3 at gmail.com

website: https://sites.google.com/site/thomasevangelidishomepage/

_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn
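A note on the error in the message above. As far as I know, libsvm rejects nu before training whenever nu * (n_pos + n_neg) / 2 > min(n_pos, n_neg), and that check is where the text "specified nu is infeasible" comes from. The sketch below only does this arithmetic for the class counts quoted in the message (48 positives, 1230 negatives). The helper name max_feasible_nu is made up for illustration, and the sketch assumes a binary, unweighted problem; the libsvm copy bundled with scikit-learn may behave differently once sample or class weights are involved.

```python
def max_feasible_nu(y):
    """Largest nu that passes the libsvm-style feasibility check for binary labels y."""
    n_pos = sum(1 for label in y if label == 1)
    n_neg = len(y) - n_pos
    # Feasible iff nu * (n_pos + n_neg) / 2 <= min(n_pos, n_neg).
    return 2.0 * min(n_pos, n_neg) / len(y)


# Synthetic label vector with the class counts reported in the thread.
y = [1] * 48 + [0] * 1230
bound = max_feasible_nu(y)
print("largest feasible nu: %.4f" % bound)

# Mark a few candidate values, including some from the range that was tried.
for nu in [0.01, 0.05, 0.1, 0.5, 0.9]:
    print(nu, "feasible" if nu <= bound else "infeasible")
```

With these counts the ceiling is 96/1278, roughly 0.075, which would explain why every value from 0.1 upwards was rejected.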
From piotr.bialecki at hotmail.de Thu Dec 8 03:04:40 2016
From: piotr.bialecki at hotmail.de (Piotr Bialecki)
Date: Thu, 8 Dec 2016 08:04:40 +0000
Subject: [scikit-learn] no positive predictions by neural_network.MLPClassifier
In-Reply-To: <73C79E45-46C6-4BCA-8284-C848942706CE@gmail.com>
References: <73C79E45-46C6-4BCA-8284-C848942706CE@gmail.com>
Message-ID:

Hi Thomas,

besides the information from Sebastian, your dataset seems to be quite imbalanced (48 positive and 1230 negative observations).
You could try rebalancing your data using https://github.com/scikit-learn-contrib/imbalanced-learn
This package offers some methods for resampling your data (under-sampling the majority class, over-sampling the minority class, etc.)

Greets,
Piotr

On 08.12.2016 01:19, Sebastian Raschka wrote:

Hi, Thomas,

we had a related thread on the email list some time ago, let me post it for reference further below. Regarding your question, I think you may want to make sure that you standardized the features (which generally makes the learning less sensitive to learning rate and random weight initialization). However, even then, I would try at least 1-3 different random seeds and look at the cost vs. time; what can happen is that you land in different minima depending on the weight initialization, as demonstrated in the example below (in MLPs you have the problem of a complex cost surface).

Best,
Sebastian

The default is set to 100 units in the hidden layer, but theoretically, it should work with 2 hidden logistic units (I think that's the typical textbook/class example). I think what happens is that it gets stuck in local minima depending on the random weight initialization. E.g., the following works just fine:

    from sklearn.neural_network import MLPClassifier

    X = [[0, 0], [0, 1], [1, 0], [1, 1]]
    y = [0, 1, 1, 0]
    clf = MLPClassifier(solver='lbfgs',
                        activation='logistic',
                        alpha=0.0,
                        hidden_layer_sizes=(2,),
                        learning_rate_init=0.1,
                        max_iter=1000,
                        random_state=20)
    clf.fit(X, y)
    res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]])
    print(res)
    print(clf.loss_)

but changing the random seed to 1 leads to:

    [0 1 1 1]
    0.34660921283

For comparison, I used a more vanilla MLP (1 hidden layer with 2 units and logistic activation as well; https://github.com/rasbt/python-machine-learning-book/blob/master/code/ch12/ch12.ipynb), essentially resulting in the same problem:

[two inline images scrubbed]

On Dec 7, 2016, at 6:45 PM, Thomas Evangelidis wrote:

I tried the sklearn.neural_network.MLPClassifier with the default parameters using the input data I quoted in my previous post about Nu-Support Vector Classifier. The predictions are great but the problem is that sometimes when I rerun the MLPClassifier it predicts no positive observations (class 1). I have noticed that this can be controlled by the random_state parameter, e.g. MLPClassifier(random_state=0) always gives no positive predictions. My question is how can I choose the right random_state value in a real blind test case?
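One hedged way to approach the random_state question above: rather than trusting a single seed, train several models with different seeds and combine their predictions, so that one unlucky initialization cannot decide the outcome alone. The sketch below is plain Python and shows only the voting step. majority_vote is a hypothetical helper; the first list is the correct XOR labelling implied for random_state=20, the second is the output reported for random_state=1, and the third is invented for illustration.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists by majority; ties go to the smallest label."""
    n_samples = len(predictions_per_model[0])
    voted = []
    for i in range(n_samples):
        counts = Counter(preds[i] for preds in predictions_per_model)
        top = max(counts.values())
        voted.append(min(label for label, c in counts.items() if c == top))
    return voted


runs = [
    [0, 1, 1, 0],  # labels implied for random_state=20 in the XOR example
    [0, 1, 1, 1],  # output reported for random_state=1
    [0, 1, 1, 0],  # hypothetical third run, invented for illustration
]
combined = majority_vote(runs)
print(combined)
```

Voting over hard labels is only one option; averaging predicted probabilities across seeds, or comparing seeds on a held-out validation split, would be reasonable alternatives.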
thanks in advance
Thomas

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ATT00001.png
Type: image/png
Size: 10222 bytes
Desc: ATT00001.png
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ATT00002.png
Type: image/png
Size: 9601 bytes
Desc: ATT00002.png
URL:

From tevang3 at gmail.com Thu Dec 8 04:49:51 2016
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Thu, 8 Dec 2016 10:49:51 +0100
Subject: [scikit-learn] NuSVC and ValueError: specified nu is infeasible
In-Reply-To:
References:
Message-ID:

Hi Piotr,

the SVC performs quite well, slightly better than random forests on the same data. By training error do you mean this?

    clf = svm.SVC(probability=True)
    clf.fit(train_list_resampled3, train_activity_list_resampled3)
    print "training error=", clf.score(train_list_resampled3, train_activity_list_resampled3)

If this is what you mean by "skip the sample_weights":

    clf = svm.NuSVC(probability=True)
    clf.fit(train_list_resampled3, train_activity_list_resampled3, sample_weight=None)

then no, it does not converge. After all "sample_weight=None" is the default value.

I am out of ideas about what may be the problem.
Thomas

From michael.eickenberg at gmail.com Thu Dec 8 04:57:10 2016
From: michael.eickenberg at gmail.com (Michael Eickenberg)
Date: Thu, 8 Dec 2016 10:57:10 +0100
Subject: [scikit-learn] NuSVC and ValueError: specified nu is infeasible
In-Reply-To:
References:
Message-ID:

You have to set a bigger \nu.
Try

    nus = 2.0 ** np.arange(-1, 10)  # starting at .5 (default), going to 512
    for nu in nus:
        clf = svm.NuSVC(nu=nu)
        try:
            clf.fit ...
        except ValueError as e:
            print("nu {} not feasible".format(nu))

At some point it should start working.

Hope that helps,
Michael

From piotr.bialecki at hotmail.de Thu Dec 8 05:08:23 2016
From: piotr.bialecki at hotmail.de (Piotr Bialecki)
Date: Thu, 8 Dec 2016 10:08:23 +0000
Subject: [scikit-learn] NuSVC and ValueError: specified nu is infeasible
In-Reply-To:
References:
Message-ID:

Hi Michael, hi Thomas,

I think the nu value is bound to (0, 1].
So the code will result in a ValueError (at least in sklearn 0.18).

@Thomas
I still think the optimization problem is not feasible due to your data.
Have you tried balancing the dataset as I mentioned in your other question regarding the MLPClassifier?

Greets,
Piotr

From michael.eickenberg at gmail.com Thu Dec 8 05:18:17 2016
From: michael.eickenberg at gmail.com (Michael Eickenberg)
Date: Thu, 8 Dec 2016 11:18:17 +0100
Subject: [scikit-learn] NuSVC and ValueError: specified nu is infeasible
In-Reply-To:
References:
Message-ID:

Ah, sorry, true. It is the error fraction instead of the number of errors.

In any case, try varying this quantity.

At one point I thought that nuSVC is just the constrained optimization version of the Lagrange-style (penalized) normal SVC. That would mean that there is a correspondence between C for SVC and nu for nuSVC, leading to the conclusion that there must be nus that are feasible. So setting nu=1. should always lead to feasibility.

Now, looking at the docstring, since nu controls two quantities at the same time, I am not entirely 1000% sure of this anymore, but I think it still holds.

Michael
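Since nu has to lie in (0, 1], the search loop from earlier in the thread can be pointed at a descending grid inside that interval instead. This is a minimal sketch under stated assumptions: first_feasible_nu and fake_fit are hypothetical names, and fake_fit is a stand-in that mimics what I understand libsvm's feasibility check to be, using the thread's class counts (48 positives, 1230 negatives), so that the example runs without scikit-learn.

```python
def first_feasible_nu(fit, nus):
    """Return the first nu for which fit(nu) does not raise ValueError, else None."""
    for nu in nus:
        try:
            fit(nu)
        except ValueError:
            continue  # this nu was rejected, try the next candidate
        return nu
    return None


def fake_fit(nu, n_pos=48, n_neg=1230):
    """Stand-in for a real fit: raises like the libsvm-style check would (assumption)."""
    if not 0.0 < nu <= 1.0:
        raise ValueError("nu must be in (0, 1]")
    if nu * (n_pos + n_neg) / 2.0 > min(n_pos, n_neg):
        raise ValueError("specified nu is infeasible")


# Descending powers of two: 1.0, 0.5, 0.25, ..., 2**-10.
nus = [2.0 ** -k for k in range(0, 11)]
chosen = first_feasible_nu(fake_fit, nus)
print("first feasible nu:", chosen)
```

With a real model one would pass something like `lambda nu: svm.NuSVC(nu=nu).fit(X, y)` as `fit`. Under this check nu=1 is accepted only when both classes have the same size, so on imbalanced data the feasible values sit at the low end of the grid.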
From piotr.bialecki at hotmail.de Thu Dec 8 05:24:06 2016
From: piotr.bialecki at hotmail.de (Piotr Bialecki)
Date: Thu, 8 Dec 2016 10:24:06 +0000
Subject: [scikit-learn] NuSVC and ValueError: specified nu is infeasible
In-Reply-To:
References:
Message-ID:

I thought the same about the correspondence between SVC and nuSVC.

Any ideas why lowering the value might also help?
http://stackoverflow.com/questions/35221433/error-in-using-non-linear-svm-in-scikit-learn

He apparently used a very low value for nu (0.01) and the error vanished.

Greets,
Piotr

On 08.12.2016 11:18, Michael Eickenberg wrote:

Ah, sorry, true. It is the error fraction instead of the number of errors.

In any case, try varying this quantity. At one point I thought that nuSVC is just the constrained optimization version of the lagrange-style (penalized) normal SVC. That would mean that there is a correspondence between C for SVC and nu for nuSVC, leading to the conclusion that there must be nus that are feasible. So setting to nu=1. should always lead to feasibility. Now, looking at the docstring, since the nu controls two quantities at the same time, I am not entirely 1000% sure of this anymore, but I think it still holds.

Michael

On Thu, Dec 8, 2016 at 11:08 AM, Piotr Bialecki wrote:

Hi Michael, hi Thomas,

I think the nu value is bound to (0, 1].
So the code will result in a ValueError (at least in sklearn 0.18).

@Thomas
I still think the optimization problem is not feasible due to your data.
Have you tried balancing the dataset as I mentioned in your other question regarding the MLPClassifier?

Greets,
Piotr

On 08.12.2016 10:57, Michael Eickenberg wrote:

You have to set a bigger \nu.
Try

    nus = 2 ** np.arange(-1, 10)  # starting at .5 (default), going to 512
    for nu in nus:
        clf = svm.NuSVC(nu=nu)
        try:
            clf.fit ...
        except ValueError as e:
            print("nu {} not feasible".format(nu))

At some point it should start working.

Hope that helps,
Michael

On Thu, Dec 8, 2016 at 10:49 AM, Thomas Evangelidis wrote:

Hi Piotr,

the SVC performs quite well, slightly better than random forests on the same data. By training error do you mean this?

    clf = svm.SVC(probability=True)
    clf.fit(train_list_resampled3, train_activity_list_resampled3)
    print "training error=", clf.score(train_list_resampled3, train_activity_list_resampled3)

If this is what you mean by "skip the sample_weights":

    clf = svm.NuSVC(probability=True)
    clf.fit(train_list_resampled3, train_activity_list_resampled3, sample_weight=None)

then no, it does not converge. After all "sample_weight=None" is the default value.

I am out of ideas about what may be the problem.

Thomas

On 8 December 2016 at 08:56, Piotr Bialecki wrote:

Hi Thomas,

the doc says that nu gives an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors.
http://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html

Therefore, it acts as a hard bound on the allowed misclassification on your dataset.

To me it seems as if the error bound is not feasible.
How well did the SVC perform? What was your training error there?

Will the NuSVC converge when you skip the sample_weights?

Greets,
Piotr

On 08.12.2016 00:07, Thomas Evangelidis wrote:

Greetings,

I want to use the Nu-Support Vector Classifier with the following input data:

X= [
array([ 3.90387012, 1.60732281, -0.33315799, 4.02770896, 1.82337731, -0.74007214, 6.75989219, 3.68538903, .................. 0. , 11.64276776, 0. , 0. ]),
array([ 3.36856769e+00, 1.48705816e+00, 4.28566992e-01, 3.35622071e+00, 1.64046508e+00, 5.66879661e-01, ..................... 4.25335335e+00, 1.96508829e+00, 8.63453394e-06]),
array([ 3.74986249e+00, 1.69060713e+00, -5.09921270e-01, 3.76320781e+00, 1.67664455e+00, -6.21126735e-01, .......................... 4.16700259e+00, 1.88688784e+00, 7.34729942e-06]),
.......
]

and

Y= [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ............................ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Each array of X contains 60 numbers and the dataset consists of 48 positive and 1230 negative observations. When I train an svm.SVC() classifier I get quite good predictions, but with the svm.NuSVC() I keep getting the following error no matter which value of nu in [0.1, ..., 0.9, 0.99, 0.999, 0.9999] I try:

/usr/local/lib/python2.7/dist-packages/sklearn/svm/base.pyc in fit(self, X, y, sample_weight)
    187
    188         seed = rnd.randint(np.iinfo('i').max)
--> 189         fit(X, y, sample_weight, solver_type, kernel, random_seed=seed)
    190         # see comment on the other call to np.iinfo in this file
    191

/usr/local/lib/python2.7/dist-packages/sklearn/svm/base.pyc in _dense_fit(self, X, y, sample_weight, solver_type, kernel, random_seed)
    254                 cache_size=self.cache_size, coef0=self.coef0,
    255                 gamma=self._gamma, epsilon=self.epsilon,
--> 256                 max_iter=self.max_iter, random_seed=random_seed)
    257
    258         self._warn_from_fit_status()

/usr/local/lib/python2.7/dist-packages/sklearn/svm/libsvm.so in sklearn.svm.libsvm.fit (sklearn/svm/libsvm.c:2501)()

ValueError: specified nu is infeasible

Does anyone know what might be wrong? Could it be the input data?

thanks in advance for any advice
Thomas
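
A possible explanation for why only small values of nu are accepted, offered as a sketch rather than a definitive diagnosis: as far as I know, libsvm's parameter check rejects nu whenever nu * (n1 + n2) / 2 > min(n1, n2) for some pair of classes with n1 and n2 samples, so the largest usable value would be 2 * min(n1, n2) / (n1 + n2). The helper name max_feasible_nu is made up for illustration; the class counts are the ones quoted in the message (48 positive, 1230 negative).

```python
from collections import Counter
from itertools import combinations


def max_feasible_nu(labels):
    """Upper bound on nu implied by the class counts (hypothetical helper).

    Assumes libsvm's pairwise check: nu is rejected when
    nu * (n1 + n2) / 2 > min(n1, n2) for any two classes.
    """
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes")
    # The tightest bound over all class pairs decides feasibility.
    return min(
        2.0 * min(n1, n2) / (n1 + n2)
        for n1, n2 in combinations(counts.values(), 2)
    )


# Class counts from the post: 48 positives, 1230 negatives.
y = [1] * 48 + [0] * 1230
print(round(max_feasible_nu(y), 4))  # 96 / 1278, i.e. 0.0751
```

Under that assumption every value tried in the message (0.1 and above) exceeds the bound of roughly 0.075, while nu=0.01 falls below it. A value at or below the bound can then be passed as svm.NuSVC(nu=...).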
_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn
From tevang3 at gmail.com Thu Dec 8 06:57:30 2016
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Thu, 8 Dec 2016 12:57:30 +0100
Subject: [scikit-learn] NuSVC and ValueError: specified nu is infeasible
In-Reply-To:
References:
Message-ID:

> @Thomas
> I still think the optimization problem is not feasible due to your data.
> Have you tried balancing the dataset as I mentioned in your other question
> regarding the MLPClassifier?

Hi Piotr,

I had tried all the balancing algorithms in the link that you stated, but the only one that really offered some improvement was the SMOTE over-sampling of positive observations. The original dataset contained 24 positive and 1230 negative but after SMOTE I doubled the positive to 48. Reduction of the negative observations led to poor predictions, at least using random forests. I haven't tried it with MLPClassifier yet though.

From tevang3 at gmail.com Thu Dec 8 09:55:24 2016
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Thu, 8 Dec 2016 15:55:24 +0100
Subject: [scikit-learn] no positive predictions by neural_network.MLPClassifier
In-Reply-To: <73C79E45-46C6-4BCA-8284-C848942706CE@gmail.com>
References: <73C79E45-46C6-4BCA-8284-C848942706CE@gmail.com>
Message-ID:

Hello Sebastian,

I did normalization of my training set and used the same mean and stdev values to normalize my test set, instead of calculating means and stdev from the test set. I did that because my training set size is finite and the value of each feature is a descriptor that is characteristic of the 3D shape of the observation. The test set would definitely have different mean and stdev values from the training set, and if I had used them to normalize it then I believe I would have distorted the original descriptor values. Anyway, after this normalization I don't get 0 positive predictions anymore by the MLPClassifier.
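
The normalization described here, where the mean and standard deviation come from the training set only and are then reused for the test set, can be sketched as follows. This is a plain-Python illustration with made-up numbers and a hypothetical helper name (fit_standardizer), not the code that was actually run; in scikit-learn the same pattern is StandardScaler().fit(X_train) followed by .transform(X_test).

```python
from statistics import mean, pstdev


def fit_standardizer(train_rows):
    """Return a function that scales rows with the training mean and stdev."""
    columns = list(zip(*train_rows))
    means = [mean(col) for col in columns]
    # Population stdev; fall back to 1.0 for constant columns
    # so that we never divide by zero.
    stdevs = [pstdev(col) or 1.0 for col in columns]

    def transform(rows):
        return [
            [(value - m) / s for value, m, s in zip(row, means, stdevs)]
            for row in rows
        ]

    return transform


# Toy data, invented for illustration only.
X_train = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
X_test = [[2.0, 12.0], [4.0, 10.0]]

scale = fit_standardizer(X_train)  # statistics come from the training set only
X_train_std = scale(X_train)
X_test_std = scale(X_test)         # the test set reuses the training statistics
print(X_test_std)
```

Keeping the statistics fixed means a test observation is mapped exactly as a training observation with the same raw descriptor values would be, which matches the reasoning given in the message.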

I still don't understand your second suggestion. I cannot find any parameter to control the epoch or measure the cost in sklearn.neural_network.MLPClassifier. Do you suggest to use your own classes from github instead?
Besides that my goal is not to make one MLPClassifier using a specific training set, but rather to write a program that can take as input various training sets each time and train a neural network that will classify a given test set. Therefore, unless I didn't understand your points, working with 3 arbitrary random_state values on my current training set in order to find one value to yield good predictions, won't solve my problem.

best
Thomas

On 8 December 2016 at 01:19, Sebastian Raschka wrote:

> Hi, Thomas,
> we had a related thread on the email list some time ago, let me post it
> for reference further below. Regarding your question, I think you may want
> to make sure that you standardized the features (which generally makes the
> learning less sensitive to learning rate and random weight
> initialization). However, even then, I would try at least 1-3 different
> random seeds and look at the cost vs time; what can happen is that you
> land in different minima depending on the weight initialization as
> demonstrated in the example below (in MLPs you have the problem of a
> complex cost surface).
>
> Best,
> Sebastian
>
> The default is set to 100 units in the hidden layer, but theoretically, it
> should work with 2 hidden logistic units (I think that's the typical
> textbook/class example). I think what happens is that it gets stuck in
> local minima depending on the random weight initialization. E.g., the
> following works just fine:
>
> from sklearn.neural_network import MLPClassifier
> X = [[0, 0], [0, 1], [1, 0], [1, 1]]
> y = [0, 1, 1, 0]
> clf = MLPClassifier(solver='lbfgs',
>                     activation='logistic',
>                     alpha=0.0,
>                     hidden_layer_sizes=(2,),
>                     learning_rate_init=0.1,
>                     max_iter=1000,
>                     random_state=20)
> clf.fit(X, y)
> res = clf.predict([[0, 0], [0, 1], [1, 0], [1, 1]])
> print(res)
> print(clf.loss_)
>
> but changing the random seed to 1 leads to:
>
> [0 1 1 1]
> 0.34660921283
>
> For comparison, I used a more vanilla MLP (1 hidden layer with 2 units and
> logistic activation as well;
> https://github.com/rasbt/python-machine-learning-book/blob/master/code/ch12/ch12.ipynb),
> essentially resulting in the same problem:
>
> On Dec 7, 2016, at 6:45 PM, Thomas Evangelidis wrote:
>
> I tried the sklearn.neural_network.MLPClassifier with the default
> parameters using the input data I quoted in my previous post about
> Nu-Support Vector Classifier. The predictions are great but the problem is
> that sometimes when I rerun the MLPClassifier it predicts no positive
> observations (class 1). I have noticed that this can be controlled by the
> random_state parameter, e.g. MLPClassifier(random_state=0) gives always no
> positive predictions. My question is how can I choose the right random_state
> value in a real blind test case?
>
> thanks in advance
> Thomas

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Unknown-2.png
Type: image/png
Size: 9601 bytes
Desc: not available
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Unknown-1.png
Type: image/png
Size: 10222 bytes
Desc: not available

From se.raschka at gmail.com Thu Dec 8 10:21:42 2016
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Thu, 8 Dec 2016 10:21:42 -0500
Subject: [scikit-learn] no positive predictions by neural_network.MLPClassifier
In-Reply-To:
References: <73C79E45-46C6-4BCA-8284-C848942706CE@gmail.com>
Message-ID: <0ECE8375-4739-4A68-A2A1-EAFA6C1DB208@gmail.com>

> Besides that my goal is not to make one MLPClassifier using a specific training set, but rather to write a program that can take as input various training sets each time and train a neural network that will classify a given test set. Therefore, unless I didn't understand your points, working with 3 arbitrary random_state values on my current training set in order to find one value to yield good predictions, won't solve my problem.

Unfortunately, there's no silver bullet regarding default hyperparameter values that works across all training sets. Here, the random state may also be considered a hyperparameter: since you don't have a convex cost function in an MLP, it may or may not get stuck in different local minima depending on your random weight initialization.

> I cannot find any parameter to control the epoch

You can control the maximum number of iterations via the max_iter parameter. I don't know though whether one iteration is equal to one epoch (pass over the training set) for minibatch training in this particular implementation.

> or measure the cost in sklearn.neural_network.MLPClassifier

The cost of the last iteration is available via the loss_ attribute:

    mlp = MLPClassifier(...)
    # after training:
    mlp.loss_

From tevang3 at gmail.com Thu Dec 8 10:59:14 2016
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Thu, 8 Dec 2016 16:59:14 +0100
Subject: [scikit-learn] NuSVC and ValueError: specified nu is infeasible
In-Reply-To:
References:
Message-ID:

It finally works with nu=0.01 or less and the predictions are good. Is there a problem with that?

From t3kcit at gmail.com Thu Dec 8 11:41:41 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Thu, 8 Dec 2016 11:41:41 -0500
Subject: [scikit-learn] Github project management tools
In-Reply-To:
References: <20161203105215.GH455403@phare.normalesup.org> <43b13054-ef73-54b3-0def-d138e814823d@gmail.com>
Message-ID: <2994d7f7-e37c-46e7-d951-0e5185642c4c@gmail.com>

On 12/07/2016 08:56 PM, Joel Nothman wrote:
> And yet GitHub just rolled out a new "reviewers" field for assigning
> these things...
>
Yeah I'm not sure what the difference is from assignment.
I guess they thought of "assignment" for issues and reviewers for PRs? But assignments also exist for PRs..
There's also a new automatic list of who's active on a PR.

From mailfordebu at gmail.com Fri Dec 9 02:16:11 2016
From: mailfordebu at gmail.com (Debabrata Ghosh)
Date: Fri, 9 Dec 2016 12:46:11 +0530
Subject: [scikit-learn] Need Urgent help please in resolving JobLibMemoryError
Message-ID:

Hi All,

Greetings !

I am getting JoblibMemoryError while executing a scikit-learn RandomForestClassifier code.
Here is my algorithm in short:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.cross_validation import train_test_split
    import pandas as pd
    import numpy as np

    clf = RandomForestClassifier(n_estimators=5000, n_jobs=1000)
    clf.fit(p_input_features_train,p_input_labels_train)

The dataframe p_input_features contains 134 columns (features) and 5 million rows (observations). The exact error message is given below:

Executing Random Forest Classifier
Traceback (most recent call last):
  File "/home/user/rf_fold.py", line 43, in
    clf.fit(p_features_train,p_labels_train)
  File "/var/opt/ lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 290, in fit
    for i, t in enumerate(trees))
  File "/var/opt/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 810, in __call__
    self.retrieve()
  File "/var/opt/lib /python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 757, in retrieve
    raise exception
sklearn.externals.joblib.my_exceptions.JoblibMemoryError: JoblibMemoryError
___________________________________________________________________________
Multiprocessing exception:
...........................................................................
/var/opt/lib/python2.7/site-packages/sklearn/ensemble/forest.py in fit(self=RandomForestClassifier(bootstrap=True, class_wei...te=None, verbose=0, warm_start=False), X=array([[ 0. , 0. , 0. , .... 0. , 0. ]], dtype=float32), y=array([[ 0.], [ 0.], [ 0.], ..., [ 0.], [ 0.], [ 0.]]), sample_weight=None)
    285             trees = Parallel(n_jobs=self.n_jobs, verbose=self.verbose,
    286                              backend="threading")(
    287                 delayed(_parallel_build_trees)(
    288                     t, self, X, y, sample_weight, i, len(trees),
    289                     verbose=self.verbose, class_weight=self.class_weight)
--> 290                 for i, t in enumerate(trees))
        i = 4999
    291
    292             # Collect newly grown trees
    293             self.estimators_.extend(trees)
    294
...........................................................................

Please can you help me to identify a possible resolution to this.
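
A rough back-of-the-envelope estimate may help locate the memory pressure. The sketch below uses only the figures given in the message (5 million rows, 134 float32 features, 5000 trees). The bytes-per-node figure is a placeholder assumption, not a measured value, and the node count is a worst-case upper bound for fully grown trees, so the result is an order-of-magnitude guess at best. The function names are hypothetical.

```python
def feature_matrix_bytes(n_rows, n_cols, itemsize):
    """Size of a dense feature matrix in bytes."""
    return n_rows * n_cols * itemsize


def max_forest_nodes(n_rows, n_trees):
    """Worst-case node count: a binary tree with n leaves has 2n - 1 nodes."""
    return n_trees * (2 * n_rows - 1)


N_ROWS, N_COLS, N_TREES = 5000000, 134, 5000
ASSUMED_BYTES_PER_NODE = 64  # placeholder assumption, for scale only

# float32 features take 4 bytes each.
x_gib = feature_matrix_bytes(N_ROWS, N_COLS, 4) / 2.0 ** 30
forest_gib = max_forest_nodes(N_ROWS, N_TREES) * ASSUMED_BYTES_PER_NODE / 2.0 ** 30

print("feature matrix: %.1f GiB" % x_gib)
print("forest, worst case: %.0f GiB" % forest_gib)
```

If the estimate is anywhere near right, the forest itself rather than n_jobs dominates, so reducing n_estimators or limiting tree size with parameters such as max_depth or min_samples_leaf would be the first things to try. Since the traceback shows backend="threading", the workers share X rather than copying it.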
Thanks, Debu -------------- next part -------------- An HTML attachment was scrubbed... URL: From piotr.bialecki at hotmail.de Fri Dec 9 04:18:05 2016 From: piotr.bialecki at hotmail.de (Piotr Bialecki) Date: Fri, 9 Dec 2016 09:18:05 +0000 Subject: [scikit-learn] Need Urgent help please in resolving JobLibMemoryError In-Reply-To: References: Message-ID: Hi Debu, it seems that you run out of memory. Try using fewer processes. I don't think that n_jobs = 1000 will perform as you wish. Setting n_jobs to -1 uses the number of cores in your system. Greets, Piotr On 09.12.2016 08:16, Debabrata Ghosh wrote: Hi All, Greetings ! I am getting JoblibMemoryError while executing a scikit-learn RandomForestClassifier code. Here is my algorithm in short: from sklearn.ensemble import RandomForestClassifier from sklearn.cross_validation import train_test_split import pandas as pd import numpy as np clf = RandomForestClassifier(n_estimators=5000, n_jobs=1000) clf.fit(p_input_features_train,p_input_labels_train) The dataframe p_input_features contain 134 columns (features) and 5 million rows (observations). The exact error message is given below: Executing Random Forest Classifier Traceback (most recent call last): File "/home/user/rf_fold.py", line 43, in clf.fit(p_features_train,p_labels_train) File "/var/opt/ lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 290, in fit for i, t in enumerate(trees)) File "/var/opt/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 810, in __call__ self.retrieve() File "/var/opt/lib /python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 757, in retrieve raise exception sklearn.externals.joblib.my_exceptions.JoblibMemoryError: JoblibMemoryError ___________________________________________________________________________ Multiprocessing exception: ........................................................................... 
/var/opt/lib/python2.7/site-packages/sklearn/ensemble/forest.py in fit(self=RandomForestClassifier(bootstrap=True, class_wei...te=None, verbose=0, warm_start=False), X=array([[ 0. , 0. , 0. , .... 0. , 0. ]], dtype=float32), y=array([[ 0.], [ 0.], [ 0.], ..., [ 0.], [ 0.], [ 0.]]), sample_weight=None) 285 trees = Parallel(n_jobs=self.n_jobs, verbose=self.verbose, 286 backend="threading")( 287 delayed(_parallel_build_trees)( 288 t, self, X, y, sample_weight, i, len(trees), 289 verbose=self.verbose, class_weight=self.class_weight) --> 290 for i, t in enumerate(trees)) i = 4999 291 292 # Collect newly grown trees 293 self.estimators_.extend(trees) 294 ........................................................................... Please can you help me to identify a possible resolution to this. Thanks, Debu _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From mailfordebu at gmail.com Fri Dec 9 04:56:53 2016 From: mailfordebu at gmail.com (Debabrata Ghosh) Date: Fri, 9 Dec 2016 15:26:53 +0530 Subject: [scikit-learn] Need Urgent help please in resolving JobLibMemoryError In-Reply-To: References: Message-ID: Hi Piotr, Yes, I did use n_jobs = - 1 as well. But the code didn't run successfully. 
On my output screen , I got the following message instead of the JobLibMemoryError: 16/12/08 22:12:26 INFO YarnExtensionServices: In shutdown hook for org.apache.spark.scheduler.cluster.YarnExtensionServices$$anon1 at 176b071d 16/12/08 22:12:26 INFO YarnHistoryService: Shutting down: pushing out 0 events 16/12/08 22:12:26 INFO YarnHistoryService: Event handler thread stopping the service 16/12/08 22:12:26 INFO YarnHistoryService: Stopping dequeue service, final queue size is 0 16/12/08 22:12:26 INFO YarnHistoryService: Stopped: Service History Service in state History Service: STOPPED endpoint= http://servername.com:8188/ws/v1/timeline/ ; bonded to ATS=false; listening=true; batchSize=3; flush count=17; current queue size=0; total number queued=52, processed=50; post failures=0; 16/12/08 22:12:26 INFO SparkContext: Invoking stop() from shutdown hook 16/12/08 22:12:26 INFO YarnHistoryService: History service stopped; ignoring queued event : [1481256746854]: SparkListenerApplicationEnd( 1481256746854) Just to get you a background I am executing the scikit-learn Random Classifier using pyspark command. I am not getting what has gone wrong while using n_jobs = -1 and suddenly the program is shutting down certain services. Please can you suggest a remedy as I have been given the task to run this via pyspark itself. Thanks in advance ! Cheers, Debu On Fri, Dec 9, 2016 at 2:48 PM, Piotr Bialecki wrote: > Hi Debu, > > it seems that you run out of memory. > Try using fewer processes. > I don't think that n_jobs = 1000 will perform as you wish. > > Setting n_jobs to -1 uses the number of cores in your system. > > > Greets, > Piotr > > > On 09.12.2016 08:16, Debabrata Ghosh wrote: > > Hi All, > > Greetings ! > > > > I am getting JoblibMemoryError while executing a scikit-learn > RandomForestClassifier code. 
Here is my algorithm in short: > > > > from sklearn.ensemble import RandomForestClassifier > > from sklearn.cross_validation import train_test_split > > import pandas as pd > > import numpy as np > > clf = RandomForestClassifier(n_estimators=5000, n_jobs=1000) > > clf.fit(p_input_features_train,p_input_labels_train) > > > The dataframe p_input_features contain 134 columns (features) and 5 > million rows (observations). The exact *error message* is given below: > > > Executing Random Forest Classifier > Traceback (most recent call last): > File "/home/user/rf_fold.py", line 43, in > clf.fit(p_features_train,p_labels_train) > File "/var/opt/ lib/python2.7/site-packages/sklearn/ensemble/forest.py", > line 290, in fit > for i, t in enumerate(trees)) > File "/var/opt/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", > line 810, in __call__ > self.retrieve() > File "/var/opt/lib /python2.7/site-packages/sklearn/externals/joblib/parallel.py", > line 757, in retrieve > raise exception > sklearn.externals.joblib.my_exceptions.JoblibMemoryError: > JoblibMemoryError > ____________________________________________________________ > _______________ > Multiprocessing exception: > ............................................................ > ............... > > /var/opt/lib/python2.7/site-packages/sklearn/ensemble/forest.py in > fit(self=RandomForestClassifier(bootstrap=True, class_wei...te=None, > verbose=0, > warm_start=False), X=array([[ 0. , 0. , > 0. , .... 0. , 0. 
]], dtype=float32), > y=array([[ 0.], > [ 0.], > [ 0.], > ..., > [ 0.], > [ 0.], > [ 0.]]), sample_weight=None) > 285 trees = Parallel(n_jobs=self.n_jobs, > verbose=self.verbose, > 286 backend="threading")( > 287 delayed(_parallel_build_trees)( > 288 t, self, X, y, sample_weight, i, len(trees), > 289 verbose=self.verbose, class_weight=self.class_ > weight) > --> 290 for i, t in enumerate(trees)) > i = 4999 > 291 > 292 # Collect newly grown trees > 293 self.estimators_.extend(trees) > 294 > > ............................................................ > ............... > > > > Please can you help me to identify a possible resolution to this. > > > Thanks, > > Debu > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From piotr.bialecki at hotmail.de Fri Dec 9 05:07:06 2016 From: piotr.bialecki at hotmail.de (Piotr Bialecki) Date: Fri, 9 Dec 2016 10:07:06 +0000 Subject: [scikit-learn] Need Urgent help please in resolving JobLibMemoryError In-Reply-To: References: Message-ID: Hi Debu, I have not worked with pyspark yet and cannot resolve your error, but have you tried out sparkit-learn? https://github.com/lensacom/sparkit-learn It seems to be a package combining pyspark with sklearn and it also has a RandomForest and other classifiers: (SparkRandomForestClassifier, https://github.com/lensacom/sparkit-learn/blob/master/splearn/ensemble/__init__.py) Greets, Piotr On 09.12.2016 10:56, Debabrata Ghosh wrote: Hi Piotr, Yes, I did use n_jobs = - 1 as well. But the code didn't run successfully. 
On my output screen , I got the following message instead of the JobLibMemoryError: 16/12/08 22:12:26 INFO YarnExtensionServices: In shutdown hook for org.apache.spark.scheduler.cluster.YarnExtensionServices$$anon$1 at 176b071d 16/12/08 22:12:26 INFO YarnHistoryService: Shutting down: pushing out 0 events 16/12/08 22:12:26 INFO YarnHistoryService: Event handler thread stopping the service 16/12/08 22:12:26 INFO YarnHistoryService: Stopping dequeue service, final queue size is 0 16/12/08 22:12:26 INFO YarnHistoryService: Stopped: Service History Service in state History Service: STOPPED endpoint=http://servername.com:8188/ws/v1/timeline/; bonded to ATS=false; listening=true; batchSize=3; flush count=17; current queue size=0; total number queued=52, processed=50; post failures=0; 16/12/08 22:12:26 INFO SparkContext: Invoking stop() from shutdown hook 16/12/08 22:12:26 INFO YarnHistoryService: History service stopped; ignoring queued event : [1481256746854]: SparkListenerApplicationEnd(1481256746854) Just to get you a background I am executing the scikit-learn Random Classifier using pyspark command. I am not getting what has gone wrong while using n_jobs = -1 and suddenly the program is shutting down certain services. Please can you suggest a remedy as I have been given the task to run this via pyspark itself. Thanks in advance ! Cheers, Debu On Fri, Dec 9, 2016 at 2:48 PM, Piotr Bialecki > wrote: Hi Debu, it seems that you run out of memory. Try using fewer processes. I don't think that n_jobs = 1000 will perform as you wish. Setting n_jobs to -1 uses the number of cores in your system. Greets, Piotr On 09.12.2016 08:16, Debabrata Ghosh wrote: Hi All, Greetings ! I am getting JoblibMemoryError while executing a scikit-learn RandomForestClassifier code. 
Here is my algorithm in short: from sklearn.ensemble import RandomForestClassifier from sklearn.cross_validation import train_test_split import pandas as pd import numpy as np clf = RandomForestClassifier(n_estimators=5000, n_jobs=1000) clf.fit(p_input_features_train,p_input_labels_train) The dataframe p_input_features contain 134 columns (features) and 5 million rows (observations). The exact error message is given below: Executing Random Forest Classifier Traceback (most recent call last): File "/home/user/rf_fold.py", line 43, in clf.fit(p_features_train,p_labels_train) File "/var/opt/ lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 290, in fit for i, t in enumerate(trees)) File "/var/opt/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 810, in __call__ self.retrieve() File "/var/opt/lib /python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 757, in retrieve raise exception sklearn.externals.joblib.my_exceptions.JoblibMemoryError: JoblibMemoryError ___________________________________________________________________________ Multiprocessing exception: ........................................................................... /var/opt/lib/python2.7/site-packages/sklearn/ensemble/forest.py in fit(self=RandomForestClassifier(bootstrap=True, class_wei...te=None, verbose=0, warm_start=False), X=array([[ 0. , 0. , 0. , .... 0. , 0. ]], dtype=float32), y=array([[ 0.], [ 0.], [ 0.], ..., [ 0.], [ 0.], [ 0.]]), sample_weight=None) 285 trees = Parallel(n_jobs=self.n_jobs, verbose=self.verbose, 286 backend="threading")( 287 delayed(_parallel_build_trees)( 288 t, self, X, y, sample_weight, i, len(trees), 289 verbose=self.verbose, class_weight=self.class_weight) --> 290 for i, t in enumerate(trees)) i = 4999 291 292 # Collect newly grown trees 293 self.estimators_.extend(trees) 294 ........................................................................... Please can you help me to identify a possible resolution to this. 

Thanks,
Debu

_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From mailfordebu at gmail.com Fri Dec 9 06:03:30 2016
From: mailfordebu at gmail.com (Debabrata Ghosh)
Date: Fri, 9 Dec 2016 16:33:30 +0530
Subject: [scikit-learn] Need Urgent help please in resolving JobLibMemoryError
In-Reply-To:
References:
Message-ID:

Thanks Piotr for your feedback!

I did look into sparkit-learn yesterday but couldn't confirm that it contains a RandomForestClassifier. I would need to ask my customer to download it for me, as I don't have permission for that. Could you please help me confirm whether sparkit-learn has the following (corresponding to scikit-learn):

1. sklearn.ensemble -> RandomForestClassifier
2. sklearn.cross_validation -> StratifiedKFold
3. sklearn.cross_validation -> train_test_split

Is there a URL for sparkit-learn, similar to the one for scikit-learn, where all the methods are listed?

I have figured out that sparkit-learn needs to be downloaded from https://pypi.python.org/pypi/sparkit-learn, but apart from that, does anything else need to be downloaded? I just wanted to check once before asking my customer, as otherwise it would be a bit embarrassing.

Thanks again!

Cheers,
Debu

On Fri, Dec 9, 2016 at 3:37 PM, Piotr Bialecki wrote:
> Hi Debu,
>
> I have not worked with pyspark yet and cannot resolve your error,
> but have you tried out sparkit-learn?
> https://github.com/lensacom/sparkit-learn > > It seems to be a package combining pyspark with sklearn and it also has a > RandomForest and other classifiers: > (SparkRandomForestClassifier, https://github.com/lensacom/ > sparkit-learn/blob/master/splearn/ensemble/__init__.py) > > > Greets, > Piotr > > On 09.12.2016 10:56, Debabrata Ghosh wrote: > > Hi Piotr, > Yes, I did use n_jobs = - 1 as well. But the code > didn't run successfully. On my output screen , I got the following message > instead of the JobLibMemoryError: > > 16/12/08 22:12:26 INFO YarnExtensionServices: In shutdown hook for > org.apache.spark.scheduler.cluster.YarnExtensionServices$$anon$1 at 176b071d > 16/12/08 22:12:26 INFO YarnHistoryService: Shutting down: pushing out 0 > events > 16/12/08 22:12:26 INFO YarnHistoryService: Event handler thread stopping > the service > 16/12/08 22:12:26 INFO YarnHistoryService: Stopping dequeue service, final > queue size is 0 > 16/12/08 22:12:26 INFO YarnHistoryService: Stopped: Service History > Service in state History Service: STOPPED endpoint= > > http://servername.com:8188/ws/v1/timeline/ > ; bonded to > ATS=false; listening=true; batchSize=3; flush count=17; current queue > size=0; total number queued=52, processed=50; post failures=0; > 16/12/08 22:12:26 INFO SparkContext: Invoking stop() from shutdown hook > 16/12/08 22:12:26 INFO YarnHistoryService: History service stopped; > ignoring queued event : [1481256746854]: SparkListenerApplicationEnd(14 > 81256746854) > > Just to get you a background I am executing the > scikit-learn Random Classifier using pyspark command. I am not getting what > has gone wrong while using n_jobs = -1 and suddenly the program is shutting > down certain services. Please can you suggest a remedy as I have been given > the task to run this via pyspark itself. > > Thanks in advance ! > > Cheers, > > Debu > > On Fri, Dec 9, 2016 at 2:48 PM, Piotr Bialecki > wrote: > >> Hi Debu, >> >> it seems that you run out of memory. 
>> Try using fewer processes. >> I don't think that n_jobs = 1000 will perform as you wish. >> >> Setting n_jobs to -1 uses the number of cores in your system. >> >> >> Greets, >> Piotr >> >> >> On 09.12.2016 08:16, Debabrata Ghosh wrote: >> >> Hi All, >> >> Greetings ! >> >> >> >> I am getting JoblibMemoryError while executing a scikit-learn >> RandomForestClassifier code. Here is my algorithm in short: >> >> >> >> from sklearn.ensemble import RandomForestClassifier >> >> from sklearn.cross_validation import train_test_split >> >> import pandas as pd >> >> import numpy as np >> >> clf = RandomForestClassifier(n_estimators=5000, n_jobs=1000) >> >> clf.fit(p_input_features_train,p_input_labels_train) >> >> >> The dataframe p_input_features contain 134 columns (features) and 5 >> million rows (observations). The exact *error message* is given below: >> >> >> Executing Random Forest Classifier >> Traceback (most recent call last): >> File "/home/user/rf_fold.py", line 43, in >> clf.fit(p_features_train,p_labels_train) >> File "/var/opt/ lib/python2.7/site-packages/sklearn/ensemble/forest.py", >> line 290, in fit >> for i, t in enumerate(trees)) >> File "/var/opt/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", >> line 810, in __call__ >> self.retrieve() >> File "/var/opt/lib /python2.7/site-packages/sklea >> rn/externals/joblib/parallel.py", line 757, in retrieve >> raise exception >> sklearn.externals.joblib.my_exceptions.JoblibMemoryError: >> JoblibMemoryError >> ____________________________________________________________ >> _______________ >> Multiprocessing exception: >> ............................................................ >> ............... >> >> /var/opt/lib/python2.7/site-packages/sklearn/ensemble/forest.py in >> fit(self=RandomForestClassifier(bootstrap=True, class_wei...te=None, >> verbose=0, >> warm_start=False), X=array([[ 0. , 0. , >> 0. , .... 0. , 0. 
]], dtype=float32),
>> y=array([[ 0.],
>> [ 0.],
>> [ 0.],
>> ...,
>> [ 0.],
>> [ 0.],
>> [ 0.]]), sample_weight=None)
>>     285             trees = Parallel(n_jobs=self.n_jobs, verbose=self.verbose,
>>     286                              backend="threading")(
>>     287                 delayed(_parallel_build_trees)(
>>     288                     t, self, X, y, sample_weight, i, len(trees),
>>     289                     verbose=self.verbose, class_weight=self.class_weight)
>> --> 290                 for i, t in enumerate(trees))
>>         i = 4999
>>     291
>>     292             # Collect newly grown trees
>>     293             self.estimators_.extend(trees)
>>     294
>>
>> ...........................................................................
>>
>> Please can you help me to identify a possible resolution to this.
>>
>> Thanks,
>>
>> Debu
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From fabizs at yahoo.com Fri Dec 9 09:30:24 2016
From: fabizs at yahoo.com (Fabio Santos)
Date: Fri, 9 Dec 2016 14:30:24 +0000 (UTC)
Subject: [scikit-learn] Clustering information from one file
References: <1340756476.31830.1481293824712.ref@mail.yahoo.com>
Message-ID: <1340756476.31830.1481293824712@mail.yahoo.com>

Hi all,

My name is Fábio and I'm new to scikit, and I am trying to cluster information from one file with a Python script (which I found on the web).
But I saw that the output had a problem with numbers. See:

Script:

    import click
    import re
    import numpy
    import random

    from collections import defaultdict

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans


    @click.command()
    @click.argument('filename')
    @click.option('--clusters', default=50, help='Number of clusters')
    @click.option('--sample', default=400, help='Number of samples to print')
    def cluster_lines(filename, clusters, sample):
        lines = numpy.array(list(_get_lines(filename)))

        doc_feat = TfidfVectorizer().fit_transform(lines)
        km = KMeans(clusters).fit(doc_feat)

        # Group the input lines by the cluster label assigned to each one.
        k = 0
        clusters = defaultdict(list)
        for i in km.labels_:
            clusters[i].append(lines[k])
            k += 1

        # Largest clusters first.
        s_clusters = sorted(clusters.values(), key=lambda l: -len(l))

        for cluster in s_clusters:
            print('Cluster [%s]:' % len(cluster))
            if len(cluster) > sample:
                cluster = random.sample(cluster, sample)
            for line in cluster:
                print(line)
            print('--------')


    def _clean_line(line):
        line = line.strip().lower()
        # Replace every run of digits with the placeholder "(N)".
        line = re.sub(r'\d+', '(N)', line)
        return line


    def _get_lines(filename):
        for line in open(filename).readlines():
            yield _clean_line(line)


    if __name__ == '__main__':
        cluster_lines()

Output:

    [root@vmcaiosyscolprod01 71001492]# python Cluster-LearnMachine.py DataSets/ospf.teste3
    Cluster [7]:
    "rjbotaa max-metric router-lsa on-startup ispf log-adjacency-changes detail auto-cost reference-bandwidth timers throttle spf timers throttle lsa timers lsa arrival timers pacing flood passive-interface default maximum-paths mpls ldp sync mpls traffic-eng router-id loopback mpls traffic-eng area"
    "rjmteab max-metric router-lsa on-startup ispf log-adjacency-changes detail auto-cost reference-bandwidth timers throttle spf timers throttle lsa timers lsa arrival timers pacing flood passive-interface default maximum-paths mpls ldp sync mpls traffic-eng router-id loopback mpls traffic-eng area"
    "rjmckaa max-metric router-lsa on-startup ispf log-adjacency-changes detail auto-cost reference-bandwidth timers throttle spf timers throttle lsa timers lsa arrival timers pacing flood passive-interface default maximum-paths mpls ldp sync mpls traffic-eng router-id loopback mpls traffic-eng area"
    "rjdqcaa max-metric router-lsa on-startup ispf log-adjacency-changes detail auto-cost reference-bandwidth timers throttle spf timers throttle lsa timers lsa arrival timers pacing flood passive-interface default maximum-paths mpls ldp sync mpls traffic-eng router-id loopback mpls traffic-eng area"
    "rjdqcab max-metric router-lsa on-startup ispf log-adjacency-changes detail auto-cost reference-bandwidth timers throttle spf timers throttle lsa timers lsa arrival timers pacing flood passive-interface default maximum-paths mpls ldp sync mpls traffic-eng router-id loopback mpls traffic-eng area"
    "rjcenaa max-metric router-lsa on-startup ispf log-adjacency-changes detail auto-cost reference-bandwidth timers throttle spf timers throttle lsa timers lsa arrival timers pacing flood passive-interface default maximum-paths mpls ldp sync mpls traffic-eng router-id loopback mpls traffic-eng area"
    "rjcenab max-metric router-lsa on-startup ispf log-adjacency-changes detail auto-cost reference-bandwidth timers throttle spf timers throttle lsa timers lsa arrival timers pacing flood passive-interface default maximum-paths mpls ldp sync mpls traffic-eng router-id loopback mpls traffic-eng area"
    --------
    Cluster [1]:
    "rjbotab max-metric router-lsa on-startup log-adjacency-changes detail auto-cost reference-bandwidth timers throttle spf timers throttle lsa timers lsa arrival timers pacing flood maximum-paths mpls ldp sync mpls traffic-eng router-id loopback mpls traffic-eng area"
    --------
    Cluster [1]:
    "rjmteaa ispf log-adjacency-changes detail auto-cost reference-bandwidth timers throttle spf timers throttle lsa timers lsa arrival timers pacing flood passive-interface default maximum-paths mpls ldp sync mpls traffic-eng router-id loopback mpls traffic-eng area"
    --------
    Cluster [1]:
    "rjmckab max-metric router-lsa on-startup ispf log-adjacency-changes detail auto-cost reference-bandwidth timers throttle spf timers throttle lsa timers lsa arrival timers pacing flood passive-interface default maximum-paths mpls ldp sync mpls traffic-eng router-id loopback mpls traffic-eng area"
    --------

Note that the output shows (N) in place of numbers, and I have not found a way to use the biggest cluster as a template to find the differences between it and the other clusters. How can I do that?

Thanks

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From mailfordebu at gmail.com Fri Dec 9 13:00:37 2016
From: mailfordebu at gmail.com (Debabrata Ghosh)
Date: Fri, 9 Dec 2016 23:30:37 +0530
Subject: [scikit-learn] ensemble method within splearn
Message-ID:

Hi,

I have downloaded sparkit-learn from https://pypi.python.org/pypi/sparkit-learn but it doesn't have the ensemble method.

Please can you suggest a solution for this? It's urgent, please.

Thanks,
Debu

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From nelle.varoquaux at gmail.com Fri Dec 9 14:44:29 2016
From: nelle.varoquaux at gmail.com (Nelle Varoquaux)
Date: Fri, 9 Dec 2016 11:44:29 -0800
Subject: [scikit-learn] ensemble method within splearn
In-Reply-To:
References:
Message-ID:

Hello,

This mailing list is dedicated to scikit-learn. For sparkit-learn information, I suggest you contact the developers directly, maybe by opening a ticket on their GitHub project page: https://github.com/lensacom/sparkit-learn

Thanks,
Nelle

On 9 December 2016 at 10:00, Debabrata Ghosh wrote:
> Hi,
> I have downloaded sparkit-learn from
> https://pypi.python.org/pypi/sparkit-learn but it doesn't have the ensemble
> method.
>
> Please can you suggest the solution for the same. It's urgent
> please.
>
> Thanks,
>
> Debu
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>

From mamunbabu2001 at gmail.com Mon Dec 12 07:56:15 2016
From: mamunbabu2001 at gmail.com (Mamun Rashid)
Date: Mon, 12 Dec 2016 12:56:15 +0000
Subject: [scikit-learn] SMOTE-ENN in Imbalanced-learn package
Message-ID:

Hi All,

Not sure if questions regarding the contributory packages are answered here. Just trying my luck.

I have a seriously imbalanced classification problem. I am trying to use the SMOTE+ENN oversampling and undersampling method to oversample my minority class and undersample my majority class.

========

    import pandas as pd
    from sklearn.datasets import make_classification
    from imblearn.combine import SMOTEENN

    sm = SMOTEENN()
    X, y = make_classification(n_classes=2, class_sep=2, weights=[0.2, 0.8],
                               n_informative=1, n_redundant=1, flip_y=0,
                               n_features=3, n_clusters_per_class=1,
                               n_samples=50, random_state=10)
    X_df = pd.DataFrame(X)
    X_resampled, y_resampled = sm.fit_sample(X_df, y)

=========

I understand that SMOTE returns a resampled data matrix i.e. X_resampled.
I was wondering if there is a direct way to retrieve the indices of the original data observations?

Thanks in advance.

Best Regards and Seasons Greetings,
Mamun

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From g.lemaitre58 at gmail.com Mon Dec 12 08:05:31 2016
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Mon, 12 Dec 2016 14:05:31 +0100
Subject: [scikit-learn] SMOTE-ENN in Imbalanced-learn package
In-Reply-To:
References:
Message-ID:

Hi Mamun,

The new samples generated through SMOTE are synthetically created. You can refer to the paper by Chawla for more information. Therefore, there are no indices linking them to the original data. However, when under-sampling you can get this information by setting return_indices=True.

Cheers,

On 12 December 2016 at 13:56, Mamun Rashid wrote:
> Hi All,
> Not sure if questions regarding the contributory packages are answered
> here. Just trying my luck.
>
> I am have a seriously imbalanced classification problem. I am trying to
> use SMOTE+ENN oversampling and undersampling method to oversample my
> minority class and oversample my majority class.
>
> ========
>
> from sklearn.datasets import make_classification
> from imblearn.combine import SMOTEENN
>
> sm = SMOTEENN()
> X, y = make_classification(n_classes=2, class_sep=2, weights=[0.2, 0.8],
> n_informative=1, n_redundant=1, flip_y=0, n_features=3,
> n_clusters_per_class=1, n_samples=50, random_state=10)
> X_df = pd.DataFrame(X)
> X_resampled, y_resampled = sm.fit_sample(X_df, y)
>
> =========
>
> I understand that SMOTE returns a resampled data matrix i.e. X_resampled. I
> was wondering if there is a direct way to retrieve the indexes of the
> original data observations ?
>
> Thanks in advance.
>
> Best Regards and Seasons Greetings.,
> Mamun
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>

--
Guillaume Lemaitre
INRIA Saclay - Ile-de-France
Equipe PARIETAL
guillaume.lemaitre at inria.fr
---
https://glemaitre.github.io/

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From t3kcit at gmail.com Tue Dec 13 10:29:24 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Tue, 13 Dec 2016 10:29:24 -0500
Subject: [scikit-learn] [Scikit-learn-general] Getting involved(beginner)
In-Reply-To:
References:
Message-ID: <4be9e55d-53ec-b256-2030-e0423f6edb60@gmail.com>

Hi.

Sorry, IRC is pretty dead. You can try gitter, though it's also not super busy. Instructions for contributing are here:
http://scikit-learn.org/dev/developers/contributing.html

Andy

On 12/13/2016 05:57 AM, piyush goel wrote:
> Dear developer,
>
> I am Piyush an Electrical and Electronics Engineering student from
> DTU,India.
> I'm new to open source and the Python foundation as well, but I want
> to get involved. I tried contacting the org through irc but wasn't
> able to talk to anyone, so I'm writing this mail.
> Please guide me through the steps of contributing to sci kit. I don't
> have much experience in machine learning but I wish to learn while
> contributing to the project, please tell me if it's a good idea. If
> possible can you assign me some simple issues.
>
> Thanks.
>
>
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, SlashDot.org!
> http://sdm.link/slashdot
>
>
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From fla168 at 163.com Tue Dec 13 12:48:37 2016
From: fla168 at 163.com (fulean)
Date: Wed, 14 Dec 2016 01:48:37 +0800 (CST)
Subject: [scikit-learn] svm low-api gives bad prediction results
Message-ID: <1d02610e.1.158f94ce94c.Coremail.fla168@163.com>

Hi all,

I want to export the SVM parameters and apply them in a C++ SVM implementation at https://github.com/yctung/AndroidLibSvm. After a grid search, SVC with C=1.0, gamma=10.0 gets 92% accuracy. Unfortunately, the SVC model doesn't contain the parameters needed by the C++ model, which takes the low-level SVM parameters as input, i.e. sv_coef, probA, ... The return values of the low-level SVM API, 'libsvm.fit', match the requirement, but the prediction result is different from that of the SVC model.

The code:

    model = libsvm.fit(X_data.astype(np.float64), Y_data.astype(np.float64),
                       svm_type=0, kernel='rbf', C=1.0, gamma=10.0)
    pred = libsvm.predict(X_data.astype(np.float64), *model, kernel='rbf')
    print("hello mean " + str(np.mean(pred == Y_data)))
    # result "hello mean 0.570588235294"

Can somebody give some suggestions on this problem? Thanks!

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From stuart at stuartreynolds.net Tue Dec 13 13:27:05 2016
From: stuart at stuartreynolds.net (Stuart Reynolds)
Date: Tue, 13 Dec 2016 10:27:05 -0800
Subject: [scikit-learn] Model checksums
Message-ID:

I'd like to cache some functions to avoid rebuilding models like so:

    @cached
    def train(model, dataparams): ...

model is an (untrained) scikit-learn object and dataparams is a dict.
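
One way a decorator like `@cached` could derive its key without going through pickle is to hash a canonical text form of the parameters. The sketch below is only an illustration under stated assumptions: `params_checksum`, `cached`, and `TinyModel` are hypothetical names, the model is assumed to expose a `get_params()` method as scikit-learn estimators do, and every parameter value is assumed to be JSON-serialisable or to have a `repr` that is stable between runs (an object whose `repr` embeds a memory address would break this).

```python
import functools
import hashlib
import json


def params_checksum(model, dataparams):
    """SHA-256 hex digest of a model's class, its parameters, and a dict."""
    payload = {
        "class": type(model).__module__ + "." + type(model).__name__,
        "params": model.get_params(),
        "data": dataparams,
    }
    # sort_keys fixes the key order; default=repr handles non-JSON values.
    text = json.dumps(payload, sort_keys=True, default=repr)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def cached(func):
    """Memoise func(model, dataparams) in memory, keyed by the checksum."""
    results = {}

    @functools.wraps(func)
    def wrapper(model, dataparams):
        key = params_checksum(model, dataparams)
        if key not in results:
            results[key] = func(model, dataparams)
        return results[key]

    return wrapper


class TinyModel:
    """Hypothetical stand-in for an untrained estimator."""

    def __init__(self, C=1.0, kernel="rbf"):
        self.C = C
        self.kernel = kernel

    def get_params(self):
        return {"C": self.C, "kernel": self.kernel}


calls = []


@cached
def train(model, dataparams):
    calls.append(1)  # counts how often the expensive work really runs
    return "trained with C=%s" % model.C


train(TinyModel(C=1.0), {"rows": 100})
train(TinyModel(C=1.0), {"rows": 100})  # same parameters: served from cache
train(TinyModel(C=2.0), {"rows": 100})  # different C: recomputed
print(len(calls))  # 2
```

Hashing the parameters rather than the serialized object sidesteps any byte-level variation in the serialization, at the cost of ignoring state that `get_params()` does not report; for an untrained estimator that trade-off seems reasonable, but it is a design assumption rather than something settled in this thread.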
The @cached annotation forms a SHA checksum out of the parameters of the function it annotates and returns the previously calculated function result if the parameters match.

The tricky part here is reliably generating a checksum from the parameters. Scikit uses Python's pickle (http://scikit-learn.org/stable/modules/model_persistence.html) but the pickle library is non-deterministic (the same inputs to pickle.dumps yield differing output! -- *I know*).

So... any suggestions on how to generate checksums from models in Python?

Thanks.
- Stuart

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From g.lemaitre58 at gmail.com Tue Dec 13 13:34:52 2016
From: g.lemaitre58 at gmail.com (Guillaume Lemaitre)
Date: Tue, 13 Dec 2016 19:34:52 +0100
Subject: [scikit-learn] Model checksums
In-Reply-To:
References:
Message-ID: <20161213183452.4894801.68992.23630@gmail.com>

An HTML attachment was scrubbed...
URL:
From gael.varoquaux at normalesup.org Tue Dec 13 15:10:47 2016
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Tue, 13 Dec 2016 21:10:47 +0100
Subject: [scikit-learn] Model checksums
In-Reply-To:
References:
Message-ID:

What do you mean by non-deterministic? If you set the random_state of models, we try to make them deterministic. Most often, any residual variability is numerical noise that reveals statistical error bars.

G

Sent from my phone. Please forgive brevity and misspelling.

On Dec 13, 2016, 19:29, at 19:29, Stuart Reynolds wrote:
>I'd like to cache some functions to avoid rebuilding models like so:
>
> @cached
> def train(model, dataparams): ...
>
>
>model is an (untrained) scikit-learn object and dataparams is a dict.
>The @cached annotation forms a SHA checksum out of the parameters of
>the
>function it annotates and returns the previously calculated function
>result
>if the parameters match.
>
>The tricky part here is reliably generating a checksum from the
>parameters.
>Scikit uses Python's pickle (
>http://scikit-learn.org/stable/modules/model_persistence.html) but the
>pickle library is non-deterministic (same inputs to pickle.dumps yields
>differing output! -- *I know*).
>
>So... any suggestions on how to generate checksums from models in
>python?
>
>Thanks.
>- Stuart
>
>
>------------------------------------------------------------------------
>
>_______________________________________________
>scikit-learn mailing list
>scikit-learn at python.org
>https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From graham.arthur.mackenzie at gmail.com Tue Dec 13 15:14:43 2016
From: graham.arthur.mackenzie at gmail.com (Graham Arthur Mackenzie)
Date: Tue, 13 Dec 2016 12:14:43 -0800
Subject: [scikit-learn] Why do DTs have a different fit protocol than NB and SVMs?
Message-ID:

Hello All,

I hope this is the right way to ask a question about documentation.

In the doc for Decision Trees, the fit statement is assigned back to the classifier:

    clf = clf.fit(X, Y)

Whereas, for Naive Bayes and Support Vector Machines, it's just:

    clf.fit(X, Y)

I assumed this was a typo, but thought I should try and verify such before proceeding under that assumption. I appreciate any feedback you can provide.

Thank You and Be Well,
Graham

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From jmschreiber91 at gmail.com Tue Dec 13 15:23:00 2016
From: jmschreiber91 at gmail.com (Jacob Schreiber)
Date: Tue, 13 Dec 2016 12:23:00 -0800
Subject: [scikit-learn] Why do DTs have a different fit protocol than NB and SVMs?
In-Reply-To:
References:
Message-ID:

The fit method returns the object itself, so regardless of which way you do it, it will work.
The reason the fit method returns itself is so that you can chain methods, like "preds = clf.fit(X, y).predict(X)" On Tue, Dec 13, 2016 at 12:14 PM, Graham Arthur Mackenzie < graham.arthur.mackenzie at gmail.com> wrote: > Hello All, > > I hope this is the right way to ask a question about documentation. > > In the doc for Decision Trees > , the fit > statement is assigned back to the classifier: > > clf = clf.fit(X, Y) > > Whereas, for Naive Bayes > > and Support Vector Machines > , > it's just: > > clf.fit(X, Y) > > I assumed this was a typo, but thought I should try and verify such before > proceeding under that assumption. I appreciate any feedback you can provide. > > Thank You and Be Well, > Graham > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stuart at stuartreynolds.net Tue Dec 13 15:33:48 2016 From: stuart at stuartreynolds.net (Stuart Reynolds) Date: Tue, 13 Dec 2016 12:33:48 -0800 Subject: [scikit-learn] Why do DTs have a different fit protocol than NB and SVMs? In-Reply-To: References: Message-ID: I think he's asking whether returning the model is part of the API (i.e. is it a bug that SVM and NB don't return self?). On Tue, Dec 13, 2016 at 12:23 PM, Jacob Schreiber wrote: > The fit method returns the object itself, so regardless of which way you > do it, it will work. The reason the fit method returns itself is so that > you can chain methods, like "preds = clf.fit(X, y).predict(X)" > > On Tue, Dec 13, 2016 at 12:14 PM, Graham Arthur Mackenzie < > graham.arthur.mackenzie at gmail.com> wrote: > >> Hello All, >> >> I hope this is the right way to ask a question about documentation. 
>> >> In the doc for Decision Trees >> , the fit >> statement is assigned back to the classifier: >> >> clf = clf.fit(X, Y) >> >> Whereas, for Naive Bayes >> >> and Support Vector Machines >> , >> it's just: >> >> clf.fit(X, Y) >> >> I assumed this was a typo, but thought I should try and verify such >> before proceeding under that assumption. I appreciate any feedback you can >> provide. >> >> Thank You and Be Well, >> Graham >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From zephyr14 at gmail.com Tue Dec 13 15:38:42 2016 From: zephyr14 at gmail.com (Vlad Niculae) Date: Tue, 13 Dec 2016 15:38:42 -0500 Subject: [scikit-learn] Why do DTs have a different fit protocol than NB and SVMs? In-Reply-To: References: Message-ID: <94FAD0C4-CE95-4A1E-9079-418962D2B9C0@gmail.com> It is part of the API and enforced with tests, if I'm not mistaken. So you could use either form with all sklearn estimators. Vlad On December 13, 2016 3:33:48 PM EST, Stuart Reynolds wrote: >I think he's asking whether returning the model is part of the API >(i.e. is >it a bug that SVM and NB don't return self?). > >On Tue, Dec 13, 2016 at 12:23 PM, Jacob Schreiber > >wrote: > >> The fit method returns the object itself, so regardless of which way >you >> do it, it will work. The reason the fit method returns itself is so >that >> you can chain methods, like "preds = clf.fit(X, y).predict(X)" >> >> On Tue, Dec 13, 2016 at 12:14 PM, Graham Arthur Mackenzie < >> graham.arthur.mackenzie at gmail.com> wrote: >> >>> Hello All, >>> >>> I hope this is the right way to ask a question about documentation. 
>>> >>> In the doc for Decision Trees >>> , the fit >>> statement is assigned back to the classifier: >>> >>> clf = clf.fit(X, Y) >>> >>> Whereas, for Naive Bayes >>> > >>> and Support Vector Machines >>> >, >>> it's just: >>> >>> clf.fit(X, Y) >>> >>> I assumed this was a typo, but thought I should try and verify such >>> before proceeding under that assumption. I appreciate any feedback >you can >>> provide. >>> >>> Thank You and Be Well, >>> Graham >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > >------------------------------------------------------------------------ > >_______________________________________________ >scikit-learn mailing list >scikit-learn at python.org >https://mail.python.org/mailman/listinfo/scikit-learn -- Sent from my Android device with K-9 Mail. Please excuse my brevity. -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Tue Dec 13 15:45:02 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 13 Dec 2016 15:45:02 -0500 Subject: [scikit-learn] Why do DTs have a different fit protocol than NB and SVMs? In-Reply-To: <94FAD0C4-CE95-4A1E-9079-418962D2B9C0@gmail.com> References: <94FAD0C4-CE95-4A1E-9079-418962D2B9C0@gmail.com> Message-ID: <084b7a4e-1361-dcc7-04a9-c67e1b3ca611@gmail.com> On 12/13/2016 03:38 PM, Vlad Niculae wrote: > It is part of the API and enforced with tests, if I'm not mistaken. So > you could use either form with all sklearn estimators. It is indeed enforced. Though I feel clf = clf.fit(X, y) is somewhat ugly and I would rather not have it in the docs. 
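
The convention discussed in this thread, `fit` returning the estimator itself, can be illustrated with a toy class. `MajorityClassifier` below is a hypothetical example, not a scikit-learn estimator; it only shows why `clf.fit(X, y)` and `clf = clf.fit(X, y)` leave `clf` bound to the same object, and how the return value makes chaining possible.

```python
from collections import Counter


class MajorityClassifier:
    """Toy estimator that always predicts the most frequent training label."""

    def fit(self, X, y):
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self  # returning self is what makes chaining possible

    def predict(self, X):
        return [self.majority_ for _ in X]


X = [[0.1], [0.4], [0.9]]
y = [1, 1, 0]

clf = MajorityClassifier()
same = clf.fit(X, y)          # the return value can be ignored ...
print(same is clf)            # True: ... because it is the object itself

preds = MajorityClassifier().fit(X, y).predict([[0.2], [0.7]])
print(preds)                  # [1, 1]
```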
Also, this example uses a capital Y, so two reasons to change it ;)

From zephyr14 at gmail.com Tue Dec 13 16:11:01 2016
From: zephyr14 at gmail.com (Vlad Niculae)
Date: Tue, 13 Dec 2016 16:11:01 -0500
Subject: [scikit-learn] Why do DTs have a different fit protocol than NB and SVMs?
In-Reply-To: <084b7a4e-1361-dcc7-04a9-c67e1b3ca611@gmail.com>
References: <94FAD0C4-CE95-4A1E-9079-418962D2B9C0@gmail.com> <084b7a4e-1361-dcc7-04a9-c67e1b3ca611@gmail.com>
Message-ID:

I agree; if you're not actually doing daisy-chaining, the stateful and more concise form clf.fit(X, y) looks more pythonic in my opinion.

Also it seems that the "fit returns self" convention is not documented here [1]; maybe we should briefly mention it?

[1] http://scikit-learn.org/stable/tutorial/basic/tutorial.html

On Tue, Dec 13, 2016 at 3:45 PM, Andreas Mueller wrote:
>
>
> On 12/13/2016 03:38 PM, Vlad Niculae wrote:
>>
>> It is part of the API and enforced with tests, if I'm not mistaken. So you
>> could use either form with all sklearn estimators.
>
>
> It is indeed enforced.
> Though I feel clf = clf.fit(X, y)
> is somewhat ugly and I would rather not have it in the docs.
> Alsok this example uses a capital Y,so two reasons to change it ;)
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From graham.arthur.mackenzie at gmail.com Tue Dec 13 17:02:29 2016
From: graham.arthur.mackenzie at gmail.com (Graham Arthur Mackenzie)
Date: Tue, 13 Dec 2016 14:02:29 -0800
Subject: [scikit-learn] scikit-learn Digest, Vol 9, Issue 42
In-Reply-To:
References:
Message-ID:

Thanks for the speedy and helpful responses!

Actually, the thrust of my question was, "I'm assuming the fit() method for all three modules works the same way, so how come the example code for DTs differs from NB, SVMs?"
Since you seem to be saying that it'll work either way, I'm assuming there's no real reason behind it, which was my suspicion, but just wanted to have it confirmed, as the inconsistency was conspicuous.

Thanks!
GAM

ps, My apologies if this is the improper way to respond to responses. I am receiving the Digest rather than individual messages, so this was the best I could think to do...

-------------- next part -------------- An HTML attachment was scrubbed...
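To illustrate the point settled in this thread, here is a minimal sketch of the "fit returns self" convention. MajorityClassifier is a toy class invented for this example (it is not a scikit-learn estimator), so the snippet only mimics the behaviour described above: because fit() returns the object itself, the plain call, the reassignment, and the chained call all end up with the same fitted model.

```python
class MajorityClassifier:
    """Toy estimator (hypothetical, not part of scikit-learn).

    It always predicts the most common training label.
    """

    def fit(self, X, y):
        counts = {}
        for label in y:
            counts[label] = counts.get(label, 0) + 1
        self.majority_ = max(counts, key=counts.get)
        # Returning self is what makes all three call styles below equivalent.
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]


X = [[0.1], [0.4], [0.8], [0.9]]
y = [0, 1, 1, 1]

# Style 1: call fit for its side effect only.
clf_a = MajorityClassifier()
clf_a.fit(X, y)

# Style 2: assign the return value back; it is the very same object.
clf_b = MajorityClassifier()
returned = clf_b.fit(X, y)
print(returned is clf_b)

# Style 3: chain fit and predict in one expression.
preds = MajorityClassifier().fit(X, y).predict(X)
print(preds)
```

Which style to show in documentation is then purely a matter of taste, which matches the conclusion reached in the replies.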
From mailfordebu at gmail.com Wed Dec 14 03:13:29 2016 From: mailfordebu at gmail.com (Debabrata Ghosh) Date: Wed, 14 Dec 2016 13:43:29 +0530 Subject: [scikit-learn] Scikit Learn Random Classifier - TPR and FPR plotted on matplotlib Message-ID:

Hi All,

I have run the scikit-learn Random Forest Classifier against a dataset, and here are my TPR and FPR at various thresholds:

[image: Inline image 1]

I then plotted these values in matplotlib and am getting a very low AUC. Here is my matplotlib code. Could you please help me interpret the graph? Is my model OK, or is there something wrong? I would appreciate a quick response.

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn import metrics

    plt.title('Receiver Operating Characteristic')
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')

    fpr = [0.0002337345394340, 0.0001924870472260, 0.0001626973851550,
           0.0000950977673794, 0.0000721826427097, 0.0000538505429739,
           0.0000389557119386, 0.0000263523933702, 0.0000137490748018]

    tpr = [0.19673638244100000000, 0.18984141576600000000, 0.18122270742400000000,
           0.17055510860800000000, 0.16434892541100000000, 0.15789473684200000000,
           0.15134451850100000000, 0.14410480349300000000, 0.13238336014700000000]

    roc_auc = metrics.auc(fpr, tpr)

    plt.plot([0, 1], [0, 1], 'r--')
    plt.plot(fpr, tpr, 'bo-', label='AUC = %0.9f' % roc_auc)
    plt.legend(loc='lower right')

    plt.show()

[image: Inline image 2]

-------------- next part -------------- A non-text attachment was scrubbed...
From Dale.T.Smith at macys.com Wed Dec 14 08:10:51 2016 From: Dale.T.Smith at macys.com (Dale T Smith) Date: Wed, 14 Dec 2016 13:10:51 +0000 Subject: [scikit-learn] Scikit Learn Random Classifier - TPR and FPR plotted on matplotlib In-Reply-To: References: Message-ID:

I think you need to look at the examples.

Dale T. Smith | Macy's Systems and Technology | IFS eCom CSE Data Science
5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com

From Dale.T.Smith at macys.com Wed Dec 14 08:08:52 2016 From: Dale.T.Smith at macys.com (Dale T Smith) Date: Wed, 14 Dec 2016 13:08:52 +0000 Subject: [scikit-learn] Renaming subject lines if you get a digest Message-ID:

Please rename subjects if you use the digest; now the thread is not complete in the archive. Others will have a harder time benefitting from answers.

Dale T. Smith | Macy's Systems and Technology | IFS eCom CSE Data Science
5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com

-------------- next part -------------- An HTML attachment was scrubbed...
From stuart at stuartreynolds.net Wed Dec 14 11:52:22 2016 From: stuart at stuartreynolds.net (Stuart Reynolds) Date: Wed, 14 Dec 2016 16:52:22 +0000 Subject: [scikit-learn] Scikit Learn Random Classifier - TPR and FPR plotted on matplotlib In-Reply-To: References: Message-ID:

You're looking at a tiny subset of the possible cutoff thresholds for this classifier. Lower thresholds will give a higher TPR at the expense of a higher FPR. Usually, AUC is computed as the integral of this graph over the whole range of FPRs (from zero to one).

If you have your classifier output probabilities or activations, the maximum and minimum of these values will tell you what the largest and smallest thresholds should be. Scikit also has a function to directly receive the activations and true classes and compute the AUC and TPR/FPR curve.

-------------- next part -------------- A non-text attachment was scrubbed...
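Stuart's point about the narrow threshold range can be checked with plain arithmetic. The sketch below applies the trapezoid rule to the nine (FPR, TPR) pairs from the original post. The helper name trapezoid_area is introduced here for illustration and is not a scikit-learn function. It shows that the posted points cover only a sliver of the FPR axis, so the area under them is bounded by that width times the largest TPR, whatever the quality of the model.

```python
# (FPR, TPR) pairs copied from the original post, one pair per threshold.
fpr = [0.0002337345394340, 0.0001924870472260, 0.0001626973851550,
       0.0000950977673794, 0.0000721826427097, 0.0000538505429739,
       0.0000389557119386, 0.0000263523933702, 0.0000137490748018]
tpr = [0.196736382441, 0.189841415766, 0.181222707424,
       0.170555108608, 0.164348925411, 0.157894736842,
       0.151344518501, 0.144104803493, 0.132383360147]


def trapezoid_area(xs, ys):
    """Area under the piecewise-linear curve through the points, sorted by x."""
    points = sorted(zip(xs, ys))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


fpr_span = max(fpr) - min(fpr)          # width of the FPR axis actually covered
partial_auc = trapezoid_area(fpr, tpr)  # area under the posted points only
upper_bound = fpr_span * max(tpr)       # no curve over this width can exceed this

print("FPR range covered:", fpr_span)
print("area under posted points:", partial_auc)
print("upper bound for that range:", upper_bound)
```

A full ROC AUC integrates over FPR from 0 to 1, so a value computed on this range alone says very little about the classifier.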
From jmschreiber91 at gmail.com Wed Dec 14 13:46:10 2016 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Wed, 14 Dec 2016 10:46:10 -0800 Subject: [scikit-learn] Scikit Learn Random Classifier - TPR and FPR plotted on matplotlib In-Reply-To: References: Message-ID:

To make a proper ROC curve you need to test all possible thresholds, not just a subset of them. You can do this easily in sklearn.

    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve, roc_auc_score

    ...
    ...

    # predict_proba returns one column per class; for a binary problem,
    # keep the column of the positive class as the score
    y_score = clf.predict_proba(X)[:, 1]
    fpr, tpr, _ = roc_curve(y_true, y_score)
    auc = roc_auc_score(y_true, y_score)
    plt.plot(fpr, tpr, label='AUC = %0.3f' % auc)
    plt.legend(loc='lower right')
    plt.show()

From melamed at uchicago.edu Thu Dec 15 12:21:42 2016 From: melamed at uchicago.edu (Rachel Melamed) Date: Thu, 15 Dec 2016 17:21:42 +0000 Subject: [scikit-learn] biased predictions in logistic regression Message-ID: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu>

Hi all,

Does anyone have any suggestions for this problem: http://stackoverflow.com/questions/41125342/sklearn-logistic-regression-gives-biased-results

I am running around 1000 similar logistic regressions, with the same covariates but slightly different data and response variables. All of my response variables have sparse successes (p(success) < .05 usually).

I noticed that with the regularized regression, the results are consistently biased to predict more "successes" than are observed in the training data. When I relax the regularization, this bias goes away. The bias observed is unacceptable for my use case, but the more-regularized model does seem a bit better.
Below, I plot the results for the 1000 different regressions for 2 different values of C: [results for the different regressions for 2 different values of C] I looked at the parameter estimates for one of these regressions: below each point is one parameter. It seems like the intercept (the point on the bottom left) is too high for the C=1 model. [enter image description here] -------------- next part -------------- An HTML attachment was scrubbed... URL: From aadral at gmail.com Thu Dec 15 13:54:44 2016 From: aadral at gmail.com (Alexey Dral) Date: Thu, 15 Dec 2016 21:54:44 +0300 Subject: [scikit-learn] biased predictions in logistic regression In-Reply-To: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu> References: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu> Message-ID: Hi Rachel, Do you have your data normalized? 2016-12-15 20:21 GMT+03:00 Rachel Melamed : > Hi all, > Does anyone have any suggestions for this problem: > http://stackoverflow.com/questions/41125342/sklearn- > logistic-regression-gives-biased-results > > I am running around 1000 similar logistic regressions, with the same > covariates but slightly different data and response variables. All of my > response variables have a sparse successes (p(success) < .05 usually). > > I noticed that with the regularized regression, the results are > consistently biased to predict more "successes" than is observed in the > training data. When I relax the regularization, this bias goes away. The > bias observed is unacceptable for my use case, but the more-regularized > model does seem a bit better. > > Below, I plot the results for the 1000 different regressions for 2 > different values of C: [image: results for the different regressions for > 2 different values of C] > > I looked at the parameter estimates for one of these regressions: below > each point is one parameter. It seems like the intercept (the point on the > bottom left) is too high for the C=1 model. 
[image: enter image > description here] > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Yours sincerely, Alexey A. Dral -------------- next part -------------- An HTML attachment was scrubbed... URL: From melamed at uchicago.edu Thu Dec 15 14:03:22 2016 From: melamed at uchicago.edu (Rachel Melamed) Date: Thu, 15 Dec 2016 19:03:22 +0000 Subject: [scikit-learn] biased predictions in logistic regression In-Reply-To: References: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu> Message-ID: <21B65DF0-B382-40F9-A9E7-EA03F0D06E8C@uchicago.edu> Thanks for the reply. The covariates (?X") are all dummy/categorical variables. So I guess no, nothing is normalized. On Dec 15, 2016, at 1:54 PM, Alexey Dral > wrote: Hi Rachel, Do you have your data normalized? 2016-12-15 20:21 GMT+03:00 Rachel Melamed >: Hi all, Does anyone have any suggestions for this problem: http://stackoverflow.com/questions/41125342/sklearn-logistic-regression-gives-biased-results I am running around 1000 similar logistic regressions, with the same covariates but slightly different data and response variables. All of my response variables have a sparse successes (p(success) < .05 usually). I noticed that with the regularized regression, the results are consistently biased to predict more "successes" than is observed in the training data. When I relax the regularization, this bias goes away. The bias observed is unacceptable for my use case, but the more-regularized model does seem a bit better. Below, I plot the results for the 1000 different regressions for 2 different values of C: [results for the different regressions for 2 different values of C] I looked at the parameter estimates for one of these regressions: below each point is one parameter. It seems like the intercept (the point on the bottom left) is too high for the C=1 model. 
From aadral at gmail.com Thu Dec 15 14:16:15 2016
From: aadral at gmail.com (Alexey Dral)
Date: Thu, 15 Dec 2016 22:16:15 +0300
Subject: [scikit-learn] biased predictions in logistic regression
In-Reply-To: <21B65DF0-B382-40F9-A9E7-EA03F0D06E8C@uchicago.edu>
References: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu> <21B65DF0-B382-40F9-A9E7-EA03F0D06E8C@uchicago.edu>
Message-ID:

Could you try to normalize the dataset after dummy-encoding the features and see whether the behavior is reproducible?
--
Yours sincerely,
Alexey A. Dral

From melamed at uchicago.edu Thu Dec 15 16:02:28 2016
From: melamed at uchicago.edu (Rachel Melamed)
Date: Thu, 15 Dec 2016 21:02:28 +0000
Subject: [scikit-learn] biased predictions in logistic regression
In-Reply-To:
References: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu> <21B65DF0-B382-40F9-A9E7-EA03F0D06E8C@uchicago.edu>
Message-ID:

I just tried it and it did not appear to change the results at all?
I ran it as follows:

1) Normalize the dummy variables (by subtracting the median) to make a matrix of about 10000 x 5.

2) For each of the 1000 output variables:

a. Each output variable uses the same dummy variables, but not all settings of the covariates are observed for all output variables. So I create the design matrix with patsy, per output variable, to include pairwise interactions. Then I have a design matrix of around 10000 x 350, and a matrix I call "success_fail"
that has, for each setting, the number of successes and the number of failures, so it is of size 10000 x 2.

b. Run the regression using:

    import numpy as np
    from sklearn import linear_model

    # Stack the design matrix twice: one copy for successes, one for failures.
    skdesign = np.vstack((design, design))
    # Label the first copy 1 (success) and the second copy 0 (failure).
    sklabel = np.hstack((np.ones(success_fail.shape[0]),
                         np.zeros(success_fail.shape[0])))
    # Weight each row by its observed success or failure count.
    skweight = np.hstack((success_fail['success'], success_fail['fail']))

    logregN = linear_model.LogisticRegression(C=1, solver='lbfgs',
                                              fit_intercept=False)
    logregN.fit(skdesign, sklabel, sample_weight=skweight)
From stuart at stuartreynolds.net Thu Dec 15 16:41:32 2016
From: stuart at stuartreynolds.net (Stuart Reynolds)
Date: Thu, 15 Dec 2016 13:41:32 -0800
Subject: [scikit-learn] biased predictions in logistic regression
In-Reply-To:
References: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu> <21B65DF0-B382-40F9-A9E7-EA03F0D06E8C@uchicago.edu>
Message-ID:

LR is biased with imbalanced datasets. Is your dataset unbalanced (e.g., is there one class that has a much smaller prevalence in the data than the other)?
From se.raschka at gmail.com Thu Dec 15 16:43:35 2016
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Thu, 15 Dec 2016 16:43:35 -0500
Subject: [scikit-learn] biased predictions in logistic regression
In-Reply-To:
References: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu> <21B65DF0-B382-40F9-A9E7-EA03F0D06E8C@uchicago.edu>
Message-ID:

Subtracting the median wouldn't result in normalizing in the usual sense, since subtracting a constant just shifts the values by a constant.
Instead, for logistic regression and most optimizers, I would recommend subtracting the mean to center the features at zero and dividing by the standard deviation to get "z" scores (e.g., this can be done with StandardScaler()).

Best,
Sebastian
From sean.violante at gmail.com Thu Dec 15 17:02:08 2016
From: sean.violante at gmail.com (Sean Violante)
Date: Thu, 15 Dec 2016 23:02:08 +0100
Subject: [scikit-learn] biased predictions in logistic regression
In-Reply-To:
References: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu> <21B65DF0-B382-40F9-A9E7-EA03F0D06E8C@uchicago.edu>
Message-ID:

The problem is the (stupid!) liblinear solver, which also penalises the intercept (in regularisation). Use a different solver or change the intercept_scaling parameter.
From sean.violante at gmail.com Thu Dec 15 17:05:51 2016
From: sean.violante at gmail.com (Sean Violante)
Date: Thu, 15 Dec 2016 23:05:51 +0100
Subject: [scikit-learn] biased predictions in logistic regression
In-Reply-To:
References: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu> <21B65DF0-B382-40F9-A9E7-EA03F0D06E8C@uchicago.edu>
Message-ID:

Sorry, I just saw that you are not using the liblinear solver. I agree with Sebastian: you should subtract the mean, not the median.
From stuart at stuartreynolds.net Thu Dec 15 18:00:12 2016
From: stuart at stuartreynolds.net (Stuart Reynolds)
Date: Thu, 15 Dec 2016 15:00:12 -0800
Subject: [scikit-learn] Model checksums
In-Reply-To:
References:
Message-ID:

I don't mean that scikit-learn's modeling is non-deterministic; I mean the pickle library: same input, different serialized bytes as output. It was my recollection that dictionaries were inconsistently ordered when serialized, or that some object ID was included in the serialization. Anyhow, I don't seem to be able to reproduce it now that I've fixed a bug and am actually providing identical input to serialize.

Thanks for the joblib serialization link. The memory serializer is buried in the docs (it is not mentioned in the docs on persistence).

On Tue, Dec 13, 2016 at 12:10 PM, Gael Varoquaux <gael.varoquaux at normalesup.org> wrote:

> What do you mean non deterministic?
> If you set the random_state of models, we try to make them deterministic.
> Most often, any residual variability is numerical noise that reveals
> statistical error bars.
>
> G
>
> On Dec 13, 2016, at 19:29, Stuart Reynolds wrote:
>
>> I'd like to cache some functions to avoid rebuilding models like so:
>>
>> @cached
>> def train(model, dataparams): ...
>>
>> model is an (untrained) scikit-learn object and dataparams is a dict.
>> The @cached annotation forms a SHA checksum out of the parameters of the
>> function it annotates and returns the previously calculated function
>> result if the parameters match.
>>
>> The tricky part here is reliably generating a checksum from the
>> parameters. Scikit uses Python's pickle
>> (http://scikit-learn.org/stable/modules/model_persistence.html) but the
>> pickle library is non-deterministic (same inputs to pickle.dumps yields
>> differing output! -- *I know*).
>>
>> So... any suggestions on how to generate checksums from models in python?
>>
>> Thanks.
>> - Stuart
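One way to sidestep pickle's byte-level output for the @cached idea quoted above is to hash the estimator's class name and hyperparameters rather than its serialized bytes. The following is only a minimal sketch under stated assumptions, not what joblib does internally: `params_checksum` and `DummyModel` are hypothetical names, the model is assumed to expose a scikit-learn-style `get_params()`, and every parameter value is assumed to be JSON-serializable or to have a stable `repr()`.

```python
import hashlib
import json


def params_checksum(model, dataparams):
    """Return a SHA-256 hex digest of a model's class, hyperparameters and data parameters.

    Assumes `model` offers a scikit-learn-style get_params() and that every
    value is JSON-serializable or has a stable repr().
    """
    payload = {
        "class": type(model).__module__ + "." + type(model).__qualname__,
        "params": model.get_params(),
        "dataparams": dataparams,
    }
    # sort_keys=True makes the encoding independent of dict insertion order;
    # default=repr is a fallback for values json cannot encode directly.
    encoded = json.dumps(payload, sort_keys=True, default=repr)
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


class DummyModel:
    """Hypothetical stand-in for an untrained estimator."""

    def __init__(self, C=1.0, solver="lbfgs"):
        self.C = C
        self.solver = solver

    def get_params(self):
        return {"solver": self.solver, "C": self.C}


# Same settings, different dict ordering: the digests should agree.
a = params_checksum(DummyModel(C=1.0), {"rows": 100, "seed": 0})
b = params_checksum(DummyModel(C=1.0), {"seed": 0, "rows": 100})
# A different hyperparameter should change the digest.
c = params_checksum(DummyModel(C=0.1), {"rows": 100, "seed": 0})
print(a == b, a == c)
```

The `repr` fallback is the weak point: an object whose `repr()` embeds a memory address would give a different digest on every run, so such values would need explicit handling. joblib also provides a `joblib.hash` function that may be a better fit when the arguments include NumPy arrays.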
From melamed at uchicago.edu Thu Dec 15 22:04:03 2016
From: melamed at uchicago.edu (Rachel Melamed)
Date: Fri, 16 Dec 2016 03:04:03 +0000
Subject: [scikit-learn] biased predictions in logistic regression
In-Reply-To:
References: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu> <21B65DF0-B382-40F9-A9E7-EA03F0D06E8C@uchicago.edu>
Message-ID:

Stuart,

Yes, the data is quite imbalanced (this is what I meant by p(success) < .05).

To be clear, I calculate \sum_i \hat{y}_i, where

    \hat{y} = logregN.predict_proba(design)[:,1]*(success_fail.sum(axis=1))

and compare that number to the observed number of successes. I find the predicted number to always be higher (I think because of the intercept).

I was not aware of a bias for imbalanced data. Can you tell me more? Why does it not appear with the relaxed regularization? Also, using the same data with statsmodels LR, which has no regularization, this doesn't seem to be a problem. Any suggestions for how I could fix this are welcome.

Thank you
From stuart at stuartreynolds.net Thu Dec 15 23:30:36 2016
From: stuart at stuartreynolds.net (Stuart Reynolds)
Date: Fri, 16 Dec 2016 04:30:36 +0000
Subject: [scikit-learn] biased predictions in logistic regression
In-Reply-To:
References: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu> <21B65DF0-B382-40F9-A9E7-EA03F0D06E8C@uchicago.edu>
Message-ID:

Here's a discussion:
http://stats.stackexchange.com/questions/6067/does-an-unbalanced-sample-matter-when-doing-logistic-regression

See the King and Zeng reference. It would be nice to have these methods in scikit.
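One possibility this thread does not settle: the code posted earlier fits with fit_intercept=False, so if the design matrix carries its own constant column (patsy adds one by default), that column is shrunk by the L2 penalty like any other coefficient, whatever the solver. For scikit-learn's objective, 0.5*||w||^2 + C * sum_i loss_i, setting the derivative with respect to the constant's coefficient w0 to zero gives sum(predicted) - sum(observed) = -w0 / C, which is positive when successes are rare (w0 < 0). The sketch below checks this on synthetic data; the sample size and coefficients are made up for illustration and are not Rachel's data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, C = 5000, 1.0

# Synthetic dummy covariates and a rare outcome (a few percent successes).
X = rng.binomial(1, 0.3, size=(n, 5)).astype(float)
logit = -3.0 + X @ np.array([0.5, -0.3, 0.2, 0.0, 0.4])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

# (a) Explicit constant column + fit_intercept=False: the constant is penalised.
X_const = np.hstack([np.ones((n, 1)), X])
penalised = LogisticRegression(C=C, solver="lbfgs", fit_intercept=False,
                               tol=1e-8, max_iter=5000).fit(X_const, y)

# (b) fit_intercept=True with lbfgs: the intercept is left out of the penalty.
free = LogisticRegression(C=C, solver="lbfgs", fit_intercept=True,
                          tol=1e-8, max_iter=5000).fit(X, y)

observed = y.sum()
excess_penalised = penalised.predict_proba(X_const)[:, 1].sum() - observed
excess_free = free.predict_proba(X)[:, 1].sum() - observed

print("excess predicted successes, penalised constant:", excess_penalised)
print("-w0 / C:", -penalised.coef_[0, 0] / C)
print("excess predicted successes, free intercept:", excess_free)
```

If this is what is happening in the original setup, removing the constant column from the matrix passed to fit() and letting the estimator fit the intercept, or raising C, should shrink the excess. That would be consistent with the bias disappearing when the regularization is relaxed, though the thread itself does not confirm the cause.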
On Thu, Dec 15, 2016 at 7:05 PM Rachel Melamed wrote: > > > > > > > > > > > Stuart, > > > > Yes the data is quite imbalanced (this is what I meant by p(success) < .05 > ) > > > > > > > > > > > > To be clear, I calculate > > > > > \sum_i \hat{y_i} > = logregN.predict_proba(design)[:,1]*(success_fail.sum(axis=1)) > > > > > and compare that number to the observed number of success. I find the > predicted number to always be higher (I think, because of the intercept). > > > > > > > > > > > > I was not aware of a bias for imbalanced data. Can you tell me more? Why > does it not appear with the relaxed regularization? Also, using the same > data with statsmodels LR, which has no regularization, this doesn't seem to > be a problem. Any suggestions for > > how I could fix this are welcome. > > > > > > > > > > > > Thank you > > > > > > > > > > > > > > On Dec 15, 2016, at 4:41 PM, Stuart Reynolds > wrote: > > > > > > > > LR is biased with imbalanced datasets. Is your dataset unbalanced? (e.g. > is there one class that has a much smaller prevalence in the data that the > other)? > > > > > > On Thu, Dec 15, 2016 at 1:02 PM, Rachel Melamed > > wrote: > > > > > I just tried it and it did not appear to change the results at all? > > I ran it as follows: > > 1) Normalize dummy variables (by subtracting median) to make a matrix of > about 10000 x 5 > > > > > > > > 2) For each of the 1000 output variables: > > > a. Each output variable uses the same dummy variables, but not all > settings of covariates are observed for all output variables. So I create > the design matrix using patsy per output variable to include pairwise > interactions. Then, I have an around > > 10000 x 350 design matrix , and a matrix I call ?success_fail? that has > for each setting the number of success and number of fail, so it is of size > 10000 x 2 > > > > > > > > b. 
Run regression using: > > > > > > > skdesign = np.vstack((design,design)) > > > > > sklabel = np.hstack((np.ones(success_fail.shape[0]), > > > np.zeros(success_fail.shape[0]))) > > > > > skweight = np.hstack((success_fail['success'], success_fail['fail'])) > > > > > > > > > > logregN = linear_model.LogisticRegression(C=1, > > > solver= 'lbfgs',fit_intercept=False) > > > logregN.fit(skdesign, sklabel, sample_weight=skweight) > > > > > > > > > > > > > > > > > > > > > > > On Dec 15, 2016, at 2:16 PM, Alexey Dral wrote: > > > > > > > > Could you try to normalize dataset after feature dummy encoding and see if > it is reproducible behavior? > > > > > 2016-12-15 22:03 GMT+03:00 Rachel Melamed > > : > > > > > Thanks for the reply. The covariates (?X") are all dummy/categorical > variables. So I guess no, nothing is normalized. > > > > > > > > > > > > > > > On Dec 15, 2016, at 1:54 PM, Alexey Dral wrote: > > > > > > > > Hi Rachel, > > > > > > > Do you have your data normalized? > > > > > > 2016-12-15 20:21 GMT+03:00 Rachel Melamed > > : > > > > > > > Hi all, > > > Does anyone have any suggestions for this problem: > > > > http://stackoverflow.com/questions/41125342/sklearn-logistic-regression-gives-biased-results > > > > > > > > > > > I am running around 1000 similar logistic regressions, with the same > covariates but slightly different data and response variables. All of my > response variables have a sparse successes (p(success) < .05 usually). > > > > > I noticed that with the regularized regression, the results are > consistently biased to predict more "successes" than is observed in the > training data. When I relax the regularization, this bias goes away. The > bias observed is unacceptable for my use case, but > > the more-regularized model does seem a bit better. 
> > > > > Below, I plot the results for the 1000 different regressions for 2 > different values of C: [image: results for the different regressions for > 2 different values of C] > > > > > I looked at the parameter estimates for one of these regressions: below > each point is one parameter. It seems like the intercept (the point on the > bottom left) is too high for the C=1 model. [image: enter image > description here] > > > > > > > > > > > > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > Yours sincerely, > > > Alexey A. Dral > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > Yours sincerely, > > > Alexey A. 
Dral
> > > _______________________________________________
> > > scikit-learn mailing list
> > > scikit-learn at python.org
> > > https://mail.python.org/mailman/listinfo/scikit-learn

From josef.pktd at gmail.com Fri Dec 16 00:11:00 2016
From: josef.pktd at gmail.com (josef.pktd at gmail.com)
Date: Fri, 16 Dec 2016 00:11:00 -0500
Subject: [scikit-learn] biased predictions in logistic regression
In-Reply-To:
References: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu> <21B65DF0-B382-40F9-A9E7-EA03F0D06E8C@uchicago.edu>
Message-ID:

Just some generic comments; I don't have any experience with penalized estimation, nor did I go through the math.

In unregularized logistic regression (Logit), and in several other models, the estimator satisfies some aggregation properties, so that in the sample (training set) the predicted proportions match the observed proportions.

Regularized estimation does not require unbiased estimation of the parameters, because it optimizes a different objective function, like mean squared error in the linear model. We are trading off bias against variance. I think this will propagate to the prediction, but I'm not sure whether an unpenalized intercept can be made to compensate for the bias in the average prediction.

For Logit this would mean that although we have a bias, we have less variance/variation in the prediction, so overall we are doing better than with unregularized prediction under the chosen penalization measure. I assume that because the regularization biases the coefficients towards zero, it also biases towards a prediction of 0.5, unless that is compensated for by the intercept.

I didn't read the King and Zeng (2001) article, but based on a brief search it doesn't mention penalization or regularization, so it doesn't seem to address the regularization bias. (Aside: from the literature, I think many people use a different model than logistic for rare-events data, either Poisson with exponential link or Binomial/Bernoulli with an asymmetric link function.)

I think demeaning could help because it reduces the dependence between the intercept and the other penalized variables, but because the model is nonlinear it will not make them orthogonal.

The question is whether it's possible to improve the estimator by additionally adjusting the mean or the threshold for 0-1 predictions. It might depend on the criteria used to choose the penalization. I don't know and have no idea what scikit-learn does.

Josef

On Thu, Dec 15, 2016 at 11:30 PM, Stuart Reynolds wrote:
> Here's a discussion
>
> http://stats.stackexchange.com/questions/6067/does-an-unbalanced-sample-matter-when-doing-logistic-regression
>
> See the King and Zeng reference.
> It would be nice to have these methods in scikit.
>
> On Thu, Dec 15, 2016 at 7:05 PM Rachel Melamed wrote:
>
>> Stuart,
>>
>> Yes the data is quite imbalanced (this is what I meant by p(success) < .05)
>>
>> To be clear, I calculate
>>
>> \sum_i \hat{y_i} = logregN.predict_proba(design)[:,1]*(success_fail.
>> sum(axis=1)) >> >> >> >> >> and compare that number to the observed number of success. I find the >> predicted number to always be higher (I think, because of the intercept). >> >> >> >> >> >> >> >> >> >> >> >> I was not aware of a bias for imbalanced data. Can you tell me more? Why >> does it not appear with the relaxed regularization? Also, using the same >> data with statsmodels LR, which has no regularization, this doesn't seem to >> be a problem. Any suggestions for >> >> how I could fix this are welcome. >> >> >> >> >> >> >> >> >> >> >> >> Thank you >> >> >> >> >> >> >> >> >> >> >> >> >> >> On Dec 15, 2016, at 4:41 PM, Stuart Reynolds >> wrote: >> >> >> >> >> >> >> >> LR is biased with imbalanced datasets. Is your dataset unbalanced? (e.g. >> is there one class that has a much smaller prevalence in the data that the >> other)? >> >> >> >> >> >> On Thu, Dec 15, 2016 at 1:02 PM, Rachel Melamed >> >> wrote: >> >> >> >> >> I just tried it and it did not appear to change the results at all? >> >> I ran it as follows: >> >> 1) Normalize dummy variables (by subtracting median) to make a matrix of >> about 10000 x 5 >> >> >> >> >> >> >> >> 2) For each of the 1000 output variables: >> >> >> a. Each output variable uses the same dummy variables, but not all >> settings of covariates are observed for all output variables. So I create >> the design matrix using patsy per output variable to include pairwise >> interactions. Then, I have an around >> >> 10000 x 350 design matrix , and a matrix I call ?success_fail? that has >> for each setting the number of success and number of fail, so it is of size >> 10000 x 2 >> >> >> >> >> >> >> >> b. 
Run regression using: >> >> >> >> >> >> >> skdesign = np.vstack((design,design)) >> >> >> >> >> sklabel = np.hstack((np.ones(success_fail.shape[0]), >> >> >> np.zeros(success_fail.shape[0]))) >> >> >> >> >> skweight = np.hstack((success_fail['success'], success_fail['fail'])) >> >> >> >> >> >> >> >> >> >> logregN = linear_model.LogisticRegression(C=1, >> >> >> solver= 'lbfgs',fit_intercept=False) >> >> >> logregN.fit(skdesign, sklabel, sample_weight=skweight) >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> On Dec 15, 2016, at 2:16 PM, Alexey Dral wrote: >> >> >> >> >> >> >> >> Could you try to normalize dataset after feature dummy encoding and see >> if it is reproducible behavior? >> >> >> >> >> 2016-12-15 22:03 GMT+03:00 Rachel Melamed >> >> : >> >> >> >> >> Thanks for the reply. The covariates (?X") are all dummy/categorical >> variables. So I guess no, nothing is normalized. >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> On Dec 15, 2016, at 1:54 PM, Alexey Dral wrote: >> >> >> >> >> >> >> >> Hi Rachel, >> >> >> >> >> >> >> Do you have your data normalized? >> >> >> >> >> >> 2016-12-15 20:21 GMT+03:00 Rachel Melamed >> >> : >> >> >> >> >> >> >> Hi all, >> >> >> Does anyone have any suggestions for this problem: >> >> >> http://stackoverflow.com/questions/41125342/sklearn- >> logistic-regression-gives-biased-results >> >> >> >> >> >> >> >> >> >> >> I am running around 1000 similar logistic regressions, with the same >> covariates but slightly different data and response variables. All of my >> response variables have a sparse successes (p(success) < .05 usually). >> >> >> >> >> I noticed that with the regularized regression, the results are >> consistently biased to predict more "successes" than is observed in the >> training data. When I relax the regularization, this bias goes away. The >> bias observed is unacceptable for my use case, but >> >> the more-regularized model does seem a bit better. 
>> >> >> >> >> Below, I plot the results for the 1000 different regressions for 2 >> different values of C: [image: results for the different regressions for >> 2 different values of C] >> >> >> >> >> I looked at the parameter estimates for one of these regressions: below >> each point is one parameter. It seems like the intercept (the point on the >> bottom left) is too high for the C=1 model. [image: enter image >> description here] >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> _______________________________________________ >> >> >> scikit-learn mailing list >> >> >> scikit-learn at python.org >> >> >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> -- >> >> >> >> >> >> >> >> >> >> >> >> >> Yours sincerely, >> >> >> Alexey A. Dral >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> _______________________________________________ >> >> >> scikit-learn mailing list >> >> >> scikit-learn at python.org >> >> >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> _______________________________________________ >> >> >> scikit-learn mailing list >> >> >> scikit-learn at python.org >> >> >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> -- >> >> >> >> >> >> >> >> >> >> >> >> >> Yours sincerely, >> >> >> Alexey A. 
Dral
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn

From stuart at stuartreynolds.net Fri Dec 16 00:30:42 2016
From: stuart at stuartreynolds.net (Stuart Reynolds)
Date: Fri, 16 Dec 2016 05:30:42 +0000
Subject: [scikit-learn] biased predictions in logistic regression
In-Reply-To:
References: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu> <21B65DF0-B382-40F9-A9E7-EA03F0D06E8C@uchicago.edu>
Message-ID:

Sorry... I mean penalized likelihood, not large weight penalization.

Here's the reference I was thinking of:
http://m.statisticalhorizons.com/?task=get&pageid=1424858329

On Thu, Dec 15, 2016 at 9:12 PM wrote:
> just some generic comments, I don't have any experience with penalized
> estimation nor did I go through the math.
> > In unregularized Logistis Regression or Logit and in several other models > the estimator satisfies some aggregation properties so that in sample or > training set proportions match between predicted proportions and those of > the sample. > > Regularized estimation does not require unbiased estimation of the > parameters because it maximizes a different objective function, like mean > squared error in the linear model. We are trading off bias against > variance. I think this will propagate to the prediction, but I'm not sure > whether an unpenalized intercept can be made to compensate for the bias in > the average prediction. > > For Logit this would mean that although we have a bias, we have less > variance/variation in the prediction, so overall we are doing better than > with unregularized prediction under the chosen penalization measure. > I assume because the regularization biases towards zero coefficients it > also biases towards a prediction of 0.5, unless it's compensated for by the > intercept. > > I didn't read the King and Zheng (2001) article, but it doesn't mention > penalization or regularization, based on a brief search, so it doesn't seem > to address the regularization bias. (Aside, from the literature I think > many people use a different model than logistic for rare events data, > either Poisson with exponential link or Binomial/Bernoulli with an > asymmetric link function.) > > I think, demeaning could help because it reduces the dependence between > the intercept and the other penalized variables, but because of the > nonlinear model it will not make it orthogonal. > > The question is whether it's possible to improve the estimator by > additionally adjusting the mean or the threshold for 0-1 predictions. It > might depend on the criteria to choose the penalization. I don't know and > have no idea what scikit-learn does. 
> > Josef > > On Thu, Dec 15, 2016 at 11:30 PM, Stuart Reynolds < > stuart at stuartreynolds.net> wrote: > > Here's a discussion > > > http://stats.stackexchange.com/questions/6067/does-an-unbalanced-sample-matter-when-doing-logistic-regression > > See the Zheng and King reference. > It would be nice to have these methods in scikit. > > > > On Thu, Dec 15, 2016 at 7:05 PM Rachel Melamed > wrote: > > > > > > > > > > > > Stuart, > > > > Yes the data is quite imbalanced (this is what I meant by p(success) < .05 > ) > > > > > > > > > > > > To be clear, I calculate > > > > > \sum_i \hat{y_i} > = logregN.predict_proba(design)[:,1]*(success_fail.sum(axis=1)) > > > > > and compare that number to the observed number of success. I find the > predicted number to always be higher (I think, because of the intercept). > > > > > > > > > > > > I was not aware of a bias for imbalanced data. Can you tell me more? Why > does it not appear with the relaxed regularization? Also, using the same > data with statsmodels LR, which has no regularization, this doesn't seem to > be a problem. Any suggestions for > > how I could fix this are welcome. > > > > > > > > > > > > Thank you > > > > > > > > > > > > > > On Dec 15, 2016, at 4:41 PM, Stuart Reynolds > wrote: > > > > > > > > LR is biased with imbalanced datasets. Is your dataset unbalanced? (e.g. > is there one class that has a much smaller prevalence in the data that the > other)? > > > > > > On Thu, Dec 15, 2016 at 1:02 PM, Rachel Melamed > > wrote: > > > > > I just tried it and it did not appear to change the results at all? > > I ran it as follows: > > 1) Normalize dummy variables (by subtracting median) to make a matrix of > about 10000 x 5 > > > > > > > > 2) For each of the 1000 output variables: > > > a. Each output variable uses the same dummy variables, but not all > settings of covariates are observed for all output variables. So I create > the design matrix using patsy per output variable to include pairwise > interactions. 
Then, I have an around > > 10000 x 350 design matrix , and a matrix I call ?success_fail? that has > for each setting the number of success and number of fail, so it is of size > 10000 x 2 > > > > > > > > b. Run regression using: > > > > > > > skdesign = np.vstack((design,design)) > > > > > sklabel = np.hstack((np.ones(success_fail.shape[0]), > > > np.zeros(success_fail.shape[0]))) > > > > > skweight = np.hstack((success_fail['success'], success_fail['fail'])) > > > > > > > > > > logregN = linear_model.LogisticRegression(C=1, > > > solver= 'lbfgs',fit_intercept=False) > > > logregN.fit(skdesign, sklabel, sample_weight=skweight) > > > > > > > > > > > > > > > > > > > > > > > On Dec 15, 2016, at 2:16 PM, Alexey Dral wrote: > > > > > > > > Could you try to normalize dataset after feature dummy encoding and see if > it is reproducible behavior? > > > > > 2016-12-15 22:03 GMT+03:00 Rachel Melamed > > : > > > > > Thanks for the reply. The covariates (?X") are all dummy/categorical > variables. So I guess no, nothing is normalized. > > > > > > > > > > > > > > > On Dec 15, 2016, at 1:54 PM, Alexey Dral wrote: > > > > > > > > Hi Rachel, > > > > > > > Do you have your data normalized? > > > > > > 2016-12-15 20:21 GMT+03:00 Rachel Melamed > > : > > > > > > > Hi all, > > > Does anyone have any suggestions for this problem: > > > > http://stackoverflow.com/questions/41125342/sklearn-logistic-regression-gives-biased-results > > > > > > > > > > > I am running around 1000 similar logistic regressions, with the same > covariates but slightly different data and response variables. All of my > response variables have a sparse successes (p(success) < .05 usually). > > > > > I noticed that with the regularized regression, the results are > consistently biased to predict more "successes" than is observed in the > training data. When I relax the regularization, this bias goes away. 
The > bias observed is unacceptable for my use case, but > > the more-regularized model does seem a bit better. > > > > > Below, I plot the results for the 1000 different regressions for 2 > different values of C: [image: results for the different regressions for > 2 different values of C] > > > > > I looked at the parameter estimates for one of these regressions: below > each point is one parameter. It seems like the intercept (the point on the > bottom left) is too high for the C=1 model. [image: enter image > description here] > > > > > > > > > > > > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > Yours sincerely, > > > Alexey A. Dral > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > Yours sincerely, > > > Alexey A. 
Dral
> > > _______________________________________________
> > > scikit-learn mailing list
> > > scikit-learn at python.org
> > > https://mail.python.org/mailman/listinfo/scikit-learn

From melamed at uchicago.edu Sat Dec 17 22:25:10 2016
From: melamed at uchicago.edu (Rachel Melamed)
Date: Sun, 18 Dec 2016 03:25:10 +0000
Subject: [scikit-learn] biased predictions in logistic regression
In-Reply-To:
References: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu> <21B65DF0-B382-40F9-A9E7-EA03F0D06E8C@uchicago.edu>
Message-ID:

Hi Sean, Sebastian, Alexey (and Josef),

I'm not sure I fully understand what normalizing a dummy variable should consist of, so please let me know if I am interpreting your suggestion right. I believe I can't use the StandardScaler since I am using grouped data. As I mentioned, I'm using a matrix "success_fail" containing the number of observations of each outcome per row of the design matrix. I tried to implement your suggestions by building the design matrix for each regression as before, using patsy, and then normalizing:

    import numpy as np

    darray = np.array(design)
    # per-row weight: share of all observations that fall in that row
    weights = np.tile(success_fail.sum(axis=1) / success_fail.sum().sum(),
                      (darray.shape[1], 1)).transpose()
    dmean = (darray * weights).sum(axis=0)  # weighted column means
    # weighted standard deviations (despite the name "dvar")
    dvar = (((darray - dmean) ** 2) * weights).sum(axis=0) ** .5
    dvar[dvar == 0] = 1  # do not rescale constant columns
    design_norm = (darray - dmean) / dvar
    design_norm[:, 0] = 1  ## intercept stays at 1

Then I use the normalized version of the design matrix as input to the regression. It seems like the bias is still there, though the results are a bit different (see plot).

Is this what you were suggesting? I'm also not sure I understand why the error would be higher on the training data for the less-regularized setting.

Thanks again
Rachel

On Dec 15, 2016, at 5:05 PM, Sean Violante wrote:

Sorry, just saw you are not using the liblinear solver. Agree with Sebastian, you should subtract the mean, not the median.

On 15 Dec 2016 11:02 pm, "Sean Violante" wrote:

The problem is the (stupid!) liblinear solver that also penalises the intercept (in regularisation). Use a different solver or change the intercept_scaling parameter.

On 15 Dec 2016 10:44 pm, "Sebastian Raschka" wrote:

Subtracting the median wouldn't result in normalizing in the usual sense, since subtracting a constant just shifts the values by a constant. Instead, for logistic regression & most optimizers, I would recommend subtracting the mean to center the features at mean zero and dividing by the standard deviation to get "z" scores (e.g., this can be done by the StandardScaler()).

Best,
Sebastian

> On Dec 15, 2016, at 4:02 PM, Rachel Melamed wrote:
>
> I just tried it and it did not appear to change the results at all?
> I ran it as follows: > 1) Normalize dummy variables (by subtracting median) to make a matrix of about 10000 x 5 > > 2) For each of the 1000 output variables: > a. Each output variable uses the same dummy variables, but not all settings of covariates are observed for all output variables. So I create the design matrix using patsy per output variable to include pairwise interactions. Then, I have an around 10000 x 350 design matrix , and a matrix I call ?success_fail? that has for each setting the number of success and number of fail, so it is of size 10000 x 2 > > b. Run regression using: > > skdesign = np.vstack((design,design)) > > sklabel = np.hstack((np.ones(success_fail.shape[0]), > np.zeros(success_fail.shape[0]))) > > skweight = np.hstack((success_fail['success'], success_fail['fail'])) > > logregN = linear_model.LogisticRegression(C=1, > solver= 'lbfgs',fit_intercept=False) > logregN.fit(skdesign, sklabel, sample_weight=skweight) > > >> On Dec 15, 2016, at 2:16 PM, Alexey Dral > wrote: >> >> Could you try to normalize dataset after feature dummy encoding and see if it is reproducible behavior? >> >> 2016-12-15 22:03 GMT+03:00 Rachel Melamed >: >> Thanks for the reply. The covariates (?X") are all dummy/categorical variables. So I guess no, nothing is normalized. >> >>> On Dec 15, 2016, at 1:54 PM, Alexey Dral > wrote: >>> >>> Hi Rachel, >>> >>> Do you have your data normalized? >>> >>> 2016-12-15 20:21 GMT+03:00 Rachel Melamed >: >>> Hi all, >>> Does anyone have any suggestions for this problem: >>> http://stackoverflow.com/questions/41125342/sklearn-logistic-regression-gives-biased-results >>> >>> I am running around 1000 similar logistic regressions, with the same covariates but slightly different data and response variables. All of my response variables have a sparse successes (p(success) < .05 usually). 
>>> >>> I noticed that with the regularized regression, the results are consistently biased to predict more "successes" than is observed in the training data. When I relax the regularization, this bias goes away. The bias observed is unacceptable for my use case, but the more-regularized model does seem a bit better. >>> >>> Below, I plot the results for the 1000 different regressions for 2 different values of C: >>> >>> I looked at the parameter estimates for one of these regressions: below each point is one parameter. It seems like the intercept (the point on the bottom left) is too high for the C=1 model. >>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> >>> >>> -- >>> Yours sincerely, >>> Alexey A. Dral >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> >> -- >> Yours sincerely, >> Alexey A. Dral >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... 
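A minimal sketch of the count-weighted standardization described in the message above, written as a small NumPy helper. It assumes that column 0 of the design matrix is the constant column and that each row is weighted by its total number of observations; the name `weighted_standardize` and the toy arrays are made up for illustration and are not code from the thread.

```python
import numpy as np


def weighted_standardize(design, row_counts):
    """Center and scale columns using per-row observation counts as weights.

    Assumption: column 0 is the constant (intercept) column and is reset to 1.
    """
    design = np.asarray(design, dtype=float)
    w = np.asarray(row_counts, dtype=float)
    w = w / w.sum()                          # weights sum to one
    mean = w @ design                        # weighted column means
    std = np.sqrt(w @ (design - mean) ** 2)  # weighted column standard deviations
    std[std == 0] = 1.0                      # constant columns are left unscaled
    out = (design - mean) / std
    out[:, 0] = 1.0                          # intercept column stays at 1
    return out


# Toy grouped data: 4 covariate settings with (success, fail) counts per row.
design = np.array([[1, 0, 1],
                   [1, 1, 0],
                   [1, 1, 1],
                   [1, 0, 0]], dtype=float)
success_fail = np.array([[2, 8], [5, 25], [1, 19], [3, 37]])
row_counts = success_fail.sum(axis=1)

design_norm = weighted_standardize(design, row_counts)
print(design_norm)
```

After this transformation the non-constant columns have weighted mean zero and weighted variance one, which is what StandardScaler would give on the fully expanded (one row per observation) data under these assumptions.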
-------------- next part --------------
A non-text attachment was scrubbed...
Name: PastedGraphic-2.tiff
Type: image/tiff
Size: 25990 bytes
Desc: PastedGraphic-2.tiff

From josef.pktd at gmail.com Sun Dec 18 10:09:42 2016
From: josef.pktd at gmail.com (josef.pktd at gmail.com)
Date: Sun, 18 Dec 2016 10:09:42 -0500
Subject: [scikit-learn] biased predictions in logistic regression
In-Reply-To:
References: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu> <21B65DF0-B382-40F9-A9E7-EA03F0D06E8C@uchicago.edu>
Message-ID:

On Sat, Dec 17, 2016 at 10:25 PM, Rachel Melamed wrote:
> Hi Sean, Sebastian, Alexey (and Josef),
> I'm not sure I fully understand what normalizing a dummy should consist
> of, so please let me know if I am interpreting your suggestion right. I
> believe I can't use the StandardScaler since I am using grouped data. As I
> mentioned, I'm using a matrix "success_fail" containing the number of
> observations of each outcome per row of the design matrix. I tried to
> implement your suggestions by making my design matrix per each regression
> as previously, using patsy, but then normalizing:
>
> darray = np.array(design)
> weights = np.tile(success_fail.sum(axis=1)/success_fail.sum().sum(),
>                   (darray.shape[1],1)).transpose()
>
> dmean = (darray*weights).sum(axis=0)
> dvar = (((darray - dmean)**2)*weights).sum(axis=0)**.5
> dvar[dvar==0] = 1
> design_norm = (darray - dmean) / dvar
> design_norm[:,0] = 1 ## intercept stays at 1
>
> Then I use the normalized version of the design matrix as input to the
> regression. It seems like the bias is still there, though the results are a
> bit different (see plot)
> Is this what you were suggesting? I'm also not sure I understand why the
> error would be higher on the training data for the less-regularized setting.
> Thanks again
> Rachel
>

Doing a partial check on the math, AFAICS: the estimating equation (gradient condition) for the intercept is unchanged even when the other parameters are penalized.
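A quick numerical check of that gradient condition, as a minimal NumPy sketch and not scikit-learn's actual solver. It assumes an L2-penalized objective of the form 0.5*||b||^2 + C * sum_i w_i * logloss_i, grouped success/fail counts stacked the same way as in the snippet quoted earlier in the thread, and a constant first column; `fit_logit`, the toy data and the seed are made up for illustration. Under that assumed objective, leaving the intercept out of the penalty gives the condition sum_i w_i (y_i - p_i) = 0, so predicted and observed successes agree. Penalizing the constant column instead gives b0 = C * sum_i w_i (y_i - p_i), so predicted minus observed successes equals -b0/C, which is positive whenever the intercept is negative.

```python
import numpy as np


def fit_logit(X, y, w, C=1.0, penalize_intercept=False, n_iter=50):
    """Weighted L2-penalized logistic regression via Newton's method.

    Minimizes 0.5 * sum_j pen_j * b_j**2 + C * sum_i w_i * logloss_i.
    Assumption: column 0 of X is the constant (intercept) column.
    """
    pen = np.ones(X.shape[1])
    if not penalize_intercept:
        pen[0] = 0.0  # leave the intercept out of the penalty
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = pen * beta - C * X.T @ (w * (y - p))
        hess = np.diag(pen) + C * (X.T * (w * p * (1.0 - p))) @ X
        beta = beta - np.linalg.solve(hess, grad)
    return beta


# Toy grouped data with rare successes (all values are illustrative).
rng = np.random.default_rng(0)
n_rows = 200
dummies = rng.integers(0, 2, size=(n_rows, 3)).astype(float)
design = np.column_stack([np.ones(n_rows), dummies])
trials = rng.integers(20, 200, size=n_rows)
true_p = 1.0 / (1.0 + np.exp(-(design @ np.array([-3.5, 0.8, -0.5, 0.3]))))
success = rng.binomial(trials, true_p)
fail = trials - success

# Stack each row twice: once labelled 1 (weight = successes), once 0 (weight = failures).
X = np.vstack((design, design))
y = np.hstack((np.ones(n_rows), np.zeros(n_rows)))
w = np.hstack((success, fail)).astype(float)


def predicted_successes(beta):
    """Expected number of successes over all rows under coefficients beta."""
    return float((trials / (1.0 + np.exp(-(design @ beta)))).sum())


C = 1.0
observed = float(success.sum())
beta_free = fit_logit(X, y, w, C=C, penalize_intercept=False)
beta_pen = fit_logit(X, y, w, C=C, penalize_intercept=True)
pred_free = predicted_successes(beta_free)
pred_pen = predicted_successes(beta_pen)

print("observed successes:           ", observed)
print("predicted, free intercept:    ", pred_free)
print("predicted, penalized intercept:", pred_pen)
print("-b0 / C (penalized fit):      ", -beta_pen[0] / C)
```

In this toy setup the fit with a penalized constant column predicts slightly more successes than were observed, which is the same direction as the bias reported in the thread; how large the gap is depends on C and on the size of the intercept.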
This means that in the training sample the observed fraction of successes (or failures) should coincide with the average predicted probability of success (or failure). The intercept should compensate for the penalization of the other parameters and for any transformation or standardization of the other explanatory variables.

AFAICS from your code snippets: you are using the intercept as part of X and fit_intercept=False. This means your intercept is penalized. If you use the built-in fit_intercept=True, then the intercept is not penalized and the bias should go away.

Josef

> On Dec 15, 2016, at 5:05 PM, Sean Violante wrote:
>
> Sorry just saw you are not using the liblinear solver, agree with Sebastian, you should subtract mean not median
>
> On 15 Dec 2016 11:02 pm, "Sean Violante" wrote:
>
>> The problem is the (stupid!) liblinear solver that also penalises the intercept (in regularisation). Use a different solver or change the intercept_scaling parameter
>>
>> On 15 Dec 2016 10:44 pm, "Sebastian Raschka" wrote:
>>
>>> Subtracting the median wouldn't result in normalizing in the usual sense, since subtracting a constant just shifts the values by a constant. Instead, for logistic regression & most optimizers, I would recommend subtracting the mean to center the features at mean zero and divide by the standard deviation to get "z" scores (e.g., this can be done by the StandardScaler()).
>>>
>>> Best,
>>> Sebastian
>>>
>>> > On Dec 15, 2016, at 4:02 PM, Rachel Melamed wrote:
>>> >
>>> > I just tried it and it did not appear to change the results at all?
>>> > I ran it as follows:
>>> > 1) Normalize dummy variables (by subtracting median) to make a matrix of about 10000 x 5
>>> >
>>> > 2) For each of the 1000 output variables:
>>> > a. Each output variable uses the same dummy variables, but not all settings of covariates are observed for all output variables. So I create the design matrix using patsy per output variable to include pairwise interactions. Then, I have an around 10000 x 350 design matrix, and a matrix I call "success_fail" that has for each setting the number of success and number of fail, so it is of size 10000 x 2
>>> >
>>> > b. Run regression using:
>>> >
>>> > skdesign = np.vstack((design,design))
>>> >
>>> > sklabel = np.hstack((np.ones(success_fail.shape[0]),
>>> >                      np.zeros(success_fail.shape[0])))
>>> >
>>> > skweight = np.hstack((success_fail['success'], success_fail['fail']))
>>> >
>>> > logregN = linear_model.LogisticRegression(C=1,
>>> >                                           solver='lbfgs', fit_intercept=False)
>>> > logregN.fit(skdesign, sklabel, sample_weight=skweight)
>>> >
>>> >> On Dec 15, 2016, at 2:16 PM, Alexey Dral wrote:
>>> >>
>>> >> Could you try to normalize dataset after feature dummy encoding and see if it is reproducible behavior?
>>> >>
>>> >> 2016-12-15 22:03 GMT+03:00 Rachel Melamed :
>>> >> Thanks for the reply. The covariates ("X") are all dummy/categorical variables. So I guess no, nothing is normalized.
>>> >>
>>> >>> On Dec 15, 2016, at 1:54 PM, Alexey Dral wrote:
>>> >>>
>>> >>> Hi Rachel,
>>> >>>
>>> >>> Do you have your data normalized?
>>> >>>
>>> >>> 2016-12-15 20:21 GMT+03:00 Rachel Melamed :
>>> >>> Hi all,
>>> >>> Does anyone have any suggestions for this problem:
>>> >>> http://stackoverflow.com/questions/41125342/sklearn-logistic-regression-gives-biased-results
>>> >>>
>>> >>> I am running around 1000 similar logistic regressions, with the same covariates but slightly different data and response variables. All of my response variables have sparse successes (p(success) < .05 usually).
>>> >>>
>>> >>> I noticed that with the regularized regression, the results are consistently biased to predict more "successes" than is observed in the training data. When I relax the regularization, this bias goes away. The bias observed is unacceptable for my use case, but the more-regularized model does seem a bit better.
>>> >>>
>>> >>> Below, I plot the results for the 1000 different regressions for 2 different values of C:
>>> >>>
>>> >>> I looked at the parameter estimates for one of these regressions: below each point is one parameter. It seems like the intercept (the point on the bottom left) is too high for the C=1 model.
>>> >>>
>>> >>> --
>>> >>> Yours sincerely,
>>> >>> Alexey A. Dral
>>> >>
>>> >> --
>>> >> Yours sincerely,
>>> >> Alexey A.
Dral

_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn

From melamed at uchicago.edu  Sun Dec 18 17:04:08 2016
From: melamed at uchicago.edu (Rachel Melamed)
Date: Sun, 18 Dec 2016 22:04:08 +0000
Subject: [scikit-learn] biased predictions in logistic regression
In-Reply-To:
References: <615E3D60-DBED-4D81-B08E-FD8944233845@uchicago.edu> <21B65DF0-B382-40F9-A9E7-EA03F0D06E8C@uchicago.edu>
Message-ID: <80D4DF68-3649-4B7F-AD07-0D8F8E926960@uchicago.edu>

Josef!
Thank you! You are a gift to the python statistics world. I don't know why I did not think of that. That is the answer.

Rachel

From se.raschka at gmail.com  Mon Dec 19 00:13:58 2016
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Mon, 19 Dec 2016 00:13:58 -0500
Subject: [scikit-learn] n_jobs for LogisticRegression
Message-ID:

Hi,

I just got confused about what exactly n_jobs does for LogisticRegression. I always thought that it was used for one-vs-rest learning, fitting the models for binary classification in parallel. However, it also seems to do something in the multinomial case (at least according to the verbose option). In the docstring it says

> n_jobs : int, optional
>     Number of CPU cores used during the cross-validation loop. If given
>     a value of -1, all cores are used.

and I saw a logistic_regression_path being defined in the code. I am wondering, is this just a workaround for the LogisticRegressionCV, and should the n_jobs docstring in LogisticRegression be described as "Number of CPU cores used for model fitting" instead of "during cross-validation,"
or am I getting this wrong?

Best,
Sebastian

From tom.duprelatour at orange.fr  Mon Dec 19 10:14:37 2016
From: tom.duprelatour at orange.fr (Tom DLT)
Date: Mon, 19 Dec 2016 16:14:37 +0100
Subject: [scikit-learn] n_jobs for LogisticRegression
In-Reply-To:
References:
Message-ID:

Hi,

In LogisticRegression, n_jobs is only used for one-vs-rest parallelization.
In LogisticRegressionCV, n_jobs is used for both one-vs-rest and cross-validation parallelizations.

So in LogisticRegression with multi_class='multinomial', n_jobs should have no impact.

The docstring should probably be updated as you mentioned. PR welcome :)

Best,
Tom

From tevang3 at gmail.com  Mon Dec 19 11:06:36 2016
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Mon, 19 Dec 2016 17:06:36 +0100
Subject: [scikit-learn] combining arrays of features to train an MLP
Message-ID:
Greetings,

My dataset consists of objects which are characterised by their structural features, which are encoded into a so-called "fingerprint" form. There are several different types of fingerprints, each one encapsulating a different type of information. I want to combine two specific types of fingerprints to train an MLP regressor. The first fingerprint consists of a 2048-bit array of the form:

FP1 = array([ 1., 1., 0., ..., 0., 0., 1.], dtype=float32)

The second is an array of 60 floats of the form:

FP2 = array([ 2.77494618, 0.98973243, 0.34638652, 2.88303715, 1.31473857,
       -0.56627112, 4.78847547, 2.29587913, -0.6786228 , 4.63391109,
       ...
       0. , 0. , 5.89652792, 0. , 0. ])

At first I tried to fuse them into a single 1D array of 2048+60 columns, but the predictions of the MLP were worse than those of the 2 different MLP models trained on each of the 2 fingerprint types individually. My question: is there a more effective way to combine the 2 fingerprints in order to indicate that they represent different types of information?

To this end, I tried to create a 2-row array (1st row 2048 elements and 2nd row 60 elements) but sklearn complained:

    mlp.fit(x_train,y_train)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 618, in fit
    return self._fit(X, y, incremental=False)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 330, in _fit
    X, y = self._validate_input(X, y, incremental)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 1264, in _validate_input
    multi_output=True, y_numeric=True)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 521, in check_X_y
    ensure_min_features, warn_on_dtype, estimator)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 402, in check_array
    array = array.astype(np.float64)
ValueError: setting an array element with a sequence.

Then I tried to create for each object of the dataset a 2D array of size 2x2048, by adding 1988 zeros to the second row so that both rows are of equal size. However, sklearn complained again:

    mlp.fit(x_train,y_train)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 618, in fit
    return self._fit(X, y, incremental=False)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 330, in _fit
    X, y = self._validate_input(X, y, incremental)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 1264, in _validate_input
    multi_output=True, y_numeric=True)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 521, in check_X_y
    ensure_min_features, warn_on_dtype, estimator)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 405, in check_array
    % (array.ndim, estimator_name))
ValueError: Found array with dim 3. Estimator expected <= 2.
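For reference, a small sketch of the array layouts involved, using random stand-in fingerprints (the sample count and the values are made up, not taken from the dataset described here). scikit-learn's input validation accepts a 2D array of shape (n_samples, n_features), so the side-by-side concatenation passes, while one padded 2x2048 block per object gives a 3D array and reproduces the second error.

```python
import numpy as np
from sklearn.utils import check_array

rng = np.random.RandomState(0)
n_samples = 8  # hypothetical dataset size

# Random stand-ins for the two fingerprint types described above.
fp1 = rng.randint(0, 2, size=(n_samples, 2048)).astype(np.float32)  # bit vectors
fp2 = rng.normal(size=(n_samples, 60))                              # float descriptors

# Layout A: one row per object, the two fingerprints side by side.
X_flat = np.hstack([fp1, fp2])
print(check_array(X_flat).shape)

# Layout B: one 2 x 2048 block per object, second row zero-padded.
fp2_padded = np.zeros((n_samples, 2048))
fp2_padded[:, :60] = fp2
X_stacked = np.stack([fp1, fp2_padded], axis=1)
print(X_stacked.shape)

# The same validation that estimators run on X rejects the 3D layout.
try:
    check_array(X_stacked)
    rejected = False
except ValueError as err:
    rejected = True
    print("rejected:", err)
```

Layout A is the only one of the two that an estimator such as MLPRegressor will accept; any hint that the blocks differ in kind has to come from preprocessing or from the model itself, not from the array shape.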
In another case of fingerprints, let's name them FP3 and FP4, I observed that the MLP regressor created using FP3 yields better results when trained and evaluated using logarithmically transformed experimental values (the values in the y_train and y_test 1D arrays), while the MLP regressor created using FP4 yielded better results using the original experimental values. So my second question is: when combining both FP3 and FP4 into a single array, is there any way to designate to the MLP that the features that correspond to FP3 must reproduce the logarithmic transform of the experimental values, while the features of FP4 must reproduce the original untransformed experimental values?

I would greatly appreciate any advice on either of my 2 queries.

Thomas

From se.raschka at gmail.com  Mon Dec 19 12:17:13 2016
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Mon, 19 Dec 2016 12:17:13 -0500
Subject: [scikit-learn] combining arrays of features to train an MLP
In-Reply-To:
References:
Message-ID: <8B7BDBF6-5780-4891-B7AC-F4B44C21D39D@gmail.com>

Thanks, Thomas, that makes sense! Will submit a PR then to update the docstring.

Best,
Sebastian

From tevang3 at gmail.com  Mon Dec 19 16:56:56 2016
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Mon, 19 Dec 2016 22:56:56 +0100
Subject: [scikit-learn] combining arrays of features to train an MLP
In-Reply-To: <8B7BDBF6-5780-4891-B7AC-F4B44C21D39D@gmail.com>
References: <8B7BDBF6-5780-4891-B7AC-F4B44C21D39D@gmail.com>
Message-ID:

this means that both are feasible?

On 19 December 2016 at 18:17, Sebastian Raschka wrote:

> Thanks, Thomas, that makes sense! Will submit a PR then to update the
> docstring.
>
> Best,
> Sebastian

From se.raschka at gmail.com  Mon Dec 19 17:42:51 2016
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Mon, 19 Dec 2016 17:42:51 -0500
Subject: [scikit-learn] combining arrays of features to train an MLP
In-Reply-To:
References: <8B7BDBF6-5780-4891-B7AC-F4B44C21D39D@gmail.com>
Message-ID:

Oh, sorry, I just noticed that I was in the wrong thread; I meant to answer a different Thomas :P.

Regarding the fingerprints: scikit-learn's estimators expect feature vectors as samples, so you can't have a 3D array. E.g., think of image classification: here you also unroll the n_pixels times m_pixels array into 1D arrays.

The low performance can have multiple causes. In case dimensionality is an issue, I'd maybe try stronger regularization first, or feature selection.
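One way to read the regularization suggestion, sketched on random stand-in data: keep the concatenated 2D layout, standardize only the 60 float columns so they are on a scale comparable to the bits, and raise MLPRegressor's L2 penalty alpha. The column split, alpha=1.0, and the network size are illustrative assumptions, not values from this thread, and ColumnTransformer requires scikit-learn 0.20 or later.

```python
import warnings

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
n_samples = 60  # hypothetical dataset size

# Random stand-ins: 2048 bits and 60 unscaled floats per object, random targets.
fp1 = rng.randint(0, 2, size=(n_samples, 2048)).astype(float)
fp2 = rng.normal(loc=2.0, scale=3.0, size=(n_samples, 60))
X = np.hstack([fp1, fp2])
y = rng.normal(size=n_samples)

# Standardize only the float block (columns 2048..2107); pass the bits through.
scale_floats = ColumnTransformer(
    [("fp2", StandardScaler(), slice(2048, 2108))],
    remainder="passthrough",
)

# alpha is the L2 penalty; 1.0 is far above the default and is only a starting guess.
model = make_pipeline(
    scale_floats,
    MLPRegressor(hidden_layer_sizes=(32,), alpha=1.0, max_iter=200, random_state=0),
)

with warnings.catch_warnings():
    # The short max_iter may stop before convergence on this toy data.
    warnings.simplefilter("ignore", ConvergenceWarning)
    model.fit(X, y)

pred = model.predict(X)
print(pred.shape)
```

In practice alpha would be chosen by cross-validation (for example with GridSearchCV over the pipeline) rather than fixed by hand.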
If you are working with molecular structures, and you have enough of them, maybe also consider alternative feature representations, e.g., learning from the graphs directly:

http://papers.nips.cc/paper/5954-convolutional-networks-on-graphs-for-learning-molecular-fingerprints.pdf
http://pubs.acs.org/doi/abs/10.1021/ci400187y

Best,
Sebastian

> On Dec 19, 2016, at 4:56 PM, Thomas Evangelidis wrote:
>
> this means that both are feasible?
>
> On 19 December 2016 at 18:17, Sebastian Raschka wrote:
> Thanks, Thomas, that makes sense! Will submit a PR then to update the docstring.
>
> Best,
> Sebastian

From se.raschka at gmail.com Mon Dec 19 17:44:00 2016
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Mon, 19 Dec 2016 17:44:00 -0500
Subject: [scikit-learn] n_jobs for LogisticRegression
In-Reply-To:
References:
Message-ID: <52BB82C8-1698-4131-9BD2-E54A655A7C74@gmail.com>

Thanks, Tom, that makes sense. Submitted a PR to fix that.

Best,
Sebastian

> On Dec 19, 2016, at 10:14 AM, Tom DLT wrote:
>
> Hi,
>
> In LogisticRegression, n_jobs is only used for one-vs-rest parallelization.
> In LogisticRegressionCV, n_jobs is used for both one-vs-rest and cross-validation parallelizations.
>
> So in LogisticRegression with multi_class='multinomial', n_jobs should have no impact.
>
> The docstring should probably be updated as you mentioned. PR welcome :)
>
> Best,
> Tom
>
> 2016-12-19 6:13 GMT+01:00 Sebastian Raschka :
> Hi,
>
> I just got confused what exactly n_jobs does for LogisticRegression. Always thought that it was used for one-vs-rest learning, fitting the models for binary classification in parallel. However, it also seem to do sth in the multinomial case (at least according to the verbose option). in the docstring it says
>
> > n_jobs : int, optional
> > Number of CPU cores used during the cross-validation loop. If given
> > a value of -1, all cores are used.
>
> and I saw a logistic_regression_path being defined in the code. I am wondering, is this just a workaround for the LogisticRegressionCV, and should the n_jobs docstring in LogisticRegression
> be described as "Number of CPU cores used for model fitting" instead of "during cross-validation," or am I getting this wrong?
>
> Best,
> Sebastian

From tevang3 at gmail.com Mon Dec 19 18:51:02 2016
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Tue, 20 Dec 2016 00:51:02 +0100
Subject: [scikit-learn] combining arrays of features to train an MLP
In-Reply-To:
References: <8B7BDBF6-5780-4891-B7AC-F4B44C21D39D@gmail.com>
Message-ID:

Thank you, these articles discuss ML applications of the types of fingerprints I am working with! I will read them thoroughly to get some hints.

In the meantime I tried to eliminate some features using RandomizedLasso, and the performance improved from R=0.067 using all 615 features to R=0.524 using only the 15 top-ranked features. Naive question: does it make sense to use RandomizedLasso to select the good features in order to train an MLP? I had the impression that RandomizedLasso uses multivariate linear regression to fit the observed values to the experimental ones and rank the features.

Another question: this dataset consists of 31 observations. The Pearson's R values that I reported above were calculated using cross-validation. Could someone claim that they are inaccurate because the number of features used for training the MLP is much larger than the number of observations?

On 19 December 2016 at 23:42, Sebastian Raschka wrote:

> Oh, sorry, I just noticed that I was in the wrong thread; meant to answer a different Thomas :P.
>
> Regarding the fingerprints: scikit-learn's estimators expect feature vectors as samples, so you can't have a 3D array. E.g., think of image classification: there you also unroll the n_pixels x m_pixels array into 1D arrays.
>
> The low performance can have multiple causes. In case dimensionality is an issue, I'd maybe try stronger regularization first, or feature selection.
> If you are working with molecular structures, and you have enough of them, maybe also consider alternative feature representations, e.g., learning from the graphs directly:
>
> http://papers.nips.cc/paper/5954-convolutional-networks-on-graphs-for-learning-molecular-fingerprints.pdf
> http://pubs.acs.org/doi/abs/10.1021/ci400187y
>
> Best,
> Sebastian

-------------- next part
-------------- An HTML attachment was scrubbed...
URL:

From strategist922 at gmail.com Tue Dec 20 01:38:57 2016
From: strategist922 at gmail.com (James Chang)
Date: Tue, 20 Dec 2016 14:38:57 +0800
Subject: [scikit-learn] Failed when make PDF document by "make latexpdf"
Message-ID:

Hi,

Does anyone have issues when executing "make latexpdf" to build the PDF documentation?

Or can I directly download the latest PDF documentation for the current stable version, scikit-learn v0.18.1, somewhere?

PS. I ran the command under Mac OS X 10.12.1.

Thanks in advance and best regards,
James

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From loic.esteve at ymail.com Tue Dec 20 02:29:32 2016
From: loic.esteve at ymail.com (Loïc Estève)
Date: Tue, 20 Dec 2016 08:29:32 +0100
Subject: [scikit-learn] Failed when make PDF document by "make latexpdf"
In-Reply-To:
References:
Message-ID: <096ed674-8c0b-36d2-df30-0efe23095702@ymail.com>

Hi,

you can get the PDF documentation from the website, see attached screenshot.

Cheers,
Loïc

On 12/20/2016 07:38 AM, James Chang wrote:
> Hi,
>
> Does anyone have issues when executing "make latexpdf" to build the PDF documentation?
>
> Or can I directly download the latest PDF documentation for the current stable version, scikit-learn v0.18.1, somewhere?
>
> PS. I ran the command under Mac OS X 10.12.1.
>
> Thanks in advance and best regards,
> James

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screenshot_20161220_082154.png
Type: image/png
Size: 285517 bytes
Desc: not available
URL:

From strategist922 at gmail.com Tue Dec 20 02:46:55 2016
From: strategist922 at gmail.com (James Chang)
Date: Tue, 20 Dec 2016 15:46:55 +0800
Subject: [scikit-learn] Failed when make PDF document by "make latexpdf"
In-Reply-To: <096ed674-8c0b-36d2-df30-0efe23095702@ymail.com>
References: <096ed674-8c0b-36d2-df30-0efe23095702@ymail.com>
Message-ID:

Hi Loïc,

thank you, I finally got the PDF file.

Thanks and best regards,
James

2016-12-20 15:29 GMT+08:00 Loïc Estève via scikit-learn <scikit-learn at python.org>:

> Hi,
>
> you can get the PDF documentation from the website, see attached screenshot.
>
> Cheers,
> Loïc

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From se.raschka at gmail.com Tue Dec 20 14:00:17 2016
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Tue, 20 Dec 2016 14:00:17 -0500
Subject: [scikit-learn] combining arrays of features to train an MLP
In-Reply-To:
References: <8B7BDBF6-5780-4891-B7AC-F4B44C21D39D@gmail.com>
Message-ID: <8F3A4E7B-3A92-4B04-BDCC-994C4C9B2C78@gmail.com>

Hi, Thomas,

I haven't looked at what RandomizedLasso does exactly, but like you said, it is probably not ideal for combining with an MLP. In terms of regularization, I was thinking more of L1 and L2 penalties for the hidden layers, or dropout.
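As a rough illustration of the regularization idea, and not something from the thread: scikit-learn's MLPRegressor exposes an L2 penalty through its `alpha` parameter, but it offers no L1 penalty and no dropout, so those would need another library. The sketch below compares a few `alpha` values by cross-validation on random placeholder data; the data, the grid of values, the `lbfgs` solver and the layer size are all assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Placeholder data: 31 samples, 100 features, only the first one informative.
X = rng.normal(size=(31, 100))
y = X[:, 0] + 0.1 * rng.normal(size=31)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = {}
for alpha in (1e-4, 1e-1, 10.0):
    # alpha is the strength of the L2 penalty on the network weights.
    model = make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(20,), alpha=alpha, solver="lbfgs",
                     max_iter=500, random_state=0),
    )
    # Mean negative MSE over the folds; values closer to zero are better.
    scores[alpha] = cross_val_score(
        model, X, y, cv=cv, scoring="neg_mean_squared_error"
    ).mean()

for alpha, score in sorted(scores.items()):
    print("alpha=%g  mean neg. MSE=%.3f" % (alpha, score))
```

Keeping the scaler inside the pipeline means it is refitted on each training fold, so nothing from the held-out fold leaks into preprocessing.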
However, given such a small sample size (and the small sample-to-feature ratio), I think there are way too many (hyper)parameters to fit in an MLP to get good results. I think you could be better off with a kernel SVM (if linear models don't work well) or ensemble learning.

Best,
Sebastian

> On Dec 19, 2016, at 6:51 PM, Thomas Evangelidis wrote:
>
> Thank you, these articles discuss ML applications of the types of fingerprints I am working with! I will read them thoroughly to get some hints.
>
> In the meantime I tried to eliminate some features using RandomizedLasso, and the performance improved from R=0.067 using all 615 features to R=0.524 using only the 15 top-ranked features. Naive question: does it make sense to use RandomizedLasso to select the good features in order to train an MLP? I had the impression that RandomizedLasso uses multivariate linear regression to fit the observed values to the experimental ones and rank the features.
>
> Another question: this dataset consists of 31 observations. The Pearson's R values that I reported above were calculated using cross-validation. Could someone claim that they are inaccurate because the number of features used for training the MLP is much larger than the number of observations?
>
> _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From joel.nothman at gmail.com Wed Dec 21 05:03:42 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Wed, 21 Dec 2016 21:03:42 +1100 Subject: [scikit-learn] Bookmarklet to view documentation on CircleCI Message-ID: At https://gist.github.com/jnothman/bf76d02f60af6476221ec65c63c77e60 I've created a bookmarklet which, when viewing a pull request page for which the CircleCI build has finished, will identify the circle build number and open a new tab with the changed documentation files corresponding to that PR. -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Wed Dec 21 05:03:59 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Wed, 21 Dec 2016 21:03:59 +1100 Subject: [scikit-learn] Bookmarklet to view documentation on CircleCI In-Reply-To: References: Message-ID: I hope it's useful to someone else. On 21 December 2016 at 21:03, Joel Nothman wrote: > At https://gist.github.com/jnothman/bf76d02f60af6476221ec65c63c77e60 I've > created a bookmarklet which, when viewing a pull request page for which the > CircleCI build has finished, will identify the circle build number and open > a new tab with the changed documentation files corresponding to that PR. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From nfliu at uw.edu Wed Dec 21 15:06:55 2016 From: nfliu at uw.edu (Nelson Liu) Date: Wed, 21 Dec 2016 20:06:55 +0000 Subject: [scikit-learn] Bookmarklet to view documentation on CircleCI In-Reply-To: References: Message-ID: This is great, thanks for sharing Joel! On Wed, Dec 21, 2016 at 12:08 AM Joel Nothman wrote: > I hope it's useful to someone else. 
>
> On 21 December 2016 at 21:03, Joel Nothman wrote:
>
> At https://gist.github.com/jnothman/bf76d02f60af6476221ec65c63c77e60 I've created a bookmarklet which, when viewing a pull request page for which the CircleCI build has finished, will identify the circle build number and open a new tab with the changed documentation files corresponding to that PR.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From gael.varoquaux at normalesup.org Wed Dec 21 17:33:38 2016
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Wed, 21 Dec 2016 23:33:38 +0100
Subject: [scikit-learn] Bookmarklet to view documentation on CircleCI
In-Reply-To:
References:
Message-ID: <20161221223338.GA334300@phare.normalesup.org>

It's super neat. It's a pity that I don't see a way of integrating it into the GitHub interface.

Gaël

On Wed, Dec 21, 2016 at 09:03:59PM +1100, Joel Nothman wrote:
> I hope it's useful to someone else.

> On 21 December 2016 at 21:03, Joel Nothman wrote:

> At https://gist.github.com/jnothman/bf76d02f60af6476221ec65c63c77e60 I've created a bookmarklet which, when viewing a pull request page for which the CircleCI build has finished, will identify the circle build number and open a new tab with the changed documentation files corresponding to that PR.
--
Gael Varoquaux
Researcher, INRIA Parietal
NeuroSpin/CEA Saclay, Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info http://twitter.com/GaelVaroquaux

From joel.nothman at gmail.com Wed Dec 21 19:48:00 2016
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 22 Dec 2016 11:48:00 +1100
Subject: [scikit-learn] Bookmarklet to view documentation on CircleCI
In-Reply-To: <20161221223338.GA334300@phare.normalesup.org>
References: <20161221223338.GA334300@phare.normalesup.org>
Message-ID:

Well, you can, as a browser extension. I just haven't bothered to investigate that technology when there's so much code to review and write.

On 22 December 2016 at 09:33, Gael Varoquaux wrote:

> It's super neat. It's a pity that I don't see a way of integrating it into the GitHub interface.
>
> Gaël
> > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > -- > Gael Varoquaux > Researcher, INRIA Parietal > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France > Phone: ++ 33-1-69-08-79-68 > http://gael-varoquaux.info http://twitter.com/GaelVaroquaux > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mailfordebu at gmail.com Mon Dec 26 13:28:54 2016 From: mailfordebu at gmail.com (Debabrata Ghosh) Date: Mon, 26 Dec 2016 23:58:54 +0530 Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library Message-ID: Dear All, Greetings! I need some urgent guidance and help from you all in model scoring. What I mean by model scoring is around the following steps: 1. I have trained a Random Classifier model using scikit-learn (RandomForestClassifier library) 2. Then I have generated the True Positive and False Positive predictions on my test data set using predict_proba method (I have splitted my data into training and test samples in 80:20 ratio) 3. Finally, I have dumped the model into a pkl file. 4. Next in another instance, I have loaded the .pkl file 5. I have initiated job_lib.predict_proba method for predicting the True Positive and False positives on a different sample. I am terming this step as scoring whether I am predicting without retraining the model My question is when I generate the True Positive Rate on the test data set (as part of model training approach), the rate which I am getting is 10 ? 12%. But when I do the scoring (using the steps mentioned above), my True Positive Rate is shooting high upto 80%. 
Although I am happy to get a very high TPR, my question is whether getting such a high TPR during the scoring phase is an expected outcome. In other words, is achieving a high TPR through joblib an accepted outcome vis-à-vis the TPR on the training / test data set?

Your views on the above will be really helpful, as I am very confused about whether to consider scoring the model using joblib. Otherwise, is there any alternative to joblib which can help me do scoring without retraining the model? Please let me know at your earliest convenience, as I am a bit pressed.

Thanks for your help in advance!

Cheers,

Debu

From joel.nothman at gmail.com Mon Dec 26 15:26:28 2016
From: joel.nothman at gmail.com (Joel Nothman)
Date: Tue, 27 Dec 2016 07:26:28 +1100
Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library
In-Reply-To:
References:
Message-ID:

Hi Debu,

Your post is terminologically confusing, so I'm not sure I've understood your problem. Where is the "different sample" used for scoring coming from? Is it possible it is more related to the training data than the test sample?

Joel

From mailfordebu at gmail.com Tue Dec 27 00:26:22 2016
From: mailfordebu at gmail.com (Debabrata Ghosh)
Date: Tue, 27 Dec 2016 10:56:22 +0530
Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library
In-Reply-To:
References:
Message-ID:

Hi Joel,

Thanks for your quick feedback. I certainly understand what you mean; please allow me to explain one more time through the sequence of steps I followed:

1. I considered a dataset containing 600 K (0.6 million) records for training my model using scikit-learn's Random Forest Classifier
2. I did a training and test sample split on the 600 K, forming a 480 K training dataset and a 120 K test dataset (80:20 split)
3. I trained scikit-learn's Random Forest Classifier model on the 480 K (80% split) training sample
4. Then I ran prediction (predict_proba method of scikit-learn's RF classifier) on the 120 K test sample
5. I got a prediction result with a True Positive Rate (TPR) of 10-12% at probability thresholds above 0.5
6. I saved the above Random Forest Classifier model using the joblib library (dump method) in the form of a pickle file
7. I reloaded the model in a different Python instance from the pickle file mentioned above and did my scoring, i.e., used the joblib load method and then ran prediction (predict_proba method) on the entire set of my original 600 K records
8. Now when I am running (scoring) my model using joblib.predict_proba on the entire set of original data (600 K), I am getting a True Positive Rate of around 80%.
9. I did some further analysis and figured out that during the training process, when the model was predicting on the test sample of 120 K, it could only predict 10-12% of the 120 K data beyond a probability threshold of 0.5. When I am now trying to score my model on the entire set of 600 K records, it appears that the model is remembering some of its past behavior and data and accordingly giving an 80% True Positive Rate
10. When I tried to score the model using joblib.predict_proba on a dataset completely disjoint from the one used for training (i.e., no overlap between training and scoring data), it gives me the right True Positive Rate (in the range of 10-12%)

*Here lies my question once again:* Should I be using 2 different input datasets (completely exclusive / disjoint) for training and scoring the models? If the input datasets for scoring and training overlap, then I get incorrect results. Would that be a fair assumption?

Another question: is there an alternate model scoring library (apart from joblib, the one I am using)?
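
The sequence of steps above can be sketched in a few lines. This is only an illustrative sketch, not the code used in this thread: it swaps in a toy lookup-table "model" and the standard-library pickle module for RandomForestClassifier and joblib so that it runs without extra dependencies, and the names fit, predict and recall are made up for the example. The toy model memorises its training rows, which is an exaggerated form of overfitting, so it shows how recall measured on data that overlaps the training set gets inflated.

```python
import pickle
import random


def fit(X, y):
    """Toy stand-in for an overfit model: memorise every training row."""
    return {tuple(row): label for row, label in zip(X, y)}


def predict(table, X):
    """Look rows up in the table; unseen rows default to the negative class."""
    return [table.get(tuple(row), 0) for row in X]


def recall(y_true, y_pred):
    """True positive rate: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    positives = sum(y_true)
    return tp / positives if positives else 0.0


rng = random.Random(0)
X = [(rng.random(), rng.random()) for _ in range(1000)]
y = [1 if rng.random() < 0.3 else 0 for _ in X]  # labels unrelated to features

# 80:20 split, as in steps 2-3.
split = int(0.8 * len(X))
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

model = fit(X_train, y_train)

# Steps 6-7: persist and reload; the reloaded object behaves identically.
restored = pickle.loads(pickle.dumps(model))

recall_train = recall(y_train, predict(restored, X_train))  # rows seen in training
recall_test = recall(y_test, predict(restored, X_test))     # disjoint rows
recall_all = recall(y, predict(restored, X))                 # train + test pooled

print(recall_train, recall_test, recall_all)
```

In this toy setting the pooled figure lands near the training fraction, while the disjoint test rows give the only honest estimate. Saving and reloading changes nothing about the predictions; the difference comes entirely from which rows are evaluated.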
Thanks once again for your feedback in advance!

Cheers,

Debu

From ahowe42 at gmail.com Tue Dec 27 02:18:42 2016
From: ahowe42 at gmail.com (Andrew Howe)
Date: Tue, 27 Dec 2016 10:18:42 +0300
Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library
In-Reply-To:
References:
Message-ID:

Hi Debu

"Should I be using 2 different input datasets (completely exclusive / disjoint) for training and scoring the models ?"

Yes - this is the reason for partitioning the data into training / testing sets. However, I can't imagine that it's the cause of your odd results. What is the total classification result in both training & testing (not just TPs)?

Andrew

<~~~~~~~~~~~~~~~~~~~~~~~~~~~>
J. Andrew Howe, PhD
www.andrewhowe.com
http://www.linkedin.com/in/ahowe42
https://www.researchgate.net/profile/John_Howe12/
I live to learn, so I can learn to live. - me
<~~~~~~~~~~~~~~~~~~~~~~~~~~~>

From rth.yurchak at gmail.com Tue Dec 27 04:51:39 2016
From: rth.yurchak at gmail.com (Roman Yurchak)
Date: Tue, 27 Dec 2016 10:51:39 +0100
Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library
In-Reply-To:
References:
Message-ID: <586239AB.80500@gmail.com>

Hi Debu,

On 27/12/16 08:18, Andrew Howe wrote:
> 5. I got a prediction result with True Positive Rate (TPR) as 10-12
> % on probability thresholds above 0.5

Getting a high True Positive Rate (recall) is not a sufficient condition for a well behaved model. Though 0.1 recall is still pretty bad. You could look at the precision at the same time (or consider, for instance, the F1 score).

> 7.
I reloaded the model in a different python instance from the pickle file mentioned above and did my scoring , i.e., used joblib library load method and then instantiated prediction (predict_proba method) on the entire set of my original 600 K records

> Another question - is there an alternate model scoring
> library (apart from joblib, the one I am using) ?

Joblib is not a scoring library; once you load a model from disk with joblib you should get ~ the same RandomForestClassifier estimator object as before saving it.

> 8. Now when I am running (scoring) my model using
> joblib.predict_proba on the entire set of original data (600 K),
> I am getting a True Positive rate of around 80%.

That sounds normal, considering what you are doing. Your entire set consists of 80% of training set (for which the recall, I imagine, would be close to 1.0) and 20 % test set (with a recall of 0.1), so on average you would get a recall close to 0.8 for the complete set. Unless I missed something.

> 9. I did some further analysis and figured out that during the
> training process, when the model was predicting on the test
> sample of 120K it could only predict 10-12% of 120K data beyond
> a probability threshold of 0.5. When I am now trying to score my
> model on the entire set of 600 K records, it appears that the
> model is remembering some of its past behavior and data and
> accordingly throwing 80% True positive rate

It feels like your RandomForestClassifier is not properly tuned. A recall of 0.1 on the test set is quite low. It could be worth trying to tune it better (cf. https://stackoverflow.com/a/36109706 ), using some other metric than the recall to evaluate the performance.
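
The averaging argument above, and the suggestion to look at precision and F1 alongside recall, can be made concrete with a little arithmetic. A minimal sketch: the helper names and the confusion-matrix counts are invented for illustration, and the pooled figure assumes the positives are spread across the training and test parts in the same 80:20 proportion as the rows.

```python
def pooled_recall(train_fraction, recall_train, recall_test):
    """Recall on train + test pooled, weighting each part by its share of positives."""
    return train_fraction * recall_train + (1.0 - train_fraction) * recall_test


def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Figures from the discussion: 80% training rows (recall near 1.0),
# 20% test rows (recall near 0.1).
pooled = pooled_recall(0.8, 1.0, 0.1)
print(round(pooled, 2))  # 0.82

# Hypothetical test-set counts: 10 true positives, 30 false positives,
# 90 false negatives.
precision, recall, f1 = precision_recall_f1(tp=10, fp=30, fn=90)
print(precision, recall, round(f1, 3))
```

With these made-up counts the recall is 0.1, as on the test set in this thread, and the precision and F1 show how many of the predicted positives are actually correct, which recall alone does not reveal.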

Roman

From joel.nothman at gmail.com Tue Dec 27 05:52:30 2016
From: joel.nothman at gmail.com (Joel Nothman)
Date: Tue, 27 Dec 2016 21:52:30 +1100
Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library
In-Reply-To: <586239AB.80500@gmail.com>
References: <586239AB.80500@gmail.com>
Message-ID:

Your model is overfit to the training data. Not to say that it's necessarily possible to get a better fit. The default settings for trees lean towards a tight fit, so you might modify their parameters to increase regularisation. Still, you should not expect that evaluating a model's performance on its training data will be indicative of its general performance. This is why we use held-out test sets and cross-validation.

From mailfordebu at gmail.com Tue Dec 27 12:17:05 2016
From: mailfordebu at gmail.com (Debabrata Ghosh)
Date: Tue, 27 Dec 2016 22:47:05 +0530
Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library
In-Reply-To:
References: <586239AB.80500@gmail.com>
Message-ID:

Dear Joel, Andrew and Roman,

Thank you very much for your individual feedback! It's very helpful indeed! A few more points related to my model execution:

1. By the term "scoring" I meant the process of executing the model once again without retraining it. So, for training the model I used RandomForestClassifier, and for my scoring (execution without retraining) I have used joblib.dump and joblib.load
2. I have used the parameter n_estimators = 5000 while training my model. Besides that, I have used n_jobs = -1 and haven't used any other parameter
3. For my "scoring" activity (executing the model without retraining it), is there an alternate approach to the joblib library?
4. When I execute my scoring job (joblib method) on a dataset which is completely different from my training dataset, I get a similar True Positive Rate and False Positive Rate as in training
5. However, when I execute my scoring job on the same dataset used for training my model, I get very high TPR and FPR.

Is there a mechanism through which I can visualise the trees created by my RandomForestClassifier algorithm? When I dumped the model using joblib.dump, a bunch of .npy files were created. Will those contain the trees?

Thanks in advance!

Cheers,

Debu

From g.lemaitre58 at gmail.com Tue Dec 27 12:48:29 2016
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Tue, 27 Dec 2016 18:48:29 +0100
Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library
In-Reply-To:
References: <586239AB.80500@gmail.com>
Message-ID:

On 27 December 2016 at 18:17, Debabrata Ghosh wrote:

> Dear Joel, Andrew and Roman,
> Thank you very much
> for your individual feedback ! It's very helpful indeed ! A few more points
> related to my model execution:
>
> 1. By the term "scoring" I meant the process of executing the model once
> again without retraining it. So , for training the model I used
> RandomForestClassifier library and for my scoring (execution without
> retraining) I have used joblib.dump and joblib.load

It is probably better to use the terms training, validating, and testing. This is pretty much standard. Scoring is just the value of a metric given some data (training data, validation data, or testing data).

> 2. I have used the parameter n_estimators = 5000 while training my model.
> Besides it , I have used n_jobs = -1 and haven't used any other parameter

You should probably check those other parameters and understand what their effects are. You should really check Roman's link, since GridSearchCV can help you decide how to set the parameters. http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV

Additionally, 5000 trees seems like a lot to me.

> 3.
For my "scoring" activity (executing the model without retraining it)
> is there an alternate approach to joblib library ?
>

Joblib only stores data. It has no link with scoring (see Roman's answer).

>
> 4. When I execute my scoring job (joblib method) on a dataset , which is
> completely different to my training dataset then I get similar True
> Positive Rate and False Positive Rate as of training
>

That is what you should get.

>
> 5. However, when I execute my scoring job on the same dataset used for
> training my model then I get very high TPR and FPR.
>

You are testing on data which you used while training. One of the first
rules is not to do that. If you want to evaluate your classifier, keep a
separate set (test set) and only test on that one. As previously mentioned
by Roman, 80% of your data are already known by the RandomForestClassifier
and will be perfectly classified.

>
> Is there a mechanism through which I can visualise the trees created by
> my RandomForestClassifier algorithm ? While I dumped the model using
> joblib.dump , there are a bunch of .npy files created. Will those
> contain the trees ?
>

You can visualize the trees with sklearn.tree.export_graphviz:
http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html

The .npy files are the data needed to load the RandomForestClassifier
which you previously dumped.

>
> Thanks in advance !
>
> Cheers,
>
> Debu
>
> On Tue, Dec 27, 2016 at 4:22 PM, Joel Nothman wrote:
>
>> Your model is overfit to the training data. Not to say that it's
>> necessarily possible to get a better fit. The default settings for trees
>> lean towards a tight fit, so you might modify their parameters to increase
>> regularisation. Still, you should not expect that evaluating a model's
>> performance on its training data will be indicative of its general
>> performance. This is why we use held-out test sets and cross-validation.
>>
>> On 27 December 2016 at 20:51, Roman Yurchak wrote:
>>
>>> Hi Debu,
>>>
>>> On 27/12/16 08:18, Andrew Howe wrote:
>>> > 5. I got a prediction result with True Positive Rate (TPR) as
>>> > 10-12 % on probability thresholds above 0.5
>>>
>>> Getting a high True Positive Rate (recall) is not a sufficient condition
>>> for a well behaved model. Though 0.1 recall is still pretty bad. You
>>> could look at the precision at the same time (or consider, for instance,
>>> the F1 score).
>>>
>>> > 7. I reloaded the model in a different python instance from the
>>> > pickle file mentioned above and did my scoring , i.e., used
>>> > joblib library load method and then instantiated prediction
>>> > (predict_proba method) on the entire set of my original 600 K
>>> > records
>>> > Another question - is there an alternate model scoring
>>> > library (apart from joblib, the one I am using) ?
>>>
>>> Joblib is not a scoring library; once you load a model from disk with
>>> joblib you should get ~ the same RandomForestClassifier estimator object
>>> as before saving it.
>>>
>>> > 8. Now when I am running (scoring) my model using
>>> > joblib.predict_proba on the entire set of original data (600 K),
>>> > I am getting a True Positive rate of around 80%.
>>>
>>> That sounds normal, considering what you are doing. Your entire set
>>> consists of 80% of training set (for which the recall, I imagine, would
>>> be close to 1.0) and 20 % test set (with a recall of 0.1), so on
>>> average you would get a recall close to 0.8 for the complete set. Unless
>>> I missed something.
>>>
>>> > 9. I did some further analysis and figured out that during the
>>> > training process, when the model was predicting on the test
>>> > sample of 120K it could only predict 10-12% of 120K data beyond
>>> > a probability threshold of 0.5. When I am now trying to score my
>>> > model on the entire set of 600 K records, it appears that the
>>> > model is remembering some of its past behavior and data and
>>> > accordingly throwing 80% True positive rate
>>>
>>> It feels like your RandomForestClassifier is not properly tuned. A
>>> recall of 0.1 on the test set is quite low. It could be worth trying to
>>> tune it better (cf. https://stackoverflow.com/a/36109706 ), using some
>>> other metric than the recall to evaluate the performance.
>>>
>>> Roman
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn

--
Guillaume Lemaitre
INRIA Saclay - Ile-de-France
Equipe PARIETAL
guillaume.lemaitre at inria.fr --- https://glemaitre.github.io/

From mailfordebu at gmail.com  Tue Dec 27 13:38:29 2016
From: mailfordebu at gmail.com (Debabrata Ghosh)
Date: Wed, 28 Dec 2016 00:08:29 +0530
Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library
In-Reply-To:
References: <586239AB.80500@gmail.com>
Message-ID:

Thanks Guillaume for your quick feedback ! Appreciate it a lot.

I will definitely try out the links you have given. Another quick one
please. My objective is to execute the model without retraining it. Let me
give you an example to elaborate - I train my model on a huge set of data
(six months' worth of historic data) and finalise my model.
Now going forward I need to run my model against a smaller set of data
(daily data), and for that I wouldn't need to retrain my model daily.

Given the above scenario, I wanted to confirm once more whether this is a
good approach: after training the model I use joblib.dump, and then while
executing the model on a daily basis I use joblib.load. I am using
joblib.dump(clf, 'model.pkl') and, for loading, joblib.load('model.pkl').
I am not leveraging any of the *.npy files generated in the folder.

Now, as you mentioned, joblib is a mechanism to save the data, but my
objective is not to load the data used during the model training but only
the algorithm, so that I can run the model on a fresh set of data after
loading data. And indeed my model is running fine after I execute the
joblib.load('model.pkl') command, but I wanted to confirm what it's doing
internally.

Thanks in advance !

Cheers,

Debu
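Roman's arithmetic quoted earlier in this thread (recall close to 1.0 on the
80% of records used for training, about 0.1 on the held-out 20%, hence
roughly 0.8 over the full set) can be reproduced on synthetic data. The
sketch below is only an illustration under stated assumptions: the dataset
comes from make_classification, the forest uses 50 trees rather than 5000,
and all variable names are made up for the example. It shows that recall on
the full set is the positive-weighted average of the training and test
recalls, so only the test figure says anything about new data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Assumption: synthetic, imbalanced, noisy data standing in for the 600 K records.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1],
                           flip_y=0.05, random_state=0)

# 80% train / 20% test, as in the thread.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

recall_train = recall_score(y_train, clf.predict(X_train))  # seen data
recall_test = recall_score(y_test, clf.predict(X_test))     # unseen data
recall_all = recall_score(y, clf.predict(X))                 # train + test mixed

# Recall over the full set is the average of the two recalls,
# weighted by the number of positives in each part.
pos_train, pos_test = y_train.sum(), y_test.sum()
pooled = (pos_train * recall_train + pos_test * recall_test) / (pos_train + pos_test)

print("train recall:", recall_train)
print("test recall: ", recall_test)
print("full recall: ", recall_all, "(pooled:", pooled, ")")
```

The full-set figure is dominated by the training part, which is why it looks
much better than the test figure without the model being any better.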
From g.lemaitre58 at gmail.com  Tue Dec 27 14:12:39 2016
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Tue, 27 Dec 2016 20:12:39 +0100
Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library
In-Reply-To:
References: <586239AB.80500@gmail.com>
Message-ID:

On 27 December 2016 at 19:38, Debabrata Ghosh wrote:

> Thanks Guillaume for your quick feedback ! Appreciate it a lot.
>
> I will definitely try out the links you have given. Another quick one
> please. My objective is to execute the model without retraining it. Let me
> get you an example here to elaborate this - I train my model on a huge set
> of data (historic 6 months worth of data) and finalise my model. Now going
> forward I need to run my model against smaller set of data (daily data) and
> for that I wouldn't need to retrain my model daily.
>

So you just need to dump the model after training (which is actually what
you did).

>
> Given the above scenario, I wanted to confirm once more whether after
> training the model if I use joblib.dump and then while executing the model
> on daily basis, if I use joblib.load then is this a good approach. I am
> using joblib.dump(clf, 'model.pkl') and for loading , I am using
> joblib.load('model.pkl'). I am not leveraging any of the *.npy files
> generated in the folder.
>

So, you need to train and dump the estimator. To predict with the dumped
model, you need to load it and call predict, predict_proba, etc.

The .npy files are the files associated with your model. In the case of a
random forest, the parameters of each tree have to be kept. With 5000
trees, you should have many .npy files. The training data themselves are
not dumped.
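The dump/load workflow described above can be sketched as follows. This is a
minimal example under assumptions, not the poster's actual pipeline: the
"historic" and "daily" data are synthetic, the forest is small, the file
name model.pkl is the one used in the thread, and joblib is imported as a
standalone package (in 2016 it was commonly imported through
sklearn.externals). The loaded object is the fitted estimator itself, so it
returns the same probabilities as the original without any retraining.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Assumption: synthetic stand-ins for the six months of history and one day of data.
X_hist, y_hist = make_classification(n_samples=500, n_features=8, random_state=0)
X_daily = np.random.RandomState(1).normal(size=(20, 8))

# Train once on the historic data and persist the fitted estimator.
clf = RandomForestClassifier(n_estimators=25, random_state=0)
clf.fit(X_hist, y_hist)
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump(clf, model_path)

# Later (for example in a daily job): load the estimator and predict, no refit.
loaded = joblib.load(model_path)
proba_loaded = loaded.predict_proba(X_daily)

# The loaded forest behaves like the one that was trained.
proba_original = clf.predict_proba(X_daily)
same = np.allclose(proba_loaded, proba_original)
print("identical predictions:", same)
```

Only the fitted trees are written to disk, not the training data. The
scikit-learn documentation advises loading a dump with the same library
version that produced it.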
From mailfordebu at gmail.com  Tue Dec 27 13:47:58 2016
From: mailfordebu at gmail.com (mailfordebu at gmail.com)
Date: Wed, 28 Dec 2016 00:17:58 +0530
Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library
In-Reply-To:
References: <586239AB.80500@gmail.com>
Message-ID:

Hi Guillaume,

And when I say that I have been able to run my models using joblib.load, I
meant that I have run them using joblib.load on a completely different
dataset compared to the one I used for model training. And I got very
similar results from the joblib.load run compared to the output of the
RandomForestClassifier run. Please advise on my last note accordingly.

Cheers,
Debu

From g.lemaitre58 at gmail.com  Tue Dec 27 18:07:46 2016
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Wed, 28 Dec 2016 00:07:46 +0100
Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library
In-Reply-To:
References: <586239AB.80500@gmail.com>
Message-ID:

I am not sure I understand your terminology. When calling joblib.load, you
actually load the RandomForestClassifier. Therefore, calling predict on the
estimator loaded with joblib is identical to using the
RandomForestClassifier which you trained in the first place.

I think it would be much simpler if you could post a (short) snippet to
illustrate your thoughts and avoid confusion.

Cheers,

From gen.tang86 at gmail.com  Wed Dec 28 12:05:15 2016
From: gen.tang86 at gmail.com (gen tang)
Date: Thu, 29 Dec 2016 01:05:15 +0800
Subject: [scikit-learn] A math mistake in spectral_embedding
Message-ID:

Hi, everyone.

I am quite new to this mailing list.
I think that I found a math mistake in the spectral_embedding function, and
I created an issue on GitHub:
https://github.com/scikit-learn/scikit-learn/issues/8129

However, since GitHub can't display mathematical equations, I am sending
you a PDF version of the bug description (attached) by mail.
Can anyone verify this math detail?

Thanks a lot

Cheers
Gen
-------------- next part --------------
A non-text attachment was scrubbed...
Name: spectral_embedding_bug.pdf
Type: application/pdf
Size: 139825 bytes
Desc: not available

From mailfordebu at gmail.com  Wed Dec 28 14:25:16 2016
From: mailfordebu at gmail.com (Debabrata Ghosh)
Date: Thu, 29 Dec 2016 00:55:16 +0530
Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library
In-Reply-To:
References: <586239AB.80500@gmail.com>
Message-ID:

Hi Guillaume,

With respect to the following point you mentioned:

> You can visualize the trees with sklearn.tree.export_graphviz:
> http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html

I couldn't find a direct method for exporting the RandomForestClassifier
trees.
Accordingly, I attempted a workaround using the following code, but still
no success:

    clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1)
    clf.fit(p_features_train, p_labels_train)
    for i, tree in enumerate(clf.estimators_):
        with open('tree_' + str(i) + '.dot', 'w') as dotfile:
            tree.export_graphviz(clf, dotfile)

Would you please be able to help me with the piece of code which I need to
execute for exporting the RandomForestClassifier trees?

Cheers,

Debu
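For reference, a self-contained sketch of exporting every tree of a fitted
forest to a .dot file. export_graphviz is a function in sklearn.tree that
takes a single fitted decision tree as its first argument, so it is called
once per entry of clf.estimators_ rather than as a method on the tree or on
the forest. The synthetic data, the small forest (5 shallow trees) and the
temporary output directory are assumptions made for the example.

```python
import os
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz

# Assumption: small synthetic problem so the example runs quickly.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
clf = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=0)
clf.fit(X, y)

out_dir = tempfile.mkdtemp()
dot_paths = []
# clf.estimators_ holds the individual fitted DecisionTreeClassifier objects.
for idx, tree in enumerate(clf.estimators_):
    path = os.path.join(out_dir, "tree_{}.dot".format(idx))
    export_graphviz(tree, out_file=path)  # one .dot file per tree
    dot_paths.append(path)

print("wrote", len(dot_paths), "files to", out_dir)
```

Each file can then be rendered with Graphviz, for example
`dot -Tpng tree_0.dot -o tree_0.png`. With 5000 trees, exporting only a
handful of them is probably enough to get an impression of the forest.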
From g.lemaitre58 at gmail.com  Wed Dec 28 14:34:43 2016
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Wed, 28 Dec 2016 20:34:43 +0100
Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library
In-Reply-To:
References: <586239AB.80500@gmail.com>
Message-ID:

After the fit, you need this call:

    from sklearn.tree import export_graphviz
    # export each fitted tree of the forest to its own .dot file
    for idx_tree, tree in enumerate(clf.estimators_):
        export_graphviz(tree, out_file='{}.dot'.format(idx_tree))
>>>>> >>>>> >>>>> Roman >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> >> -- >> Guillaume Lemaitre >> INRIA Saclay - Ile-de-France >> Equipe PARIETAL >> guillaume.lemaitre at inria.f r --- >> https://glemaitre.github.io/ >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Guillaume Lemaitre INRIA Saclay - Ile-de-France Equipe PARIETAL guillaume.lemaitre at inria.f r --- https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From mailfordebu at gmail.com Wed Dec 28 23:38:21 2016 From: mailfordebu at gmail.com (Debabrata Ghosh) Date: Thu, 29 Dec 2016 10:08:21 +0530 Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library In-Reply-To: References: <586239AB.80500@gmail.com> Message-ID: Hi Guillaume, Thanks for your feedback ! I am still getting an error, while attempting to print the trees. Here is a snapshot of my code. I know I may be missing something very silly, but still wanted to check and see how this works. 
>>> clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1) >>> clf.fit(p_features_train,p_labels_train) RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=5000, n_jobs=-1, oob_score=False, random_state=None, verbose=0, warm_start=False) >>> for idx_tree, tree in enumerate(clf.estimators_): ... export_graphviz(tree, out_file='{}.dot'.format(idx_tree)) ... Traceback (most recent call last): File "", line 2, in NameError: name 'export_graphviz' is not defined >>> for idx_tree, tree in enumerate(clf.estimators_): ... tree.export_graphviz(tree, out_file='{}.dot'.format(idx_tree)) ... Traceback (most recent call last): File "", line 2, in AttributeError: 'DecisionTreeClassifier' object has no attribute 'export_graphviz' Just to give you a background about the libraries, I have imported the following libraries: from sklearn.ensemble import RandomForestClassifier from sklearn import tree Thanks again as always ! Cheers, On Thu, Dec 29, 2016 at 1:04 AM, Guillaume Lema?tre wrote: > after the fit you need this call: > for idx_tree, tree in enumerate(clf.estimators_): > export_graphviz(tree, out_file='{}.dot'.format(idx_tree)) > > > > On 28 December 2016 at 20:25, Debabrata Ghosh > wrote: > >> Hi Guillaume, >> With respect to the following point you >> mentioned: >> You can visualize the trees with sklearn.tree.export_graphviz: >> http://scikit-learn.org/stable/modules/generated/sklearn.tre >> e.export_graphviz.html >> >> I couldn't find a direct method for exporting the RandomForestClassifier >> trees. 
Accordingly, I attempted for a workaround using the following code >> but still no success: >> >> clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1) >> clf.fit(p_features_train,p_labels_train) >> for i, tree in enumerate(clf.estimators_): >> with open('tree_' + str(i) + '.dot', 'w') as dotfile: >> tree.export_graphviz(clf, dotfile) >> >> Would you please be able to help me with the piece of code which I need >> to execute for exporting the RandomForestClassifier trees. >> >> Cheers, >> >> Debu >> >> >> On Tue, Dec 27, 2016 at 11:18 PM, Guillaume Lema?tre < >> g.lemaitre58 at gmail.com> wrote: >> >>> On 27 December 2016 at 18:17, Debabrata Ghosh >>> wrote: >>> >>>> Dear Joel, Andrew and Roman, >>>> Thank you very >>>> much for your individual feedback ! It's very helpful indeed ! A few more >>>> points related to my model execution: >>>> >>>> 1. By the term "scoring" I meant the process of executing the model >>>> once again without retraining it. So , for training the model I used >>>> RandomForestClassifer library and for my scoring (execution without >>>> retraining) I have used joblib.dump and joblib.load >>>> >>> >>> Go probably with the terms: training, validating, and testing. >>> This is pretty much standard. Scoring is just the value of a >>> metric given some data (training data, validation data, or >>> testing data). >>> >>> >>>> >>>> 2. I have used the parameter n_estimator = 5000 while training my >>>> model. Besides it , I have used n_jobs = -1 and haven't used any other >>>> parameter >>>> >>> >>> You should probably check those other parameters and understand >>> what are their effects. You should really check the link of Roman >>> since GridSearchCV can help you to decide how to fix the parameters. >>> http://scikit-learn.org/stable/modules/generated/sklearn.mod >>> el_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV >>> Additionally, 5000 trees seems a lot to me. >>> >>> >>>> >>>> 3. 
For my "scoring" activity (executing the model without retraining >>>> it) is there an alternate approach to joblib library ? >>>> >>> >>> Joblib only store data. There is not link with scoring (Check Roman >>> answer) >>> >>> >>>> >>>> 4. When I execute my scoring job (joblib method) on a dataset , which >>>> is completely different to my training dataset then I get similar True >>>> Positive Rate and False Positive Rate as of training >>>> >>> >>> It is what you should get. >>> >>> >>>> >>>> 5. However, when I execute my scoring job on the same dataset used for >>>> training my model then I get very high TPR and FPR. >>>> >>> >>> You are testing on some data which you used while training. Probably, >>> one of the first rule is to not do that. If you want to evaluate in some >>> way your classifier, have a separate set (test set) and only test on that >>> one. As previously mentioned by Roman, 80% of your data are already >>> known by the RandomForestClassifier and will be perfectly classified. >>> >>> >>>> >>>> Is there mechanism >>>> through which I can visualise the trees created by my RandomForestClassifer >>>> algorithm ? While I dumped the model using joblib.dump , there are a bunch >>>> of .npy files created. Will those contain the trees ? >>>> >>> >>> You can visualize the trees with sklearn.tree.export_graphviz: >>> http://scikit-learn.org/stable/modules/generated/sklearn.tre >>> e.export_graphviz.html >>> >>> The bunch of npy are the data needed to load the RandomForestClassifier >>> which >>> you previously dumped. >>> >>> >>>> >>>> Thanks in advance ! >>>> >>>> Cheers, >>>> >>>> Debu >>>> >>>> On Tue, Dec 27, 2016 at 4:22 PM, Joel Nothman >>>> wrote: >>>> >>>>> Your model is overfit to the training data. Not to say that it's >>>>> necessarily possible to get a better fit. The default settings for trees >>>>> lean towards a tight fit, so you might modify their parameters to increase >>>>> regularisation. 
Still, you should not expect that evaluating a model's >>>>> performance on its training data will be indicative of its general >>>>> performance. This is why we use held-out test sets and cross-validation. >>>>> >>>>> On 27 December 2016 at 20:51, Roman Yurchak >>>>> wrote: >>>>> >>>>>> Hi Debu, >>>>>> >>>>>> On 27/12/16 08:18, Andrew Howe wrote: >>>>>> > 5. I got a prediction result with True Positive Rate (TPR) as >>>>>> 10-12 >>>>>> > % on probability thresholds above 0.5 >>>>>> >>>>>> Getting a high True Positive Rate (recall) is not a sufficient >>>>>> condition >>>>>> for a well behaved model. Though 0.1 recall is still pretty bad. You >>>>>> could look at the precision at the same time (or consider, for >>>>>> instance, >>>>>> the F1 score). >>>>>> >>>>>> > 7. I reloaded the model in a different python instance from the >>>>>> > pickle file mentioned above and did my scoring , i.e., used >>>>>> > joblib library load method and then instantiated prediction >>>>>> > (predict_proba method) on the entire set of my original 600 >>>>>> K >>>>>> > records >>>>>> > Another question ? is there an alternate model scoring >>>>>> > library (apart from joblib, the one I am using) ? >>>>>> >>>>>> Joblib is not a scoring library; once you load a model from disk with >>>>>> joblib you should get ~ the same RandomForestClassifier estimator >>>>>> object >>>>>> as before saving it. >>>>>> >>>>>> > 8. Now when I am running (scoring) my model using >>>>>> > joblib.predict_proba on the entire set of original data >>>>>> (600 K), >>>>>> > I am getting a True Positive rate of around 80%. >>>>>> >>>>>> That sounds normal, considering what you are doing. Your entire set >>>>>> consists of 80% of training set (for which the recall, I imagine, >>>>>> would >>>>>> be close to 1.0) and 20 % test set (with a recall of 0.1), so on >>>>>> average you would get a recall close to 0.8 for the complete set. >>>>>> Unless >>>>>> I missed something. >>>>>> >>>>>> >>>>>> > 9. 
I did some further analysis and figured out that during the >>>>>> > training process, when the model was predicting on the test >>>>>> > sample of 120K it could only predict 10-12% of 120K data >>>>>> beyond >>>>>> > a probability threshold of 0.5. When I am now trying to >>>>>> score my >>>>>> > model on the entire set of 600 K records, it appears that >>>>>> the >>>>>> > model is remembering some of it?s past behavior and data and >>>>>> > accordingly throwing 80% True positive rate >>>>>> >>>>>> It feels like your RandomForestClassifier is not properly tuned. A >>>>>> recall of 0.1 on the test set is quite low. It could be worth trying >>>>>> to >>>>>> tune it better (cf. https://stackoverflow.com/a/36109706 ), using >>>>>> some >>>>>> other metric than the recall to evaluate the performance. >>>>>> >>>>>> >>>>>> Roman >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> >>> -- >>> Guillaume Lemaitre >>> INRIA Saclay - Ile-de-France >>> Equipe PARIETAL >>> guillaume.lemaitre at inria.f r --- >>> https://glemaitre.github.io/ >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > Guillaume Lemaitre > INRIA Saclay - 
Ile-de-France > Equipe PARIETAL > guillaume.lemaitre at inria.f r --- > https://glemaitre.github.io/ > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From naopon at gmail.com Wed Dec 28 23:50:36 2016 From: naopon at gmail.com (Naoya Kanai) Date: Wed, 28 Dec 2016 20:50:36 -0800 Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library In-Reply-To: References: <586239AB.80500@gmail.com> Message-ID: The ?tree? name is clashing between the sklearn.tree module and the DecisionTreeClassifier objects in the loop. You can change the import to from sklearn.tree import export_graphviz and modify the method call accordingly. ? On Wed, Dec 28, 2016 at 8:38 PM, Debabrata Ghosh wrote: > Hi Guillaume, > Thanks for your feedback ! I am > still getting an error, while attempting to print the trees. Here is a > snapshot of my code. I know I may be missing something very silly, but > still wanted to check and see how this works. > > >>> clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1) > >>> clf.fit(p_features_train,p_labels_train) > RandomForestClassifier(bootstrap=True, class_weight=None, > criterion='gini', > max_depth=None, max_features='auto', max_leaf_nodes=None, > min_samples_leaf=1, min_samples_split=2, > min_weight_fraction_leaf=0.0, n_estimators=5000, n_jobs=-1, > oob_score=False, random_state=None, verbose=0, > warm_start=False) > >>> for idx_tree, tree in enumerate(clf.estimators_): > ... export_graphviz(tree, out_file='{}.dot'.format(idx_tree)) > ... > Traceback (most recent call last): > File "", line 2, in > NameError: name 'export_graphviz' is not defined > >>> for idx_tree, tree in enumerate(clf.estimators_): > ... tree.export_graphviz(tree, out_file='{}.dot'.format(idx_tree)) > ... 
> Traceback (most recent call last): > File "", line 2, in > AttributeError: 'DecisionTreeClassifier' object has no attribute > 'export_graphviz' > > Just to give you a background about the libraries, I have imported the > following libraries: > > from sklearn.ensemble import RandomForestClassifier > from sklearn import tree > > Thanks again as always ! > > Cheers, > > On Thu, Dec 29, 2016 at 1:04 AM, Guillaume Lema?tre < > g.lemaitre58 at gmail.com> wrote: > >> after the fit you need this call: >> for idx_tree, tree in enumerate(clf.estimators_): >> export_graphviz(tree, out_file='{}.dot'.format(idx_tree)) >> >> >> >> On 28 December 2016 at 20:25, Debabrata Ghosh >> wrote: >> >>> Hi Guillaume, >>> With respect to the following point you >>> mentioned: >>> You can visualize the trees with sklearn.tree.export_graphviz: >>> http://scikit-learn.org/stable/modules/generated/sklearn.tre >>> e.export_graphviz.html >>> >>> I couldn't find a direct method for exporting the RandomForestClassifier >>> trees. Accordingly, I attempted for a workaround using the following code >>> but still no success: >>> >>> clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1) >>> clf.fit(p_features_train,p_labels_train) >>> for i, tree in enumerate(clf.estimators_): >>> with open('tree_' + str(i) + '.dot', 'w') as dotfile: >>> tree.export_graphviz(clf, dotfile) >>> >>> Would you please be able to help me with the piece of code which I need >>> to execute for exporting the RandomForestClassifier trees. >>> >>> Cheers, >>> >>> Debu >>> >>> >>> On Tue, Dec 27, 2016 at 11:18 PM, Guillaume Lema?tre < >>> g.lemaitre58 at gmail.com> wrote: >>> >>>> On 27 December 2016 at 18:17, Debabrata Ghosh >>>> wrote: >>>> >>>>> Dear Joel, Andrew and Roman, >>>>> Thank you very >>>>> much for your individual feedback ! It's very helpful indeed ! A few more >>>>> points related to my model execution: >>>>> >>>>> 1. 
By the term "scoring" I meant the process of executing the model >>>>> once again without retraining it. So , for training the model I used >>>>> RandomForestClassifer library and for my scoring (execution without >>>>> retraining) I have used joblib.dump and joblib.load >>>>> >>>> >>>> Go probably with the terms: training, validating, and testing. >>>> This is pretty much standard. Scoring is just the value of a >>>> metric given some data (training data, validation data, or >>>> testing data). >>>> >>>> >>>>> >>>>> 2. I have used the parameter n_estimator = 5000 while training my >>>>> model. Besides it , I have used n_jobs = -1 and haven't used any other >>>>> parameter >>>>> >>>> >>>> You should probably check those other parameters and understand >>>> what are their effects. You should really check the link of Roman >>>> since GridSearchCV can help you to decide how to fix the parameters. >>>> http://scikit-learn.org/stable/modules/generated/sklearn.mod >>>> el_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV >>>> Additionally, 5000 trees seems a lot to me. >>>> >>>> >>>>> >>>>> 3. For my "scoring" activity (executing the model without retraining >>>>> it) is there an alternate approach to joblib library ? >>>>> >>>> >>>> Joblib only store data. There is not link with scoring (Check Roman >>>> answer) >>>> >>>> >>>>> >>>>> 4. When I execute my scoring job (joblib method) on a dataset , which >>>>> is completely different to my training dataset then I get similar True >>>>> Positive Rate and False Positive Rate as of training >>>>> >>>> >>>> It is what you should get. >>>> >>>> >>>>> >>>>> 5. However, when I execute my scoring job on the same dataset used for >>>>> training my model then I get very high TPR and FPR. >>>>> >>>> >>>> You are testing on some data which you used while training. Probably, >>>> one of the first rule is to not do that. 
If you want to evaluate in some >>>> way your classifier, have a separate set (test set) and only test on >>>> that >>>> one. As previously mentioned by Roman, 80% of your data are already >>>> known by the RandomForestClassifier and will be perfectly classified. >>>> >>>> >>>>> >>>>> Is there mechanism >>>>> through which I can visualise the trees created by my RandomForestClassifer >>>>> algorithm ? While I dumped the model using joblib.dump , there are a bunch >>>>> of .npy files created. Will those contain the trees ? >>>>> >>>> >>>> You can visualize the trees with sklearn.tree.export_graphviz: >>>> http://scikit-learn.org/stable/modules/generated/sklearn.tre >>>> e.export_graphviz.html >>>> >>>> The bunch of npy are the data needed to load the RandomForestClassifier >>>> which >>>> you previously dumped. >>>> >>>> >>>>> >>>>> Thanks in advance ! >>>>> >>>>> Cheers, >>>>> >>>>> Debu >>>>> >>>>> On Tue, Dec 27, 2016 at 4:22 PM, Joel Nothman >>>>> wrote: >>>>> >>>>>> Your model is overfit to the training data. Not to say that it's >>>>>> necessarily possible to get a better fit. The default settings for trees >>>>>> lean towards a tight fit, so you might modify their parameters to increase >>>>>> regularisation. Still, you should not expect that evaluating a model's >>>>>> performance on its training data will be indicative of its general >>>>>> performance. This is why we use held-out test sets and cross-validation. >>>>>> >>>>>> On 27 December 2016 at 20:51, Roman Yurchak >>>>>> wrote: >>>>>> >>>>>>> Hi Debu, >>>>>>> >>>>>>> On 27/12/16 08:18, Andrew Howe wrote: >>>>>>> > 5. I got a prediction result with True Positive Rate (TPR) as >>>>>>> 10-12 >>>>>>> > % on probability thresholds above 0.5 >>>>>>> >>>>>>> Getting a high True Positive Rate (recall) is not a sufficient >>>>>>> condition >>>>>>> for a well behaved model. Though 0.1 recall is still pretty bad. 
You >>>>>>> could look at the precision at the same time (or consider, for >>>>>>> instance, >>>>>>> the F1 score). >>>>>>> >>>>>>> > 7. I reloaded the model in a different python instance from >>>>>>> the >>>>>>> > pickle file mentioned above and did my scoring , i.e., used >>>>>>> > joblib library load method and then instantiated prediction >>>>>>> > (predict_proba method) on the entire set of my original >>>>>>> 600 K >>>>>>> > records >>>>>>> > Another question ? is there an alternate model >>>>>>> scoring >>>>>>> > library (apart from joblib, the one I am using) ? >>>>>>> >>>>>>> Joblib is not a scoring library; once you load a model from disk with >>>>>>> joblib you should get ~ the same RandomForestClassifier estimator >>>>>>> object >>>>>>> as before saving it. >>>>>>> >>>>>>> > 8. Now when I am running (scoring) my model using >>>>>>> > joblib.predict_proba on the entire set of original data >>>>>>> (600 K), >>>>>>> > I am getting a True Positive rate of around 80%. >>>>>>> >>>>>>> That sounds normal, considering what you are doing. Your entire set >>>>>>> consists of 80% of training set (for which the recall, I imagine, >>>>>>> would >>>>>>> be close to 1.0) and 20 % test set (with a recall of 0.1), so on >>>>>>> average you would get a recall close to 0.8 for the complete set. >>>>>>> Unless >>>>>>> I missed something. >>>>>>> >>>>>>> >>>>>>> > 9. I did some further analysis and figured out that during >>>>>>> the >>>>>>> > training process, when the model was predicting on the test >>>>>>> > sample of 120K it could only predict 10-12% of 120K data >>>>>>> beyond >>>>>>> > a probability threshold of 0.5. When I am now trying to >>>>>>> score my >>>>>>> > model on the entire set of 600 K records, it appears that >>>>>>> the >>>>>>> > model is remembering some of it?s past behavior and data >>>>>>> and >>>>>>> > accordingly throwing 80% True positive rate >>>>>>> >>>>>>> It feels like your RandomForestClassifier is not properly tuned. 
A >>>>>>> recall of 0.1 on the test set is quite low. It could be worth trying >>>>>>> to >>>>>>> tune it better (cf. https://stackoverflow.com/a/36109706 ), using >>>>>>> some >>>>>>> other metric than the recall to evaluate the performance. >>>>>>> >>>>>>> >>>>>>> Roman >>>>>>> _______________________________________________ >>>>>>> scikit-learn mailing list >>>>>>> scikit-learn at python.org >>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>> >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> >>>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> >>>> -- >>>> Guillaume Lemaitre >>>> INRIA Saclay - Ile-de-France >>>> Equipe PARIETAL >>>> guillaume.lemaitre at inria.f r --- >>>> https://glemaitre.github.io/ >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> >> -- >> Guillaume Lemaitre >> INRIA Saclay - Ile-de-France >> Equipe PARIETAL >> guillaume.lemaitre at inria.f r --- >> https://glemaitre.github.io/ >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... 
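A minimal sketch of the fix described above, under stated assumptions: it uses a small synthetic dataset from make_classification in place of the poster's p_features_train and p_labels_train, a forest of only 3 trees, and a hypothetical helper named export_forest. Only RandomForestClassifier, estimators_, and export_graphviz come from the thread itself.

```python
import os
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz  # import the function directly


def export_forest(forest, out_dir):
    """Write one Graphviz .dot file per fitted tree; return the file paths."""
    paths = []
    # The loop variable is named `estimator`, not `tree`, so it cannot
    # shadow the sklearn.tree module if that module is also imported.
    for idx, estimator in enumerate(forest.estimators_):
        path = os.path.join(out_dir, "tree_{}.dot".format(idx))
        export_graphviz(estimator, out_file=path)
        paths.append(path)
    return paths


# Toy data standing in for p_features_train / p_labels_train (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = RandomForestClassifier(n_estimators=3, random_state=0).fit(X, y)

out_dir = tempfile.mkdtemp()
dot_files = export_forest(clf, out_dir)
print(len(dot_files))
```

If Graphviz is installed, each file can then be rendered from the command line, for example with `dot -Tpng tree_0.dot -o tree_0.png`.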
URL:

From mailfordebu at gmail.com Thu Dec 29 00:00:56 2016
From: mailfordebu at gmail.com (Debabrata Ghosh)
Date: Thu, 29 Dec 2016 10:30:56 +0530
Subject: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library
In-Reply-To:
References: <586239AB.80500@gmail.com>
Message-ID:

Thanks Naoya! This has worked and I am able to generate the .dot files.

Cheers,
Debu

On Thu, Dec 29, 2016 at 10:20 AM, Naoya Kanai wrote:

> The "tree" name is clashing between the sklearn.tree module and the
> DecisionTreeClassifier objects in the loop.
>
> You can change the import to
>
> from sklearn.tree import export_graphviz
>
> and modify the call accordingly.
>
> On Wed, Dec 28, 2016 at 8:38 PM, Debabrata Ghosh wrote:
>
>> Hi Guillaume,
>> Thanks for your feedback! I am still getting an error while attempting
>> to print the trees. Here is a snapshot of my code. I know I may be
>> missing something very silly, but still wanted to check and see how
>> this works.
>>
>> >>> clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1)
>> >>> clf.fit(p_features_train,p_labels_train)
>> RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
>>             max_depth=None, max_features='auto', max_leaf_nodes=None,
>>             min_samples_leaf=1, min_samples_split=2,
>>             min_weight_fraction_leaf=0.0, n_estimators=5000, n_jobs=-1,
>>             oob_score=False, random_state=None, verbose=0,
>>             warm_start=False)
>> >>> for idx_tree, tree in enumerate(clf.estimators_):
>> ...     export_graphviz(tree, out_file='{}.dot'.format(idx_tree))
>> ...
>> Traceback (most recent call last):
>>   File "<stdin>", line 2, in <module>
>> NameError: name 'export_graphviz' is not defined
>> >>> for idx_tree, tree in enumerate(clf.estimators_):
>> ...     tree.export_graphviz(tree, out_file='{}.dot'.format(idx_tree))
>> ...
>> Traceback (most recent call last):
>>   File "<stdin>", line 2, in <module>
>> AttributeError: 'DecisionTreeClassifier' object has no attribute 'export_graphviz'
>>
>> Just to give you a background about the libraries, I have imported the
>> following libraries:
>>
>> from sklearn.ensemble import RandomForestClassifier
>> from sklearn import tree
>>
>> Thanks again as always!
>>
>> Cheers,
>>
>> On Thu, Dec 29, 2016 at 1:04 AM, Guillaume Lemaître
>> <g.lemaitre58 at gmail.com> wrote:
>>
>>> After the fit, you need this call:
>>>
>>> for idx_tree, tree in enumerate(clf.estimators_):
>>>     export_graphviz(tree, out_file='{}.dot'.format(idx_tree))
>>>
>>> On 28 December 2016 at 20:25, Debabrata Ghosh wrote:
>>>
>>>> Hi Guillaume,
>>>> With respect to the following point you mentioned:
>>>> You can visualize the trees with sklearn.tree.export_graphviz:
>>>> http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
>>>>
>>>> I couldn't find a direct method for exporting the
>>>> RandomForestClassifier trees. Accordingly, I attempted a workaround
>>>> using the following code, but still no success:
>>>>
>>>> clf = RandomForestClassifier(n_estimators=5000, n_jobs=-1)
>>>> clf.fit(p_features_train,p_labels_train)
>>>> for i, tree in enumerate(clf.estimators_):
>>>>     with open('tree_' + str(i) + '.dot', 'w') as dotfile:
>>>>         tree.export_graphviz(clf, dotfile)
>>>>
>>>> Would you please be able to help me with the piece of code which I need
>>>> to execute for exporting the RandomForestClassifier trees.
>>>>
>>>> Cheers,
>>>>
>>>> Debu
>>>>
>>>>
>>>> On Tue, Dec 27, 2016 at 11:18 PM, Guillaume Lemaître
>>>> <g.lemaitre58 at gmail.com> wrote:
>>>>
>>>>> On 27 December 2016 at 18:17, Debabrata Ghosh wrote:
>>>>>
>>>>>> Dear Joel, Andrew and Roman,
>>>>>> Thank you very much for your individual feedback! It's very helpful
>>>>>> indeed! A few more points related to my model execution:
>>>>>>
>>>>>> 1. By the term "scoring" I meant the process of executing the model
>>>>>> once again without retraining it. So, for training the model I used
>>>>>> the RandomForestClassifier library and for my scoring (execution
>>>>>> without retraining) I have used joblib.dump and joblib.load
>>>>>>
>>>>>
>>>>> Go probably with the terms: training, validating, and testing.
>>>>> This is pretty much standard. Scoring is just the value of a
>>>>> metric given some data (training data, validation data, or
>>>>> testing data).
>>>>>
>>>>>>
>>>>>> 2. I have used the parameter n_estimators = 5000 while training my
>>>>>> model. Besides it, I have used n_jobs = -1 and haven't used any other
>>>>>> parameter
>>>>>>
>>>>>
>>>>> You should probably check those other parameters and understand
>>>>> what their effects are. You should really check Roman's link,
>>>>> since GridSearchCV can help you decide how to set the parameters.
>>>>> http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
>>>>> Additionally, 5000 trees seems a lot to me.
>>>>>
>>>>>>
>>>>>> 3. For my "scoring" activity (executing the model without retraining
>>>>>> it) is there an alternate approach to joblib library?
>>>>>>
>>>>>
>>>>> Joblib only stores data. There is no link with scoring (check Roman's
>>>>> answer).
>>>>>
>>>>>>
>>>>>> 4. When I execute my scoring job (joblib method) on a dataset, which
>>>>>> is completely different to my training dataset then I get similar True
>>>>>> Positive Rate and False Positive Rate as of training
>>>>>>
>>>>>
>>>>> It is what you should get.
>>>>>
>>>>>>
>>>>>> 5. However, when I execute my scoring job on the same dataset used
>>>>>> for training my model then I get very high TPR and FPR.
>>>>>>
>>>>>
>>>>> You are testing on some data which you used while training. Probably,
>>>>> one of the first rules is not to do that. If you want to evaluate in
>>>>> some way your classifier, have a separate set (test set) and only test
>>>>> on that one. As previously mentioned by Roman, 80% of your data are
>>>>> already known by the RandomForestClassifier and will be perfectly
>>>>> classified.
>>>>>
>>>>>>
>>>>>> Is there a mechanism through which I can visualise the trees created
>>>>>> by my RandomForestClassifier algorithm? While I dumped the model using
>>>>>> joblib.dump, there are a bunch of .npy files created. Will those
>>>>>> contain the trees?
>>>>>>
>>>>>
>>>>> You can visualize the trees with sklearn.tree.export_graphviz:
>>>>> http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
>>>>>
>>>>> The .npy files are the data needed to load the
>>>>> RandomForestClassifier which you previously dumped.
>>>>>
>>>>>>
>>>>>> Thanks in advance!
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Debu
>>>>>>
>>>>>> On Tue, Dec 27, 2016 at 4:22 PM, Joel Nothman wrote:
>>>>>>
>>>>>>> Your model is overfit to the training data. Not to say that it's
>>>>>>> necessarily possible to get a better fit. The default settings for
>>>>>>> trees lean towards a tight fit, so you might modify their parameters
>>>>>>> to increase regularisation. Still, you should not expect that
>>>>>>> evaluating a model's performance on its training data will be
>>>>>>> indicative of its general performance. This is why we use held-out
>>>>>>> test sets and cross-validation.
>>>>>>>
>>>>>>> On 27 December 2016 at 20:51, Roman Yurchak wrote:
>>>>>>>
>>>>>>>> Hi Debu,
>>>>>>>>
>>>>>>>> On 27/12/16 08:18, Andrew Howe wrote:
>>>>>>>> > 5. I got a prediction result with True Positive Rate (TPR) as
>>>>>>>> > 10-12 % on probability thresholds above 0.5
>>>>>>>>
>>>>>>>> Getting a high True Positive Rate (recall) is not a sufficient
>>>>>>>> condition for a well behaved model. Though 0.1 recall is still
>>>>>>>> pretty bad. You could look at the precision at the same time (or
>>>>>>>> consider, for instance, the F1 score).
>>>>>>>>
>>>>>>>> > 7. I reloaded the model in a different python instance from the
>>>>>>>> > pickle file mentioned above and did my scoring, i.e., used
>>>>>>>> > joblib library load method and then instantiated prediction
>>>>>>>> > (predict_proba method) on the entire set of my original 600 K
>>>>>>>> > records
>>>>>>>> > Another question - is there an alternate model scoring
>>>>>>>> > library (apart from joblib, the one I am using)?
>>>>>>>>
>>>>>>>> Joblib is not a scoring library; once you load a model from disk
>>>>>>>> with joblib you should get ~ the same RandomForestClassifier
>>>>>>>> estimator object as before saving it.
>>>>>>>>
>>>>>>>> > 8. Now when I am running (scoring) my model using
>>>>>>>> > joblib.predict_proba on the entire set of original data (600 K),
>>>>>>>> > I am getting a True Positive rate of around 80%.
>>>>>>>>
>>>>>>>> That sounds normal, considering what you are doing. Your entire set
>>>>>>>> consists of 80% of training set (for which the recall, I imagine,
>>>>>>>> would be close to 1.0) and 20 % test set (with a recall of 0.1), so
>>>>>>>> on average you would get a recall close to 0.8 for the complete set.
>>>>>>>> Unless I missed something.
>>>>>>>>
>>>>>>>> > 9. I did some further analysis and figured out that during the
>>>>>>>> > training process, when the model was predicting on the test
>>>>>>>> > sample of 120K it could only predict 10-12% of 120K data beyond
>>>>>>>> > a probability threshold of 0.5.
When I am now trying to >>>>>>>> score my >>>>>>>> > model on the entire set of 600 K records, it appears that >>>>>>>> the >>>>>>>> > model is remembering some of it?s past behavior and data >>>>>>>> and >>>>>>>> > accordingly throwing 80% True positive rate >>>>>>>> >>>>>>>> It feels like your RandomForestClassifier is not properly tuned. A >>>>>>>> recall of 0.1 on the test set is quite low. It could be worth >>>>>>>> trying to >>>>>>>> tune it better (cf. https://stackoverflow.com/a/36109706 ), using >>>>>>>> some >>>>>>>> other metric than the recall to evaluate the performance. >>>>>>>> >>>>>>>> >>>>>>>> Roman >>>>>>>> _______________________________________________ >>>>>>>> scikit-learn mailing list >>>>>>>> scikit-learn at python.org >>>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>>> >>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> scikit-learn mailing list >>>>>>> scikit-learn at python.org >>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>> >>>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> scikit-learn mailing list >>>>>> scikit-learn at python.org >>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> Guillaume Lemaitre >>>>> INRIA Saclay - Ile-de-France >>>>> Equipe PARIETAL >>>>> guillaume.lemaitre at inria.f r --- >>>>> https://glemaitre.github.io/ >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> >>> -- >>> Guillaume Lemaitre >>> INRIA Saclay - Ile-de-France >>> Equipe PARIETAL >>> guillaume.lemaitre at inria.f r --- >>> https://glemaitre.github.io/ >>> >>> 
From greg315 at hotmail.fr Thu Dec 29 09:00:39 2016
From: greg315 at hotmail.fr (greg g)
Date: Thu, 29 Dec 2016 14:00:39 +0000
Subject: [scikit-learn] numpy.amin behaviour with multidimensionnal arrays
Message-ID:

Hi,

I would like to understand the behaviour of the scipy.spatial.kdtree class,
which uses the numpy.amin function.

In the numpy.amin description, we find that it returns the "minimum value
along a given axis".

What does that mean exactly?

Thanks for any help

Gregory

From jmschreiber91 at gmail.com Thu Dec 29 14:22:45 2016
From: jmschreiber91 at gmail.com (Jacob Schreiber)
Date: Thu, 29 Dec 2016 11:22:45 -0800
Subject: [scikit-learn] numpy.amin behaviour with multidimensionnal arrays
In-Reply-To:
References:
Message-ID:

It means that instead of returning the minimum value anywhere in the entire
matrix, it will return the minimum value for each column or each row,
depending on which axis you put in, so a vector instead of a scalar.

From t3kcit at gmail.com Thu Dec 29 15:03:41 2016
From: t3kcit at gmail.com (Andy)
Date: Thu, 29 Dec 2016 15:03:41 -0500
Subject: [scikit-learn] Bookmarklet to view documentation on CircleCI
In-Reply-To:
References: <20161221223338.GA334300@phare.normalesup.org>
Message-ID: <33ad4356-42f7-264a-1e94-1548649a9bc2@gmail.com>

On 12/21/2016 07:48 PM, Joel Nothman wrote:
> Well, you can as a browser extension. I just haven't bothered to
> investigate that technology when there's so much code to review and write.
>
Can you post it to the docs, or maybe more appropriately to the wiki, where
it's easier to discover and link to?

From greg315 at hotmail.fr Thu Dec 29 17:06:50 2016
From: greg315 at hotmail.fr (greg g)
Date: Thu, 29 Dec 2016 22:06:50 +0000
Subject: [scikit-learn] numpy.amin behaviour with multidimensionnal arrays
In-Reply-To:
References: ,
Message-ID:

Thanks.

Is this NumPy-specific terminology?

For a multidimensional array with dimension n and size l1 x l2 x ... x ln,
does "along axis=0" mean that l2 x ... x ln operations are performed,
scanning the first dimension, each operation on l1 elements, and that an
array with dimension n-1 and size l2 x ... x ln containing the results is
returned?

(In the end, I'm not sure this sentence really clarifies things ... ;-) )

From jni.soma at gmail.com Thu Dec 29 17:32:45 2016
From: jni.soma at gmail.com (Juan Nunez-Iglesias)
Date: Fri, 30 Dec 2016 09:32:45 +1100
Subject: [scikit-learn] numpy.amin behaviour with multidimensionnal arrays
In-Reply-To:
References:
Message-ID: <8072f094-fff9-48a2-8184-10173ba4d51c@Spark>

Hi Greg,

I don't know how specific it is to NumPy, but that's definitely the correct
way to talk about it in NumPy, and your understanding in your example is
spot-on. This is true of many NumPy functions.

Juan.

From greg315 at hotmail.fr Fri Dec 30 04:56:43 2016
From: greg315 at hotmail.fr (greg g)
Date: Fri, 30 Dec 2016 09:56:43 +0000
Subject: [scikit-learn] Your machine learning practical applications ?
Message-ID:

Hi,

Beginning with sklearn, I'm interested in practical applications of this
library and applications of machine learning in general.

I would be grateful to hear 'real stories' about ML and eventually share
them in an upcoming non-technical blog.

Thanks if you can share some cases ...

Gregory
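Editorial note on the numpy.amin thread above: the "along an axis" reading
discussed there can be checked with a small sketch. This is only an
illustration and not code from the thread; the array `a` and its shape
(2, 3, 4) are made up, standing in for the l1 x l2 x l3 array in the question.

```python
import numpy as np

# Hypothetical 3-D array with shape (l1, l2, l3) = (2, 3, 4).
a = np.arange(24).reshape(2, 3, 4)

# No axis given: a single scalar, the minimum over every element.
overall_min = np.amin(a)

# axis=0: for each of the l2 x l3 = 12 positions, take the minimum over the
# l1 = 2 elements met when moving along the first dimension.
# The result has one dimension fewer, with shape (l2, l3) = (3, 4).
min_axis0 = np.amin(a, axis=0)

# axis=1 collapses the second dimension instead, leaving shape (2, 4).
min_axis1 = np.amin(a, axis=1)

print(overall_min)
print(min_axis0.shape)
print(min_axis1.shape)
```

The reduced axis disappears from the result's shape, which matches the
description in the thread: an array with dimension n-1 is returned.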